Hi all,

We are trying to use DRBD over a WAN. To simulate a LAN, we use an L2TP tunnel between the primary and secondary nodes. The current setup uses a WAN emulator on the tunnel to emulate some network constraints.

Incremental synchronization works, but we have issues when we trigger a full synchronization with:

> drbdadm invalidate r0

When we add some delay to the traffic (up to 60 ms), everything works fine. As soon as we add some jitter, even as little as 2 ms, the mirrored partition locks up and stops responding to monitoring after a few seconds. The system tries to force a switch-over, but that sometimes fails, and we have to wait for the full synchronization to finish.

We use DRBD 8.3.11 with a 3.2.0-49 kernel (Ubuntu 12.04).

Do you have any pointers?

Thanks a lot.

Jérôme

PS: Here is our DRBD configuration:

global {
    usage-count no;
}
common {
    protocol B;
    handlers {
        initial-split-brain "/p25/bin/drbd-notify-initial-split-brain.sh";
        split-brain "/p25/bin/drbd-notify-split-brain.sh ; /p25/bin/drbd-notify-emergency-reboot.sh ; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/p25/bin/drbd-notify-pri-lost-after-sb.sh; /p25/bin/drbd-notify-emergency-reboot.sh ; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-on-incon-degr "/p25/bin/drbd-notify-pri-on-incon-degr.sh; /p25/bin/drbd-notify-emergency-reboot.sh ; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost "/p25/bin/drbd-notify-pri-lost.sh ; /p25/bin/drbd-notify-emergency-reboot.sh ; echo b > /proc/sysrq-trigger ; reboot -f";
        out-of-sync "/p25/bin/drbd-notify-out-of-sync.sh ; /p25/bin/drbd-notify-emergency-reboot.sh ; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/p25/bin/drbd-notify-io-error.sh ; /p25/bin/drbd-notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    }
    startup {
    }
    disk {
    }
    net {
        after-sb-0pri discard-least-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri call-pri-lost-after-sb;
        rr-conflict call-pri-lost;
        max-buffers 8000;
        max-epoch-size 8000;
        sndbuf-size 0;
    }
    syncer {
        rate 10M;
        verify-alg sha1;
    }
}

I don't see any DRBD errors in the logs.

Full sync starting:

Nov 8 15:49:08 localhost kernel: [88504.170917] block drbd0: conn( Connected -> StartingSyncS ) pdsk( UpToDate -> Consistent )
Nov 8 15:49:08 localhost kernel: [88504.186884] block drbd0: bitmap WRITE of 4 pages took 0 jiffies
Nov 8 15:49:08 localhost kernel: [88504.191755] block drbd0: 510 MB (130515 bits) marked out-of-sync by on disk bit-map.
Nov 8 15:49:08 localhost kernel: [88504.198156] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
Nov 8 15:49:08 localhost kernel: [88504.201237] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Nov 8 15:49:08 localhost kernel: [88504.201248] block drbd0: conn( StartingSyncS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Nov 8 15:49:08 localhost kernel: [88504.201256] block drbd0: Began resync as SyncSource (will sync 522060 KB [130515 bits set]).
Nov 8 15:49:08 localhost kernel: [88504.209902] block drbd0: updated sync UUID 0AB4A06E64296649:0001000000000000:0001000000000001:5662C47AC0552870

And stopping:

Nov 8 15:52:21 localhost kernel: [88697.310502] block drbd0: Resync done (total 193 sec; paused 0 sec; 2704 K/sec)
Nov 8 15:52:21 localhost kernel: [88697.310513] block drbd0: updated UUIDs 0AB4A06E64296649:0000000000000000:0001000000000000:0001000000000001
Nov 8 15:52:21 localhost kernel: [88697.310524] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Nov 8 15:52:21 localhost kernel: [88697.365073] block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Nov 8 15:52:21 localhost kernel: [88697.366858] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
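
The figures in the resync log lines above can be cross-checked with a little arithmetic. The Python sketch below is only a sanity check, not anything DRBD itself runs. It assumes that the reported average rate is computed with integer division and that "rate 10M" means 10 MiB/s (10240 KB/s). The variable names are illustrative.

```python
# Values copied from the kernel log lines and the syncer section above.
bits_out_of_sync = 130515   # "130515 bits set"
resync_kb = 522060          # "will sync 522060 KB"
resync_seconds = 193        # "total 193 sec"

# Assumption: "rate 10M" is interpreted as 10 MiB/s, i.e. 10240 KB/s.
configured_rate_kb = 10 * 1024

# Each out-of-sync bit should cover a fixed amount of data.
kb_per_bit = resync_kb // bits_out_of_sync

# Average throughput, assuming the log value is an integer division.
avg_rate_kb = resync_kb // resync_seconds

# Fraction of the configured sync rate that was actually reached.
fraction_of_configured = avg_rate_kb / configured_rate_kb

print(f"{kb_per_bit} KB per bit")
print(f"average rate: {avg_rate_kb} K/sec")
print(f"fraction of configured rate: {fraction_of_configured:.2f}")
```

Under those assumptions the numbers agree with the log (4 KB per bit, 2704 K/sec), and the resync ran at roughly a quarter of the configured rate.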
But FS and other errors:

Nov 8 15:49:12 localhost pengine: [964]: notice: unpack_rsc_op: Ignoring expired failure mirroredFS_monitor_15000 (rc=-2, magic=2:-2;14:293:0:e71d3650-2904-430b-90ce-db6f7cdd8d0e) on 19a21328-51e2-4130-bc85-c7e779598bf4
Nov 8 15:49:12 localhost pengine: [964]: WARN: unpack_rsc_op: Processing failed op bcss_last_failure_0 on 19a21328-51e2-4130-bc85-c7e779598bf4: unknown error (1)
Nov 8 15:49:58 localhost crmd: [965]: ERROR: process_lrm_event: LRM operation mirroredFS_monitor_15000 (61) Timed Out (timeout=40000ms)
Nov 8 15:49:58 localhost crmd: [965]: info: process_graph_event: Detected action mirroredFS_monitor_15000 from a different transition: 293 vs. 1510
Nov 8 15:49:58 localhost pengine: [964]: WARN: unpack_rsc_op: Processing failed op bcss_last_failure_0 on 19a21328-51e2-4130-bc85-c7e779598bf4: unknown error (1)
Nov 8 15:49:58 localhost attrd: [963]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-mirroredFS (1383925798)
Nov 8 15:50:58 localhost crmd: [965]: info: send_direct_ack: ACK'ing resource op bcss_fail_60000 from 0:0:crm-resource-27411: lrm_invoke-lrmd-1383925858-1701
Nov 8 15:50:58 localhost crmd: [965]: info: process_lrm_event: LRM operation bcss_asyncmon_0 (call=70, rc=1, cib-update=1677, confirmed=false) unknown error
Nov 8 15:50:58 localhost crmd: [965]: ERROR: process_graph_event: Action bcss_asyncmon_0 (0:1;70:-1:0:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx) initiated outside of a transition
Nov 8 15:50:58 localhost crmd: [965]: info: abort_transition_graph: process_graph_event:474 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=bcss_last_failure_0, magic=0:1;70:-1:0:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, cib=0.323.426) : Unexpected event
Nov 8 15:50:58 localhost crmd: [965]: WARN: update_failcount: Updating failcount for bcss on 19a21328-51e2-4130-bc85-c7e779598bf4 after failed asyncmon: rc=1 (update=value++, time=1383925858)
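
To relate the monitor timeout to the resync window, the timestamps from the two log excerpts can be lined up. The Python sketch below is only a rough reconstruction. It assumes the "Timed Out" line is logged at the moment the 40000 ms timeout expires, so that the monitor operation started about 40 seconds earlier. The variable names are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"

# Timestamps copied from the log excerpts (all on Nov 8).
resync_start = datetime.strptime("15:49:08", FMT)
resync_end = datetime.strptime("15:52:21", FMT)
timeout_logged = datetime.strptime("15:49:58", FMT)

# "Timed Out (timeout=40000ms)" for mirroredFS_monitor_15000.
monitor_timeout = timedelta(milliseconds=40000)

# Assumption: the timeout message appears right when the timeout expires.
monitor_started = timeout_logged - monitor_timeout

# How long after the resync began did the stuck monitor call start?
delay_after_resync_start = (monitor_started - resync_start).total_seconds()

# Length of the resync window, for comparison with "total 193 sec".
resync_duration = (resync_end - resync_start).total_seconds()

# Did the stuck monitor call begin while the resync was running?
monitor_inside_resync = resync_start <= monitor_started <= resync_end

print(f"monitor started at {monitor_started.strftime(FMT)}")
print(f"{delay_after_resync_start:.0f} s after resync start")
print(f"resync lasted {resync_duration:.0f} s")
print(f"monitor call began during resync: {monitor_inside_resync}")
```

Under that assumption, the monitor call that timed out began about 10 seconds after the full synchronization started, which matches the "after a few seconds" observation.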