[DRBD-user] Mirrored partition locked with network jitter

Fri Nov 15 11:05:02 CET 2013

Hi all,

We are trying to use DRBD over a WAN. To make the link look like a LAN, we run an L2TP tunnel between the primary and secondary nodes.

The current setup uses a WAN emulator on the tunnel to emulate some network constraints.

Incremental synchronization works, but we run into problems when we trigger a full synchronization with:

> drbdadm invalidate r0

When we add only delay to the traffic (up to 60 ms), everything works fine. But as soon as we add jitter, even as little as 2 ms, the mirrored partition locks up and stops responding to monitoring after a few seconds.

The system then tries to force a switch-over, but this sometimes fails, and we have to wait for the full synchronization to finish.

We use DRBD 8.3.11 with a 3.2.0-49 kernel (Ubuntu 12.04).

Do you have any pointers?

Thanks a lot.

Jérôme

PS:

Here is our DRBD configuration:

global {
        usage-count no;
}

common {
        protocol B;

        handlers {
                initial-split-brain "/p25/bin/drbd-notify-initial-split-brain.sh";
                split-brain "/p25/bin/drbd-notify-split-brain.sh            ; /p25/bin/drbd-notify-emergency-reboot.sh  ; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/p25/bin/drbd-notify-pri-lost-after-sb.sh; /p25/bin/drbd-notify-emergency-reboot.sh  ; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-on-incon-degr "/p25/bin/drbd-notify-pri-on-incon-degr.sh; /p25/bin/drbd-notify-emergency-reboot.sh  ; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost "/p25/bin/drbd-notify-pri-lost.sh                  ; /p25/bin/drbd-notify-emergency-reboot.sh  ; echo b > /proc/sysrq-trigger ; reboot -f";
                out-of-sync "/p25/bin/drbd-notify-out-of-sync.sh            ; /p25/bin/drbd-notify-emergency-reboot.sh  ; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/p25/bin/drbd-notify-io-error.sh            ; /p25/bin/drbd-notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        }

        startup {
        }

        disk {
        }

        net {
                after-sb-0pri discard-least-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri call-pri-lost-after-sb;
                rr-conflict call-pri-lost;
                max-buffers 8000;
                max-epoch-size 8000;
                sndbuf-size 0;
        }

        syncer {
                rate 10M;
                verify-alg sha1;
        }
}

I don't see any DRBD errors in the logs:

Full sync starting:

Nov  8 15:49:08 localhost kernel: [88504.170917] block drbd0: conn( Connected -> StartingSyncS ) pdsk( UpToDate -> Consistent )

Nov  8 15:49:08 localhost kernel: [88504.186884] block drbd0: bitmap WRITE of 4 pages took 0 jiffies

Nov  8 15:49:08 localhost kernel: [88504.191755] block drbd0: 510 MB (130515 bits) marked out-of-sync by on disk bit-map.

Nov  8 15:49:08 localhost kernel: [88504.198156] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0

Nov  8 15:49:08 localhost kernel: [88504.201237] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)

Nov  8 15:49:08 localhost kernel: [88504.201248] block drbd0: conn( StartingSyncS -> SyncSource ) pdsk( Consistent -> Inconsistent )

Nov  8 15:49:08 localhost kernel: [88504.201256] block drbd0: Began resync as SyncSource (will sync 522060 KB [130515 bits set]).

Nov  8 15:49:08 localhost kernel: [88504.209902] block drbd0: updated sync UUID 0AB4A06E64296649:0001000000000000:0001000000000001:5662C47AC0552870
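
As a sanity check on the figures in these lines: the bit count, the KB figure and the MB figure agree with each other if each bitmap bit covers 4 KiB, which is DRBD's bitmap granularity. A minimal sketch in Python (the variable names are only for illustration):

```python
# Figures copied from the kernel log lines above.
BITS_OUT_OF_SYNC = 130515
KIB_PER_BIT = 4  # DRBD tracks out-of-sync data in 4 KiB blocks, one per bitmap bit

resync_kib = BITS_OUT_OF_SYNC * KIB_PER_BIT
resync_mib = resync_kib / 1024

print(f"{resync_kib} KB to sync, about {resync_mib:.0f} MB")
# prints: 522060 KB to sync, about 510 MB
```

So the invalidate does mark the whole device (about 510 MB) for resynchronization, as expected.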

And stopping:

Nov  8 15:52:21 localhost kernel: [88697.310502] block drbd0: Resync done (total 193 sec; paused 0 sec; 2704 K/sec)

Nov  8 15:52:21 localhost kernel: [88697.310513] block drbd0: updated UUIDs 0AB4A06E64296649:0000000000000000:0001000000000000:0001000000000001

Nov  8 15:52:21 localhost kernel: [88697.310524] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )

Nov  8 15:52:21 localhost kernel: [88697.365073] block drbd0: bitmap WRITE of 0 pages took 0 jiffies

Nov  8 15:52:21 localhost kernel: [88697.366858] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
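
To put the logged resync speed in context, here is a rough comparison with the configured syncer rate. This is only a sketch: it assumes that `rate 10M` means 10 MiB/s, and it says nothing about what limits the throughput (the emulator settings are not part of this calculation).

```python
# Figures copied from the "Began resync" and "Resync done" log lines.
RESYNC_KIB = 522060
RESYNC_SECONDS = 193
CONFIGURED_RATE_KIB_S = 10 * 1024  # "rate 10M" in the syncer section, read as 10 MiB/s

# Integer division reproduces the 2704 K/sec reported in the log.
achieved_kib_s = RESYNC_KIB // RESYNC_SECONDS
fraction = achieved_kib_s / CONFIGURED_RATE_KIB_S

print(f"achieved {achieved_kib_s} K/sec, {fraction:.0%} of the configured rate")
# prints: achieved 2704 K/sec, 26% of the configured rate
```

With jitter enabled, the full synchronization therefore ran at roughly a quarter of the configured rate.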

But we do see filesystem (mirroredFS) and other resource errors:

Nov  8 15:49:12 localhost pengine: [964]: notice: unpack_rsc_op: Ignoring expired failure mirroredFS_monitor_15000 (rc=-2, magic=2:-2;14:293:0:e71d3650-2904-430b-90ce-db6f7cdd8d0e) on 19a21328-51e2-4130-bc85-c7e779598bf4

Nov  8 15:49:12 localhost pengine: [964]: WARN: unpack_rsc_op: Processing failed op bcss_last_failure_0 on 19a21328-51e2-4130-bc85-c7e779598bf4: unknown error (1)

Nov  8 15:49:58 localhost crmd: [965]: ERROR: process_lrm_event: LRM operation mirroredFS_monitor_15000 (61) Timed Out (timeout=40000ms)

Nov  8 15:49:58 localhost crmd: [965]: info: process_graph_event: Detected action mirroredFS_monitor_15000 from a different transition: 293 vs. 1510

Nov  8 15:49:58 localhost pengine: [964]: WARN: unpack_rsc_op: Processing failed op bcss_last_failure_0 on 19a21328-51e2-4130-bc85-c7e779598bf4: unknown error (1)

Nov  8 15:49:58 localhost attrd: [963]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-mirroredFS (1383925798)

Nov  8 15:50:58 localhost crmd: [965]: info: send_direct_ack: ACK'ing resource op bcss_fail_60000 from 0:0:crm-resource-27411: lrm_invoke-lrmd-1383925858-1701

Nov  8 15:50:58 localhost crmd: [965]: info: process_lrm_event: LRM operation bcss_asyncmon_0 (call=70, rc=1, cib-update=1677, confirmed=false) unknown error

Nov  8 15:50:58 localhost crmd: [965]: ERROR: process_graph_event: Action bcss_asyncmon_0 (0:1;70:-1:0:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx) initiated outside of a transition

Nov  8 15:50:58 localhost crmd: [965]: info: abort_transition_graph: process_graph_event:474 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=bcss_last_failure_0, magic=0:1;70:-1:0:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, cib=0.323.426) : Unexpected event

Nov  8 15:50:58 localhost crmd: [965]: WARN: update_failcount: Updating failcount for bcss on 19a21328-51e2-4130-bc85-c7e779598bf4 after failed asyncmon: rc=1 (update=value++, time=1383925858)
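
The timestamps above give a rough idea of how quickly the filesystem stops responding once the resync starts. The sketch below assumes that the 40000 ms timeout is counted from the moment the monitor operation was started, and it only has the one-second resolution of the syslog timestamps:

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"
resync_start = datetime.strptime("15:49:08", FMT)       # "Began resync as SyncSource"
monitor_timed_out = datetime.strptime("15:49:58", FMT)  # crmd "Timed Out" line
MONITOR_TIMEOUT = timedelta(milliseconds=40000)         # timeout=40000ms in that line

# A monitor that timed out after 40 s must have been started 40 s earlier.
monitor_started = monitor_timed_out - MONITOR_TIMEOUT
stall_after = monitor_started - resync_start

print(f"monitor started at {monitor_started.strftime(FMT)}, "
      f"{stall_after.total_seconds():.0f} s after the resync began")
# prints: monitor started at 15:49:18, 10 s after the resync began
```

Under that assumption, a monitor call issued about 10 seconds into the resync never returned, which matches the "locked after a few seconds" behaviour described at the top.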
