[DRBD-user] DRBD device stalled after reconnection

Maros Timko timkom at gmail.com
Thu Feb 5 15:25:54 CET 2009


Hi all,

we are running Xen VMs on top of DRBD; the DRBD resources are defined on top
of LVM volumes. We use 64-bit CentOS 5.2 (2.6.18-92.1.22.el5xen). Previously
we were testing the setup with the DRBD RPMs from the CentOS distribution
(8.2.6-3), but we hit an issue: if a Xen VM is still running on top of a
device while the DRBD communication path is broken for some time (for simple
tests we just removed the dedicated crossover cable), the device stalls with
the sync progress at 100% after reconnection. This was easily reproducible,
and the more changes occurred on the device while disconnected, the higher
the probability of a stall. We serialize resyncs (using the "after" config
option), so for us this means that all the devices that sync after the
stalled one are stuck in PausedSync states with inconsistent data.
Reconnecting this device solves the issue; however, there is no handler for
such situations, and the device itself looks happy (still syncing, although
at 100%).
So we upgraded to DRBD 8.2.7 (GIT-hash:
61b7f4c2fc34fe3d2acf7be6bcc1fc2684708a7d), and at first it seemed that this
release had fixed the issue. However, we still experience it, although less
often, and the behaviour is different: the device stalls at e.g. 25% and
then the number decreases. I think this is because new changes are still
coming in, so the statistics update produces such results.
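
Since there is no handler for this situation, one option would be an
external check that compares two snapshots of /proc/drbd and flags devices
that are actively syncing but whose percentage has not increased. The
following is only a sketch under assumptions: the sample text is made up
for illustration (it is not taken from our hosts), the regular expressions
assume the 8.2-style "cs:" and "sync'ed:" fields, and the function names are
hypothetical.

```python
import re

# Device header line, e.g. " 1: cs:SyncSource st:Primary/Secondary ..."
DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)")
# Progress line, e.g. "[====>....] sync'ed: 25.0% (58761/78348)K"
PROGRESS_RE = re.compile(r"sync'ed:\s*([\d.]+)%")


def parse_sync_progress(proc_text):
    """Return {minor: (connection_state, percent_or_None)} from /proc/drbd-style text."""
    result = {}
    current = None
    for line in proc_text.splitlines():
        dev = DEVICE_RE.match(line)
        if dev:
            current = int(dev.group(1))
            result[current] = (dev.group(2), None)
            continue
        prog = PROGRESS_RE.search(line)
        if prog and current is not None:
            state, _ = result[current]
            result[current] = (state, float(prog.group(1)))
    return result


def stalled_devices(previous, current):
    """Minors that sync actively in both samples without any gain in percentage."""
    active = {"SyncSource", "SyncTarget"}
    stalled = []
    for minor, (state, pct) in sorted(current.items()):
        if state not in active or pct is None:
            continue
        old = previous.get(minor)
        if old and old[0] in active and old[1] is not None and pct <= old[1]:
            stalled.append(minor)
    return stalled


# Illustrative snapshots only (hypothetical values, not real output).
SAMPLE_BEFORE = """\
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
 1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    [====>...............] sync'ed: 25.0% (58761/78348)K
 2: cs:PausedSyncS st:Primary/Secondary ds:UpToDate/Inconsistent C r---
"""
SAMPLE_AFTER = SAMPLE_BEFORE.replace("25.0% (58761", "24.0% (59544")

print(stalled_devices(parse_sync_progress(SAMPLE_BEFORE),
                      parse_sync_progress(SAMPLE_AFTER)))
```

The two snapshots would have to be taken far enough apart that a healthy
resync at the configured rate must have made progress; what to do with a
flagged device (for example reconnecting it) is left out of the sketch.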

I searched the list for stalling issues, but there seems to be no definite
answer. If anyone has experience with this or information on how to prevent
such issues, it would be great. Most of the reports I saw were related to
network quality or to a huge amount of data that needed to be resynced. But
we are simply unplugging the cable.

I am enclosing the dump of the affected device only (all the others are
exactly the same except for the LVM volumes) and the corresponding
/var/log/messages section.

# drbdsetup /dev/drbd1 show
disk {
        size                    0s _is_default; # bytes
        on-io-error             detach;
        fencing                 resource-only;
        max-bio-bvecs           0 _is_default;
}
net {
        timeout                 60 _is_default; # 1/10 seconds
        max-epoch-size          512;
        max-buffers             512;
        unplug-watermark        128 _is_default;
        connect-int             2; # seconds
        ping-int                2; # seconds
        sndbuf-size             0; # bytes
        ko-count                0 _is_default;
        cram-hmac-alg           "sha1";
        shared-secret           "1-2f00e84a355fdb14-1";
        after-sb-0pri           discard-younger-primary;
        after-sb-1pri           discard-secondary;
        after-sb-2pri           call-pri-lost-after-sb;
        rr-conflict             call-pri-lost;
        ping-timeout            10; # 1/10 seconds
}
syncer {
        rate                    30720k; # bytes/second
        after                   0;
        al-extents              1801;
        verify-alg              "sha1";
}
protocol C;
_this_host {
        device                  "/dev/drbd1";
        disk                    "/dev/VolGroup00/udom";
        meta-disk               "/dev/VolGroup00/drbd_meta" [ 1 ];
        address                 ipv4 192.168.30.39:7790;
}
_remote_host {
        address                 ipv4 192.168.30.43:7790;
}

/var/log/messages:
Feb  5 09:35:04 svdom0-0148 kernel: 0000:00:04.0: eth2: Link is Up 1000 Mbps
Full Duplex, Flow Control: RX/TX
Feb  5 09:35:05 svdom0-0148 kernel: drbd3: Handshake successful: Agreed
network protocol version 88
Feb  5 09:35:05 svdom0-0148 kernel: drbd3: Peer authenticated using 20 bytes
of 'sha1' HMAC
Feb  5 09:35:05 svdom0-0148 kernel: drbd3: conn( WFConnection ->
WFReportParams )
Feb  5 09:35:05 svdom0-0148 kernel: drbd3: Starting asender thread (from
drbd3_receiver [3093])
Feb  5 09:35:05 svdom0-0148 kernel: drbd3: data-integrity-alg: <not-used>
Feb  5 09:35:05 svdom0-0148 kernel: drbd3: drbd_sync_handshake:
Feb  5 09:35:05 svdom0-0148 kernel: drbd3: self
150D9DA7C5B29BA9:3A9E4435E86729C1:3FA48D41F246037E:7E2BC89046397529
Feb  5 09:35:05 svdom0-0148 kernel: drbd3: peer
3A9E4435E86729C0:0000000000000000:3FA48D41F246037E:7E2BC89046397529
Feb  5 09:35:05 svdom0-0148 kernel: drbd3: uuid_compare()=1 by rule 7
Feb  5 09:35:05 svdom0-0148 kernel: drbd3: peer( Unknown -> Secondary )
conn( WFReportParams -> WFBitMapS )
Feb  5 09:35:05 svdom0-0148 kernel: drbd3: conn( WFBitMapS -> SyncSource )
pdsk( Outdated -> Inconsistent )
Feb  5 09:35:05 svdom0-0148 kernel: drbd3: Began resync as SyncSource (will
sync 748 KB [187 bits set]).
Feb  5 09:35:05 svdom0-0148 kernel: drbd3: Resync done (total 1 sec; paused
0 sec; 748 K/sec)
Feb  5 09:35:05 svdom0-0148 kernel: drbd3: conn( SyncSource -> Connected )
pdsk( Inconsistent -> UpToDate )
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: Handshake successful: Agreed
network protocol version 88
Feb  5 09:35:06 svdom0-0148 kernel: drbd0: Handshake successful: Agreed
network protocol version 88
Feb  5 09:35:06 svdom0-0148 kernel: drbd0: Peer authenticated using 20 bytes
of 'sha1' HMAC
Feb  5 09:35:06 svdom0-0148 kernel: drbd0: conn( WFConnection ->
WFReportParams )
Feb  5 09:35:06 svdom0-0148 kernel: drbd0: Starting asender thread (from
drbd0_receiver [3086])
Feb  5 09:35:06 svdom0-0148 kernel: drbd0: data-integrity-alg: <not-used>
Feb  5 09:35:06 svdom0-0148 kernel: drbd0: drbd_sync_handshake:
Feb  5 09:35:06 svdom0-0148 kernel: drbd0: self
80E839F9ED2989D1:C9C6F6B3B97A8D7B:4E2CE535E32C0ABF:0FA521B18D47D1B3
Feb  5 09:35:06 svdom0-0148 kernel: drbd0: peer
C9C6F6B3B97A8D7A:0000000000000000:4E2CE535E32C0ABE:0FA521B18D47D1B3
Feb  5 09:35:06 svdom0-0148 kernel: drbd0: uuid_compare()=1 by rule 7
Feb  5 09:35:06 svdom0-0148 kernel: drbd0: peer( Unknown -> Secondary )
conn( WFReportParams -> WFBitMapS )
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: Peer authenticated using 20 bytes
of 'sha1' HMAC
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: conn( WFConnection ->
WFReportParams )
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: Starting asender thread (from
drbd1_receiver [10867])
Feb  5 09:35:06 svdom0-0148 kernel: drbd0: conn( WFBitMapS -> SyncSource )
pdsk( Outdated -> Inconsistent )
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: aftr_isp( 0 -> 1 )
Feb  5 09:35:06 svdom0-0148 kernel: drbd2: aftr_isp( 0 -> 1 )
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: data-integrity-alg: <not-used>
Feb  5 09:35:06 svdom0-0148 kernel: drbd3: aftr_isp( 0 -> 1 )
Feb  5 09:35:06 svdom0-0148 kernel: drbd0: Began resync as SyncSource (will
sync 0 KB [0 bits set]).
Feb  5 09:35:06 svdom0-0148 kernel: drbd0: Resync done (total 1 sec; paused
0 sec; 0 K/sec)
Feb  5 09:35:06 svdom0-0148 kernel: drbd0: conn( SyncSource -> Connected )
pdsk( Inconsistent -> UpToDate )
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: aftr_isp( 1 -> 0 )
Feb  5 09:35:06 svdom0-0148 kernel: drbd2: aftr_isp( 1 -> 0 )
Feb  5 09:35:06 svdom0-0148 kernel: drbd3: aftr_isp( 1 -> 0 )
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: drbd_sync_handshake:
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: self
BC155EBFB3789E01:28B8724AE2280D0B:9CD4D02C2222C79E:A5C04939BEC1A435
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: peer
28B8724AE2280D0A:0000000000000000:9CD4D02C2222C79E:A5C04939BEC1A435
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: uuid_compare()=1 by rule 7
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: peer( Unknown -> Secondary )
conn( WFReportParams -> WFBitMapS )
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: conn( WFBitMapS -> SyncSource )
pdsk( Outdated -> Inconsistent )
Feb  5 09:35:06 svdom0-0148 kernel: drbd2: aftr_isp( 0 -> 1 )
Feb  5 09:35:06 svdom0-0148 kernel: drbd3: aftr_isp( 0 -> 1 )
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: Began resync as SyncSource (will
sync 78348 KB [19587 bits set]).
Feb  5 09:35:06 svdom0-0148 kernel: drbd3: peer_isp( 0 -> 1 )
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: Implicit set pdsk Inconsistent!
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: conn( SyncSource -> PausedSyncS )
peer_isp( 0 -> 1 )
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: Resync suspended
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: conn( PausedSyncS -> SyncSource )
pdsk( Inconsistent -> Outdated ) peer_isp( 1 -> 0 )
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: Syncer continues.
Feb  5 09:35:06 svdom0-0148 kernel: drbd3: peer_isp( 1 -> 0 )
Feb  5 09:35:06 svdom0-0148 kernel: drbd3: peer_isp( 0 -> 1 )
Feb  5 09:35:06 svdom0-0148 kernel: drbd1: cs:SyncSource rs_left=19637 >
rs_total=19587 (rs_failed 0)
Feb  5 09:35:07 svdom0-0148 heartbeat: [4284]: info: Link svdom0-0146:eth2
up.
Feb  5 09:35:07 svdom0-0148 ipfail: [4408]: info: Link Status update: Link
svdom0-0146/eth2 now has status up
Feb  5 09:35:07 svdom0-0148 kernel: drbd2: Handshake successful: Agreed
network protocol version 88
Feb  5 09:35:07 svdom0-0148 kernel: drbd2: Peer authenticated using 20 bytes
of 'sha1' HMAC
Feb  5 09:35:07 svdom0-0148 kernel: drbd2: conn( WFConnection ->
WFReportParams )
Feb  5 09:35:07 svdom0-0148 kernel: drbd2: Starting asender thread (from
drbd2_receiver [3091])
Feb  5 09:35:07 svdom0-0148 kernel: drbd2: data-integrity-alg: <not-used>
Feb  5 09:35:07 svdom0-0148 kernel: drbd2: drbd_sync_handshake:
Feb  5 09:35:07 svdom0-0148 kernel: drbd2: self
6F1EE2FDA1AC2477:BB754D8C3F96D9A5:5AAC0CAD16A6DA72:F3F1734E970763D9
Feb  5 09:35:07 svdom0-0148 kernel: drbd2: peer
BB754D8C3F96D9A4:0000000000000000:5AAC0CAD16A6DA73:F3F1734E970763D9
Feb  5 09:35:07 svdom0-0148 kernel: drbd2: uuid_compare()=1 by rule 7
Feb  5 09:35:07 svdom0-0148 kernel: drbd2: peer( Unknown -> Secondary )
conn( WFReportParams -> WFBitMapS ) peer_isp( 0 -> 1 )
Feb  5 09:35:07 svdom0-0148 kernel: drbd2: conn( WFBitMapS -> PausedSyncS )
pdsk( Outdated -> Inconsistent )
Feb  5 09:35:07 svdom0-0148 kernel: drbd2: Began resync as PausedSyncS (will
sync 0 KB [0 bits set]).
Feb  5 09:35:07 svdom0-0148 kernel: drbd2: Resync done (total 1 sec; paused
0 sec; 0 K/sec)
Feb  5 09:35:07 svdom0-0148 kernel: drbd2: conn( PausedSyncS -> Connected )
pdsk( Inconsistent -> UpToDate )
Feb  5 09:35:09 svdom0-0148 kernel: drbd2: peer_isp( 1 -> 0 )
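
The figures in the log above can be cross-checked with a little arithmetic.
The sketch below assumes that each bitmap bit covers 4 KB, which is
consistent with every "will sync N KB [M bits set]" line in the log, and it
reads the "rs_left > rs_total" message for drbd1 as blocks dirtied after the
resync total was recorded. That reading is my interpretation, not something
the log states. The function names are hypothetical.

```python
import re

KB_PER_BIT = 4  # assumption: one bitmap bit per 4 KB, matching the log figures

# Message texts copied from the log above (timestamps and host name dropped).
LOG = [
    "drbd3: Began resync as SyncSource (will sync 748 KB [187 bits set]).",
    "drbd0: Began resync as SyncSource (will sync 0 KB [0 bits set]).",
    "drbd1: Began resync as SyncSource (will sync 78348 KB [19587 bits set]).",
    "drbd1: cs:SyncSource rs_left=19637 > rs_total=19587 (rs_failed 0)",
]

BEGAN_RE = re.compile(
    r"(drbd\d+): Began resync as \w+ \(will sync (\d+) KB \[(\d+) bits set\]\)")
LEFT_RE = re.compile(r"(drbd\d+): cs:\w+ rs_left=(\d+) > rs_total=(\d+)")


def mismatched_sizes(lines):
    """Devices whose reported KB figure differs from bits * KB_PER_BIT."""
    bad = []
    for line in lines:
        m = BEGAN_RE.search(line)
        if m and int(m.group(2)) != int(m.group(3)) * KB_PER_BIT:
            bad.append(m.group(1))
    return bad


def extra_dirty_kb(lines):
    """Map device -> KB by which rs_left exceeds rs_total."""
    extra = {}
    for line in lines:
        m = LEFT_RE.search(line)
        if m:
            extra[m.group(1)] = (int(m.group(2)) - int(m.group(3))) * KB_PER_BIT
    return extra


print(mismatched_sizes(LOG))  # empty list: all KB figures equal bits * 4
print(extra_dirty_kb(LOG))    # drbd1 exceeds its total by 50 bits = 200 KB
```

If that reading is right, drbd1 had 200 KB more to send than the total it
started with within the same second, which would fit the VM still writing
to the device during the resync.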