<div>Hi all,</div>
<div> </div>
<div>We are running Xen VMs on top of DRBD, with the DRBD resources defined on top of LVM volumes. We use 64-bit CentOS 5.2 (2.6.18-92.1.22.el5xen). Previously we tested the setup with the DRBD RPMs from the CentOS distribution (8.2.6-3), but we hit an issue: when the DRBD communication path is broken for some time (for simple tests we just unplugged the dedicated crossover cable), a device that still has a Xen VM running on top of it stalls at 100% sync progress after reconnection. This was easily reproducible, and the more changes occurred on the device while it was disconnected, the higher the probability of a stall. We serialize resyncs (using the "after" option in the config), so for us this means that all the devices that follow are stuck in PausedSync states with inconsistent data. Reconnecting the stalled device solves the issue; however, there is no handler for such situations, and the device itself looks happy (still syncing, although at 100%).</div>
<div>So we upgraded to DRBD 8.2.7 (GIT-hash: 61b7f4c2fc34fe3d2acf7be6bcc1fc2684708a7d), since it seemed that this release fixed the issue. However, we still experience it, although less often, and the behaviour is different: the device stalls at, e.g., 25% and then the percentage decreases. I think this is because new changes keep arriving, so the updated statistics give such results.</div>
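<div>To illustrate the guess about the decreasing percentage, here is a minimal Python sketch. It is not DRBD's actual kernel formula: the function names, the clamping, and the 20000-bit example values are our own assumptions. The 4 KB-per-bit granularity is inferred from the log further down (187 bits reported as 748 KB, 19587 bits as 78348 KB). If the total is fixed when the resync starts while new writes keep setting bits, the computed progress can go down, and it drops to zero once rs_left exceeds rs_total, as in the drbd1 warning in the log.</div>

```python
# Hypothetical helper names; a simplified model, not DRBD's kernel code.
BM_BLOCK_KB = 4  # assumption inferred from the log: one bitmap bit covers 4 KB


def bits_to_kb(bits):
    """Convert a count of out-of-sync bitmap bits to kilobytes."""
    return bits * BM_BLOCK_KB


def percent_done(rs_total, rs_left):
    """Estimate resync progress from the bits to sync at start (rs_total)
    and the bits still out of sync (rs_left)."""
    if rs_total == 0:
        return 100.0
    # New writes during the resync can push rs_left above rs_total,
    # so clamp the amount done at zero.
    done = max(rs_total - rs_left, 0)
    return 100.0 * done / rs_total


# Values taken from the log below.
print(bits_to_kb(187))              # drbd3: "will sync 748 KB [187 bits set]"
print(bits_to_kb(19587))            # drbd1: "will sync 78348 KB [19587 bits set]"
print(percent_done(19587, 19637))   # drbd1: rs_left=19637 > rs_total=19587

# Made-up numbers: progress drops when 1000 more bits get dirtied mid-resync.
print(percent_done(20000, 15000))
print(percent_done(20000, 16000))
```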
<div> </div>
<div>I searched the list for stalling issues, but there seems to be no definite answer. If anyone has experience with this or any information on how to prevent such issues, it would be great. Most of the issues I saw were related to network quality or to a huge amount of data that needed to be resynced. But we are simply unplugging the cable.</div>
<div> </div>
<div>I am enclosing the dump of the affected device only (all the others are exactly the same except for the LVM volumes) and the corresponding /var/log/messages section.</div>
<div> </div>
<div># drbdsetup /dev/drbd1 show<br>disk {<br> size 0s _is_default; # bytes<br> on-io-error detach;<br> fencing resource-only;<br> max-bio-bvecs 0 _is_default;<br>
}<br>net {<br> timeout 60 _is_default; # 1/10 seconds<br> max-epoch-size 512;<br> max-buffers 512;<br> unplug-watermark 128 _is_default;<br> connect-int 2; # seconds<br>
ping-int 2; # seconds<br> sndbuf-size 0; # bytes<br> ko-count 0 _is_default;<br> cram-hmac-alg "sha1";<br> shared-secret "1-2f00e84a355fdb14-1";<br>
after-sb-0pri discard-younger-primary;<br> after-sb-1pri discard-secondary;<br> after-sb-2pri call-pri-lost-after-sb;<br> rr-conflict call-pri-lost;<br>
ping-timeout 10; # 1/10 seconds<br>}<br>syncer {<br> rate 30720k; # bytes/second<br> after 0;<br> al-extents 1801;<br> verify-alg "sha1";<br>
}<br>protocol C;<br>_this_host {<br> device "/dev/drbd1";<br> disk "/dev/VolGroup00/udom";<br> meta-disk "/dev/VolGroup00/drbd_meta" [ 1 ];<br>
address ipv4 192.168.30.39:7790;<br>}<br>_remote_host {<br> address ipv4 192.168.30.43:7790;<br>
}<br></div>
<div>Feb 5 09:35:04 svdom0-0148 kernel: 0000:00:04.0: eth2: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX<br>Feb 5 09:35:05 svdom0-0148 kernel: drbd3: Handshake successful: Agreed network protocol version 88<br>
Feb 5 09:35:05 svdom0-0148 kernel: drbd3: Peer authenticated using 20 bytes of 'sha1' HMAC<br>Feb 5 09:35:05 svdom0-0148 kernel: drbd3: conn( WFConnection -> WFReportParams ) <br>Feb 5 09:35:05 svdom0-0148 kernel: drbd3: Starting asender thread (from drbd3_receiver [3093])<br>
Feb 5 09:35:05 svdom0-0148 kernel: drbd3: data-integrity-alg: &lt;not-used&gt;<br>Feb 5 09:35:05 svdom0-0148 kernel: drbd3: drbd_sync_handshake:<br>Feb 5 09:35:05 svdom0-0148 kernel: drbd3: self 150D9DA7C5B29BA9:3A9E4435E86729C1:3FA48D41F246037E:7E2BC89046397529<br>
Feb 5 09:35:05 svdom0-0148 kernel: drbd3: peer 3A9E4435E86729C0:0000000000000000:3FA48D41F246037E:7E2BC89046397529<br>Feb 5 09:35:05 svdom0-0148 kernel: drbd3: uuid_compare()=1 by rule 7<br>Feb 5 09:35:05 svdom0-0148 kernel: drbd3: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) <br>
Feb 5 09:35:05 svdom0-0148 kernel: drbd3: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent ) <br>Feb 5 09:35:05 svdom0-0148 kernel: drbd3: Began resync as SyncSource (will sync 748 KB [187 bits set]).<br>
Feb 5 09:35:05 svdom0-0148 kernel: drbd3: Resync done (total 1 sec; paused 0 sec; 748 K/sec)<br>Feb 5 09:35:05 svdom0-0148 kernel: drbd3: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd1: Handshake successful: Agreed network protocol version 88<br>
Feb 5 09:35:06 svdom0-0148 kernel: drbd0: Handshake successful: Agreed network protocol version 88<br>Feb 5 09:35:06 svdom0-0148 kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC<br>Feb 5 09:35:06 svdom0-0148 kernel: drbd0: conn( WFConnection -> WFReportParams ) <br>
Feb 5 09:35:06 svdom0-0148 kernel: drbd0: Starting asender thread (from drbd0_receiver [3086])<br>Feb 5 09:35:06 svdom0-0148 kernel: drbd0: data-integrity-alg: &lt;not-used&gt;<br>Feb 5 09:35:06 svdom0-0148 kernel: drbd0: drbd_sync_handshake:<br>
Feb 5 09:35:06 svdom0-0148 kernel: drbd0: self 80E839F9ED2989D1:C9C6F6B3B97A8D7B:4E2CE535E32C0ABF:0FA521B18D47D1B3<br>Feb 5 09:35:06 svdom0-0148 kernel: drbd0: peer C9C6F6B3B97A8D7A:0000000000000000:4E2CE535E32C0ABE:0FA521B18D47D1B3<br>
Feb 5 09:35:06 svdom0-0148 kernel: drbd0: uuid_compare()=1 by rule 7<br>Feb 5 09:35:06 svdom0-0148 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC<br>
Feb 5 09:35:06 svdom0-0148 kernel: drbd1: conn( WFConnection -> WFReportParams ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd1: Starting asender thread (from drbd1_receiver [10867])<br>Feb 5 09:35:06 svdom0-0148 kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent ) <br>
Feb 5 09:35:06 svdom0-0148 kernel: drbd1: aftr_isp( 0 -> 1 ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd2: aftr_isp( 0 -> 1 ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd1: data-integrity-alg: &lt;not-used&gt;<br>
Feb 5 09:35:06 svdom0-0148 kernel: drbd3: aftr_isp( 0 -> 1 ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd0: Began resync as SyncSource (will sync 0 KB [0 bits set]).<br>Feb 5 09:35:06 svdom0-0148 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)<br>
Feb 5 09:35:06 svdom0-0148 kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd1: aftr_isp( 1 -> 0 ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd2: aftr_isp( 1 -> 0 ) <br>
Feb 5 09:35:06 svdom0-0148 kernel: drbd3: aftr_isp( 1 -> 0 ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd1: drbd_sync_handshake:<br>Feb 5 09:35:06 svdom0-0148 kernel: drbd1: self BC155EBFB3789E01:28B8724AE2280D0B:9CD4D02C2222C79E:A5C04939BEC1A435<br>
Feb 5 09:35:06 svdom0-0148 kernel: drbd1: peer 28B8724AE2280D0A:0000000000000000:9CD4D02C2222C79E:A5C04939BEC1A435<br>Feb 5 09:35:06 svdom0-0148 kernel: drbd1: uuid_compare()=1 by rule 7<br>Feb 5 09:35:06 svdom0-0148 kernel: drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) <br>
Feb 5 09:35:06 svdom0-0148 kernel: drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd2: aftr_isp( 0 -> 1 ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd3: aftr_isp( 0 -> 1 ) <br>
Feb 5 09:35:06 svdom0-0148 kernel: drbd1: Began resync as SyncSource (will sync 78348 KB [19587 bits set]).<br>Feb 5 09:35:06 svdom0-0148 kernel: drbd3: peer_isp( 0 -> 1 ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd1: Implicit set pdsk Inconsistent!<br>
Feb 5 09:35:06 svdom0-0148 kernel: drbd1: conn( SyncSource -> PausedSyncS ) peer_isp( 0 -> 1 ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd1: Resync suspended<br>Feb 5 09:35:06 svdom0-0148 kernel: drbd1: conn( PausedSyncS -> SyncSource ) pdsk( Inconsistent -> Outdated ) peer_isp( 1 -> 0 ) <br>
Feb 5 09:35:06 svdom0-0148 kernel: drbd1: Syncer continues.<br>Feb 5 09:35:06 svdom0-0148 kernel: drbd3: peer_isp( 1 -> 0 ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd3: peer_isp( 0 -> 1 ) <br>Feb 5 09:35:06 svdom0-0148 kernel: drbd1: cs:SyncSource rs_left=19637 > rs_total=19587 (rs_failed 0)<br>
Feb 5 09:35:07 svdom0-0148 heartbeat: [4284]: info: Link svdom0-0146:eth2 up.<br>Feb 5 09:35:07 svdom0-0148 ipfail: [4408]: info: Link Status update: Link svdom0-0146/eth2 now has status up<br>Feb 5 09:35:07 svdom0-0148 kernel: drbd2: Handshake successful: Agreed network protocol version 88<br>
Feb 5 09:35:07 svdom0-0148 kernel: drbd2: Peer authenticated using 20 bytes of 'sha1' HMAC<br>Feb 5 09:35:07 svdom0-0148 kernel: drbd2: conn( WFConnection -> WFReportParams ) <br>Feb 5 09:35:07 svdom0-0148 kernel: drbd2: Starting asender thread (from drbd2_receiver [3091])<br>
Feb 5 09:35:07 svdom0-0148 kernel: drbd2: data-integrity-alg: &lt;not-used&gt;<br>Feb 5 09:35:07 svdom0-0148 kernel: drbd2: drbd_sync_handshake:<br>Feb 5 09:35:07 svdom0-0148 kernel: drbd2: self 6F1EE2FDA1AC2477:BB754D8C3F96D9A5:5AAC0CAD16A6DA72:F3F1734E970763D9<br>
Feb 5 09:35:07 svdom0-0148 kernel: drbd2: peer BB754D8C3F96D9A4:0000000000000000:5AAC0CAD16A6DA73:F3F1734E970763D9<br>Feb 5 09:35:07 svdom0-0148 kernel: drbd2: uuid_compare()=1 by rule 7<br>Feb 5 09:35:07 svdom0-0148 kernel: drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) peer_isp( 0 -> 1 ) <br>
Feb 5 09:35:07 svdom0-0148 kernel: drbd2: conn( WFBitMapS -> PausedSyncS ) pdsk( Outdated -> Inconsistent ) <br>Feb 5 09:35:07 svdom0-0148 kernel: drbd2: Began resync as PausedSyncS (will sync 0 KB [0 bits set]).<br>
Feb 5 09:35:07 svdom0-0148 kernel: drbd2: Resync done (total 1 sec; paused 0 sec; 0 K/sec)<br>Feb 5 09:35:07 svdom0-0148 kernel: drbd2: conn( PausedSyncS -> Connected ) pdsk( Inconsistent -> UpToDate ) <br>Feb 5 09:35:09 svdom0-0148 kernel: drbd2: peer_isp( 1 -> 0 ) <br>
</div>
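<div>Since there is no handler for a stalled resync, one possible stopgap is to scan the kernel log for devices that logged "Began resync" but never logged "Resync done". The Python sketch below is only an illustration: the function name and the regular expressions are our own, written against the message wording seen in the log above, and other DRBD versions may word these messages differently. A device reported here is not necessarily stalled (it may simply still be syncing), so a real check would also need a time threshold.</div>

```python
import re

# Patterns written against the log messages quoted above (an assumption:
# other DRBD versions may use different wording).
BEGAN = re.compile(r"kernel: (drbd\d+): Began resync as \w+ \(will sync (\d+) KB")
DONE = re.compile(r"kernel: (drbd\d+): Resync done")


def unfinished_resyncs(lines):
    """Return {device: KB to sync} for every device whose most recent
    'Began resync' message has no later 'Resync done' message."""
    pending = {}
    for line in lines:
        began = BEGAN.search(line)
        if began:
            pending[began.group(1)] = int(began.group(2))
            continue
        done = DONE.search(line)
        if done:
            pending.pop(done.group(1), None)
    return pending


# A few lines copied from the log above.
sample = [
    "Feb 5 09:35:05 svdom0-0148 kernel: drbd3: Began resync as SyncSource (will sync 748 KB [187 bits set]).",
    "Feb 5 09:35:05 svdom0-0148 kernel: drbd3: Resync done (total 1 sec; paused 0 sec; 748 K/sec)",
    "Feb 5 09:35:06 svdom0-0148 kernel: drbd0: Began resync as SyncSource (will sync 0 KB [0 bits set]).",
    "Feb 5 09:35:06 svdom0-0148 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)",
    "Feb 5 09:35:06 svdom0-0148 kernel: drbd1: Began resync as SyncSource (will sync 78348 KB [19587 bits set]).",
    "Feb 5 09:35:07 svdom0-0148 kernel: drbd2: Began resync as PausedSyncS (will sync 0 KB [0 bits set]).",
    "Feb 5 09:35:07 svdom0-0148 kernel: drbd2: Resync done (total 1 sec; paused 0 sec; 0 K/sec)",
]

result = unfinished_resyncs(sample)
print(result)
```

<div>On the excerpt above, only drbd1 is left without a matching "Resync done" line, which is the device we see stalling.</div>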