Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,

We have been using DRBD for about 4 years now, so I have some experience with it. Today was the first time that DRBD actually caused data loss.

We mostly use DRBD in cases where older hardware is reused (upgraded with HDDs) and serves as hot storage in case of primary node failure. This also means a lot of use of md arrays, as usually only the primary has a suitable RAID controller with BBU. I understand the shortcomings of such setups and have always managed to achieve suitable performance. Nothing is perfect and you need to work with what you have. Most of the setups still run 8.3 (with CentOS 6); I have been testing 8.4 on some nodes, and that is where the problem lies.

One of the setups replicates over WAN with protocol A (no proxy) - again, not perfect, but you do what you have to do. The system had been running stable since 2012. I have been using the disk-timeout option (set to 900) and ko-count (5). Again, this is not perfect (I read the documentation), but we have been bitten a few times when a bad sector on a disk rendered the entire cluster unresponsive. That is unacceptable: having some mechanism to detach local storage in case of problems is one of the reasons for building a cluster in the first place.

The problem starts at approx. 10:21:

Jan 16 10:21:33 primary kernel: block drbd2: Local backing device failed to meet the disk-timeout
Jan 16 10:21:33 primary kernel: block drbd2: disk( UpToDate -> Failed )
Jan 16 10:21:33 primary kernel: block drbd2: Local IO failed in request_timer_fn. Detaching...
Jan 16 10:21:33 primary kernel: block drbd2: local WRITE IO error sector 111041344+8 on md22
Jan 16 10:21:33 primary kernel: block drbd2: helper command: /sbin/drbdadm local-io-error minor-2

The local device is timing out after 90 s (!), without any I/O-related errors in the kernel log. The disks were recently upgraded to SSDs, so bad blocks are not really plausible!

Jan 16 10:22:33 primary kernel: block drbd2: Remote failed to finish a request within ko-count * timeout
Jan 16 10:22:33 primary kernel: block drbd2: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
Jan 16 10:22:33 primary kernel: drbd kvmimages: susp( 0 -> 1 )
Jan 16 10:22:33 primary kernel: drbd kvmimages: asender terminated
Jan 16 10:22:33 primary kernel: drbd kvmimages: Terminating drbd_a_kvmimages

Exactly one minute later (!), the remote stops responding (both fail at the same time?). As suspend-io is configured (on-no-data-accessible), all I/O is suspended. OK, unusual, but still acceptable so far.

Jan 16 10:23:34 primary kernel: block drbd2: helper command: /sbin/drbdadm local-io-error minor-2 exit code 20 (0x1400)
Jan 16 10:23:34 primary kernel: block drbd2: bitmap WRITE of 0 pages took 0 jiffies
Jan 16 10:23:34 primary kernel: block drbd2: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 16 10:23:34 primary kernel: block drbd2: disk( Failed -> Diskless )
Jan 16 10:23:34 primary kernel: block drbd2: helper command: /sbin/drbdadm pri-on-incon-degr minor-2
Jan 16 10:23:34 primary kernel: block drbd2: helper command: /sbin/drbdadm pri-on-incon-degr minor-2 exit code 0 (0x0)
Jan 16 10:23:34 primary kernel: drbd kvmimages: Connection closed
Jan 16 10:23:34 primary kernel: drbd kvmimages: out of mem, failed to invoke fence-peer helper
Jan 16 10:23:34 primary kernel: drbd kvmimages: conn( Timeout -> Unconnected )
Jan 16 10:23:34 primary kernel: drbd kvmimages: receiver terminated
Jan 16 10:23:34 primary kernel: drbd kvmimages: Restarting receiver thread
Jan 16 10:23:34 primary kernel: drbd kvmimages: receiver (re)started
Jan 16 10:23:34 primary kernel: drbd kvmimages: conn( Unconnected -> WFConnection )
Jan 16 10:23:35 primary kernel: drbd kvmimages: Handshake successful: Agreed network protocol version 101
Jan 16 10:23:35 primary kernel: drbd kvmimages: Agreed to support TRIM on protocol level
Jan 16 10:23:35 primary kernel: drbd kvmimages: conn( WFConnection -> WFReportParams )
Jan 16 10:23:35 primary kernel: drbd kvmimages: Starting asender thread (from drbd_r_kvmimages [42146])
Jan 16 10:23:35 primary kernel: block drbd2: receiver updated UUIDs to effective data uuid: C7F11A2B6505F460
Jan 16 10:23:35 primary kernel: block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
Jan 16 10:23:35 primary kernel: block drbd2: Should have called drbd_al_complete_io(, 111041344, 4096), but my Disk seems to have failed :(
Jan 16 10:23:35 primary kernel: drbd kvmimages: susp( 1 -> 0 )

Helpers were called, the remote reconnected, and the suspend was lifted. At this point the primary is running Diskless.

Now here comes the fun part: the admin (me) notices this after a couple of hours (I was away) and re-attaches the disk on the primary node (as per the documentation: "If using internal meta data, it is sufficient to bind the DRBD device to the new hard disk."). Here is where everything goes south:

Jan 16 12:22:03 primary kernel: block drbd2: disk( Diskless -> Attaching )
Jan 16 12:22:03 primary kernel: block drbd2: recounting of set bits took additional 3 jiffies
Jan 16 12:22:03 primary kernel: block drbd2: 4948 MB (1266688 bits) marked out-of-sync by on disk bit-map.
Jan 16 12:22:03 primary kernel: block drbd2: disk( Attaching -> Negotiating )
Jan 16 12:22:03 primary kernel: block drbd2: attached to UUIDs C7F11A2B6505F461:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AB
Jan 16 12:22:03 primary kernel: block drbd2: drbd_sync_handshake:
Jan 16 12:22:03 primary kernel: block drbd2: self C7F11A2B6505F461:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AB bits:1266688 flags:0
Jan 16 12:22:03 primary kernel: block drbd2: peer C7F11A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA bits:53805 flags:0
Jan 16 12:22:03 primary kernel: block drbd2: uuid_compare()=1 by rule 40
Jan 16 12:22:03 primary kernel: block drbd2: conn( Connected -> WFBitMapS ) disk( Negotiating -> UpToDate ) pdsk( UpToDate -> Consistent )

Why is the primary's disk now UpToDate, when it was offline for two hours? It should be Inconsistent and resynced from the secondary.
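(In hindsight, the only way I can see to get that behaviour would have been to invalidate the local data by hand around the attach. A minimal sketch of what I mean, using the resource name from the config below - not something I have verified on 8.4:

# on the primary, which is still Diskless at this point
drbdadm attach kvmimages       # re-bind the backing device, as per the docs
drbdadm invalidate kvmimages   # explicitly mark the local disk Inconsistent,
                               # so this node resyncs as SyncTarget from the peer

But needing to do that by hand would defeat the documented attach procedure - more on that at the end.)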
To make matters worse, pull-ahead was triggered at the same time (remember, we are replicating over WAN), since 5 GB had been marked out-of-sync by declaring the disk on the primary UpToDate:

Jan 16 12:22:03 primary kernel: block drbd2: conn( Connected -> WFBitMapS ) disk( Negotiating -> UpToDate ) pdsk( UpToDate -> Consistent )
Jan 16 12:22:03 primary kernel: block drbd2: Congestion-fill threshold reached
Jan 16 12:22:03 primary kernel: block drbd2: conn( WFBitMapS -> Ahead )
Jan 16 12:22:03 primary kernel: block drbd2: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 4846(2), total 4846; compression: 99.9%

The disks now stay in this state; the sync does not actually start:

[root@primary]# cat /proc/drbd
version: 8.4.5 (api:1/proto:86-101)
GIT-hash: 1d360bde0e095d495786eaeb2a1ac76888e4db96 build by phil@Build64R6, 2014-10-28 10:32:53
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
    ns:48488816 nr:0 dw:106632540 dr:213443889 al:24231 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Ahead ro:Primary/Secondary ds:UpToDate/Consistent A r-----
    ns:0 nr:0 dw:564028 dr:163202 al:52275 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:5190228
 3: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
    ns:13699098 nr:0 dw:34361764 dr:248373080 al:3374 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

[root@secondary]# cat /proc/drbd
version: 8.4.5 (api:1/proto:86-101)
GIT-hash: 1d360bde0e095d495786eaeb2a1ac76888e4db96 build by phil@Build64R6, 2014-10-28 10:32:53
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
    ns:0 nr:49240716 dw:61104624 dr:176250476 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Behind ro:Secondary/Primary ds:Outdated/UpToDate A r-----
    ns:53230 nr:46356310 dw:48639878 dr:55173318 al:886 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:5241020
 3: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
    ns:0 nr:21494090 dw:21494090 dr:195651772 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

With both nodes certain that the primary is UpToDate!
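(With verify-alg sha1 configured - see the config at the end - an online verify would at least have shown how much data really differed between the two nodes at this point. A sketch of the invocation; I did not actually try it while the connection sat in Ahead/Behind, and I would expect it to require a Connected state:

drbdadm verify kvmimages    # peers compare block checksums (sha1, as configured);
                            # differing blocks are counted as oos: in /proc/drbd
                            # and reported through the out-of-sync handler
)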
Recognizing the problem, I started shutting down services and copying all the data (whatever state it was in) off the DRBD partitions. Somewhere along the copying - maybe the WAN link failed, or something else happened - DRBD started resyncing:

Jan 16 16:36:16 primary kernel: block drbd2: new current UUID A30F012BA25C9803:C7F11A2B6505F461:9C59495AB871DD6E:2DCB0BB166FC17AB
Jan 16 16:36:38 primary kernel: block drbd2: drbd_sync_handshake:
Jan 16 16:36:38 primary kernel: block drbd2: self A30F012BA25C9803:C7F11A2B6505F461:9C59495AB871DD6E:2DCB0BB166FC17AB bits:1299480 flags:0
Jan 16 16:36:38 primary kernel: block drbd2: peer C7F11A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA bits:1312123 flags:0
Jan 16 16:36:38 primary kernel: block drbd2: uuid_compare()=1 by rule 70
Jan 16 16:36:38 primary kernel: block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
Jan 16 16:36:38 primary kernel: block drbd2: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 6711(2), total 6711; compression: 99.8%
Jan 16 16:36:39 primary kernel: block drbd2: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 7015(2), total 7015; compression: 99.8%
Jan 16 16:36:39 primary kernel: block drbd2: helper command: /sbin/drbdadm before-resync-source minor-2
Jan 16 16:36:39 primary kernel: block drbd2: helper command: /sbin/drbdadm before-resync-source minor-2 exit code 0 (0x0)
Jan 16 16:36:39 primary kernel: block drbd2: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Jan 16 16:36:39 primary kernel: block drbd2: Began resync as SyncSource (will sync 5248492 KB [1312123 bits set]).
Jan 16 16:36:39 primary kernel: block drbd2: updated sync UUID A30F012BA25C9803:C7F21A2B6505F461:C7F11A2B6505F461:9C59495AB871DD6E
Jan 16 16:41:29 primary kernel: block drbd2: Resync done (total 289 sec; paused 0 sec; 18160 K/sec)
Jan 16 16:41:29 primary kernel: block drbd2: 69 % had equal checksums, eliminated: 3672368K; transferred 1576124K total 5248492K
Jan 16 16:41:29 primary kernel: block drbd2: updated UUIDs A30F012BA25C9803:0000000000000000:C7F21A2B6505F461:C7F11A2B6505F461
Jan 16 16:41:29 primary kernel: block drbd2: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )

...and overwrote approx. 1.5 GB of data on the secondary with (probably stale) data from the primary, plus whatever new data had been written to the primary in between. There is nothing in the secondary's logs to indicate any disk problems around 10:22:

Jan 16 10:22:33 secondary kernel: drbd kvmimages: sock was shut down by peer
Jan 16 10:22:33 secondary kernel: drbd kvmimages: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Jan 16 10:22:33 secondary kernel: drbd kvmimages: short read (expected size 16)
Jan 16 10:22:33 secondary kernel: drbd kvmimages: meta connection shut down by peer.
Jan 16 10:22:33 secondary kernel: drbd kvmimages: asender terminated
Jan 16 10:22:33 secondary kernel: drbd kvmimages: Terminating drbd_a_kvmimages
Jan 16 10:22:33 secondary kernel: drbd kvmimages: Connection closed
Jan 16 10:22:33 secondary kernel: drbd kvmimages: conn( BrokenPipe -> Unconnected )
Jan 16 10:22:33 secondary kernel: drbd kvmimages: receiver terminated
Jan 16 10:22:34 secondary kernel: drbd kvmimages: Restarting receiver thread
Jan 16 10:22:34 secondary kernel: drbd kvmimages: receiver (re)started
Jan 16 10:22:34 secondary kernel: drbd kvmimages: conn( Unconnected -> WFConnection )
Jan 16 10:23:35 secondary kernel: drbd kvmimages: Handshake successful: Agreed network protocol version 101
Jan 16 10:23:35 secondary kernel: drbd kvmimages: Agreed to support TRIM on protocol level
Jan 16 10:23:35 secondary kernel: drbd kvmimages: conn( WFConnection -> WFReportParams )
Jan 16 10:23:35 secondary kernel: drbd kvmimages: Starting asender thread (from drbd_r_kvmimages [11935])
Jan 16 10:23:35 secondary kernel: block drbd2: peer( Unknown -> Primary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Diskless )

As far as the secondary is concerned, the primary simply terminated the connection!? After the primary was reattached:

Jan 16 12:22:03 secondary kernel: block drbd2: real peer disk state = Consistent
Jan 16 12:22:03 secondary kernel: block drbd2: drbd_sync_handshake:
Jan 16 12:22:03 secondary kernel: block drbd2: self C7F11A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA bits:0 flags:0
Jan 16 12:22:03 secondary kernel: block drbd2: peer C7F11A2B6505F461:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AB bits:1266688 flags:2
Jan 16 12:22:03 secondary kernel: block drbd2: uuid_compare()=-1 by rule 40
Jan 16 12:22:03 secondary kernel: block drbd2: conn( Connected -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( Diskless -> UpToDate )
Jan 16 12:22:04 secondary kernel: block drbd2: conn( WFBitMapT -> Behind )
Jan 16 12:22:04 secondary kernel: block drbd2: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 4846(2), total 4846; compression: 99.9%
Jan 16 12:22:04 secondary kernel: block drbd2: unexpected cstate (Behind) in receive_bitmap

And then it happily overwrote its own data:

Jan 16 16:36:38 secondary kernel: block drbd2: drbd_sync_handshake:
Jan 16 16:36:38 secondary kernel: block drbd2: self C7F11A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA bits:1312123 flags:0
Jan 16 16:36:38 secondary kernel: block drbd2: peer A30F012BA25C9803:C7F11A2B6505F461:9C59495AB871DD6E:2DCB0BB166FC17AB bits:1299480 flags:2
Jan 16 16:36:38 secondary kernel: block drbd2: uuid_compare()=-1 by rule 50
Jan 16 16:36:38 secondary kernel: block drbd2: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Jan 16 16:36:39 secondary kernel: block drbd2: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 6711(2), total 6711; compression: 99.8%
Jan 16 16:36:39 secondary kernel: block drbd2: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 7015(2), total 7015; compression: 99.8%
Jan 16 16:36:39 secondary kernel: block drbd2: conn( WFBitMapT -> WFSyncUUID )
Jan 16 16:36:39 secondary kernel: block drbd2: updated sync uuid C7F21A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA
Jan 16 16:36:39 secondary kernel: block drbd2: helper command: /sbin/drbdadm before-resync-target minor-2
Jan 16 16:36:39 secondary kernel: block drbd2: helper command: /sbin/drbdadm before-resync-target minor-2 exit code 0 (0x0)
Jan 16 16:36:39 secondary kernel: block drbd2: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
Jan 16 16:36:39 secondary kernel: block drbd2: Began resync as SyncTarget (will sync 5248492 KB [1312123 bits set]).
Jan 16 16:41:29 secondary kernel: block drbd2: Resync done (total 289 sec; paused 0 sec; 18160 K/sec)
Jan 16 16:41:29 secondary kernel: block drbd2: 69 % had equal checksums, eliminated: 3672368K; transferred 1576124K total 5248492K
Jan 16 16:41:29 secondary kernel: block drbd2: updated UUIDs A30F012BA25C9802:0000000000000000:C7F21A2B6505F460:C7F11A2B6505F461
Jan 16 16:41:29 secondary kernel: block drbd2: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Jan 16 16:41:29 secondary kernel: block drbd2: helper command: /sbin/drbdadm after-resync-target minor-2
Jan 16 16:41:29 secondary kernel: block drbd2: helper command: /sbin/drbdadm after-resync-target minor-2 exit code 0 (0x0)

Now, sure, the documentation states that disk-timeout is dangerous. I accept that - "buy better hardware" - but the point here is that there does not seem to be anything wrong with the hardware. And even if there were, DRBD should be able to recognize where the UpToDate data is - that is the whole point.

The most interesting part is that this was actually the second occurrence of a very similar problem. The first was on vastly different hardware (a PERC H710 with BBU); there is absolutely no way it would time out:

Jan 12 19:36:30 glinx kernel: block drbd1: Local backing device failed to meet the disk-timeout
Jan 12 19:36:30 glinx kernel: block drbd1: disk( UpToDate -> Failed )
Jan 12 19:36:30 glinx kernel: block drbd1: Local IO failed in request_timer_fn. Detaching...
Jan 12 19:36:30 glinx kernel: block drbd1: helper command: /sbin/drbdadm local-io-error minor-1
Jan 12 19:37:30 glinx kernel: block drbd1: Remote failed to finish a request within ko-count * timeout
Jan 12 19:37:30 glinx kernel: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
Jan 12 19:37:30 glinx kernel: drbd dbserver: susp( 0 -> 1 )
Jan 12 19:37:30 glinx kernel: drbd dbserver: asender terminated
Jan 12 19:37:30 glinx kernel: drbd dbserver: Terminating drbd_a_dbserver
Jan 12 19:38:31 glinx kernel: block drbd1: helper command: /sbin/drbdadm local-io-error minor-1 exit code 20 (0x1400)
Jan 12 19:38:31 glinx kernel: block drbd1: bitmap WRITE of 0 pages took 0 jiffies
Jan 12 19:38:31 glinx kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 12 19:38:31 glinx kernel: block drbd1: disk( Failed -> Diskless )
Jan 12 19:38:31 glinx kernel: block drbd1: helper command: /sbin/drbdadm pri-on-incon-degr minor-1
Jan 12 19:38:31 glinx kernel: block drbd1: helper command: /sbin/drbdadm pri-on-incon-degr minor-1 exit code 0 (0x0)
Jan 12 19:38:31 glinx kernel: drbd dbserver: Connection closed
Jan 12 19:38:31 glinx kernel: drbd dbserver: out of mem, failed to invoke fence-peer helper
Jan 12 19:38:31 glinx kernel: drbd dbserver: conn( Timeout -> Unconnected )
Jan 12 19:38:31 glinx kernel: drbd dbserver: receiver terminated
Jan 12 19:38:31 glinx kernel: drbd dbserver: Restarting receiver thread
Jan 12 19:38:31 glinx kernel: drbd dbserver: receiver (re)started
Jan 12 19:38:31 glinx kernel: drbd dbserver: conn( Unconnected -> WFConnection )
Jan 12 19:38:32 glinx kernel: drbd dbserver: Handshake successful: Agreed network protocol version 101
Jan 12 19:38:32 glinx kernel: drbd dbserver: Agreed to support TRIM on protocol level
Jan 12 19:38:32 glinx kernel: drbd dbserver: conn( WFConnection -> WFReportParams )
Jan 12 19:38:32 glinx kernel: drbd dbserver: Starting asender thread (from drbd_r_dbserver [32682])
Jan 12 19:38:32 glinx kernel: block drbd1: receiver updated UUIDs to effective data uuid: 43734559ED5920C0
Jan 12 19:38:32 glinx kernel: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
Jan 12 19:38:32 glinx kernel: block drbd1: Should have called drbd_al_complete_io(, 29734872, 4096), but my Disk seems to have failed :(
Jan 12 19:38:32 glinx kernel: drbd dbserver: susp( 1 -> 0 )

Again, the same sequence of events: the local disk times out, and one minute later the remote times out. After reattaching, the local device thinks it is UpToDate:

Jan 12 19:53:17 glinx kernel: block drbd1: disk( Diskless -> Attaching )
Jan 12 19:53:17 glinx kernel: block drbd1: recounting of set bits took additional 4 jiffies
Jan 12 19:53:17 glinx kernel: block drbd1: 4948 MB (1266688 bits) marked out-of-sync by on disk bit-map.
Jan 12 19:53:17 glinx kernel: block drbd1: disk( Attaching -> Negotiating )
Jan 12 19:53:17 glinx kernel: block drbd1: attached to UUIDs 43734559ED5920C1:0000000000000000:47C0D52F0FB8F92A:47BFD52F0FB8F92A
Jan 12 19:53:18 glinx kernel: block drbd1: drbd_sync_handshake:
Jan 12 19:53:18 glinx kernel: block drbd1: self 43734559ED5920C1:0000000000000000:47C0D52F0FB8F92A:47BFD52F0FB8F92A bits:1266688 flags:0
Jan 12 19:53:18 glinx kernel: block drbd1: peer 43734559ED5920C0:0000000000000000:47C0D52F0FB8F92A:47BFD52F0FB8F92A bits:235 flags:0
Jan 12 19:53:18 glinx kernel: block drbd1: uuid_compare()=1 by rule 40
Jan 12 19:53:18 glinx kernel: block drbd1: conn( Connected -> WFBitMapS ) disk( Negotiating -> UpToDate ) pdsk( UpToDate -> Consistent )
Jan 12 19:53:18 glinx kernel: block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 3752(1), total 3752; compression: 100.0%
Jan 12 19:53:18 glinx kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 3752(1), total 3752; compression: 100.0%
Jan 12 19:53:18 glinx kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1
Jan 12 19:53:18 glinx kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0)
Jan 12 19:53:18 glinx kernel: block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Jan 12 19:53:18 glinx kernel: block drbd1: Began resync as SyncSource (will sync 5066752 KB [1266688 bits set]).
SyncSource - how can it be SyncSource if the disk was not attached for 15 minutes? Note that this setup runs protocol B on local gigabit, with no pull-ahead. This is far too common to be a coincidence and, unless I am making a big error somewhere, it should not be happening.

Also note that all the helpers are left in some weird state (this was captured at 16:57):

root  103848  0.0  0.0  11336  612 ?  S  10:21  0:00 /bin/bash /usr/lib/drbd/notify-io-error.sh
root  103849  0.0  0.0   4060  616 ?  S  10:21  0:00 logger -t notify-io-error.sh[103847] -p local5.info
root  103931  0.0  0.0  11336  612 ?  S  10:23  0:00 /bin/bash /usr/lib/drbd/notify-pri-on-incon-degr.sh
root  103932  0.0  0.0   4060  616 ?  S  10:23  0:00 logger -t notify-pri-on-incon-degr.sh[103930] -p local5.info

Configuration from the first (WAN) cluster setup:

common {
    handlers {
        pri-on-incon-degr /usr/lib/drbd/notify-pri-on-incon-degr.sh;
        pri-lost-after-sb /usr/lib/drbd/notify-pri-lost-after-sb.sh;
        local-io-error    "/usr/lib/drbd/notify-io-error.sh; drbdadm detach $DRBD_RESOURCE";
        out-of-sync       "/etc/scripts/drbd-verify.sh out-of-sync";
        split-brain       "/usr/lib/drbd/notify-split-brain.sh root";
    }
    startup {
        degr-wfc-timeout     30;
        outdated-wfc-timeout 30;
        wfc-timeout          30;
    }
    options {
        on-no-data-accessible suspend-io;
    }
    disk {
        on-io-error   call-local-io-error;
        disk-timeout  900;    # in units of 0.1 s, i.e. 90 seconds
        c-plan-ahead  20;
        c-fill-target 3M;
        c-min-rate    1M;
        c-max-rate    25M;
    }
    net {
        protocol A;
        ko-count 5;           # peer declared dead after ko-count * timeout
        on-congestion      pull-ahead;
        congestion-fill    7000K;
        congestion-extents 127;
        use-rle    yes;
        verify-alg sha1;
        csums-alg  sha1;
    }
}

resource kvmimages {
    on primary {
        device    /dev/drbd2;
        disk      /dev/md22;
        address   172.16.10.1:7790;
        meta-disk internal;
    }
    on secondary {
        device    /dev/drbd2;
        disk      /dev/md6;
        address   172.16.10.6:7790;
        meta-disk internal;
    }
    net {
        sndbuf-size 10M;
        rcvbuf-size 512k;
    }
}

Since problems like this never happened on 8.3 (a local disk failing without any obvious reason), I cannot say how 8.3 would react in a similar situation (diskless on primary). But still: if DRBD is running diskless on the primary, it should not overwrite the secondary's data once its own disk is back online. Surely the reattached disk is not supposed to be invalidated manually before reattaching (as in the sketch above)?

Also, I have not really checked why the handlers are left running as processes, as I have simply reverted to 8.3.16 for now.

Regards,
Saso Slavicic