Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,

We have been using DRBD for about 4 years now, so I have some experience with it. Today was the first time that DRBD actually caused data loss.

We mostly use DRBD in cases where older hardware is reused (upgraded with HDDs) and serves as hot storage in case of primary node failure. This also means a lot of use of md arrays, as usually only the primary has a suitable RAID controller with BBU. I understand the shortcomings of such setups and have always managed to achieve suitable performance. Nothing is perfect and you need to work with what you have. Most of the setups still run 8.3 (with CentOS 6); I have been testing 8.4 on some nodes, and that is where the problem lies.

One of the setups replicates over WAN with protocol A (no proxy) - again, not perfect, but you do what you have to do. The system had been running stable since 2012. I have been using the disk-timeout option (set to 900) and ko-count (5). Again, this is not perfect (I read the documentation), but we have been bitten a few times when a bad sector on a disk rendered the entire cluster unresponsive. That is unacceptable: having some mechanism to detach local storage in case of problems is one of the reasons for building a cluster in the first place.

The problem starts at approx. 10:21:

Jan 16 10:21:33 primary kernel: block drbd2: Local backing device failed to meet the disk-timeout
Jan 16 10:21:33 primary kernel: block drbd2: disk( UpToDate -> Failed )
Jan 16 10:21:33 primary kernel: block drbd2: Local IO failed in request_timer_fn. Detaching...
Jan 16 10:21:33 primary kernel: block drbd2: local WRITE IO error sector 111041344+8 on md22
Jan 16 10:21:33 primary kernel: block drbd2: helper command: /sbin/drbdadm local-io-error minor-2

The local device is timing out after 90 s (!), without any I/O-related errors in the kernel log. The disks were recently upgraded to SSDs, so bad blocks are not really plausible!

Jan 16 10:22:33 primary kernel: block drbd2: Remote failed to finish a request within ko-count * timeout
Jan 16 10:22:33 primary kernel: block drbd2: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
Jan 16 10:22:33 primary kernel: drbd kvmimages: susp( 0 -> 1 )
Jan 16 10:22:33 primary kernel: drbd kvmimages: asender terminated
Jan 16 10:22:33 primary kernel: drbd kvmimages: Terminating drbd_a_kvmimages

Exactly one minute later (!), the remote stops responding (both fail at the same time?). As suspend-io is configured (on-no-data-accessible), all I/O is suspended. OK, unusual, but still acceptable so far.

Jan 16 10:23:34 primary kernel: block drbd2: helper command: /sbin/drbdadm local-io-error minor-2 exit code 20 (0x1400)
Jan 16 10:23:34 primary kernel: block drbd2: bitmap WRITE of 0 pages took 0 jiffies
Jan 16 10:23:34 primary kernel: block drbd2: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 16 10:23:34 primary kernel: block drbd2: disk( Failed -> Diskless )
Jan 16 10:23:34 primary kernel: block drbd2: helper command: /sbin/drbdadm pri-on-incon-degr minor-2
Jan 16 10:23:34 primary kernel: block drbd2: helper command: /sbin/drbdadm pri-on-incon-degr minor-2 exit code 0 (0x0)
Jan 16 10:23:34 primary kernel: drbd kvmimages: Connection closed
Jan 16 10:23:34 primary kernel: drbd kvmimages: out of mem, failed to invoke fence-peer helper
Jan 16 10:23:34 primary kernel: drbd kvmimages: conn( Timeout -> Unconnected )
Jan 16 10:23:34 primary kernel: drbd kvmimages: receiver terminated
Jan 16 10:23:34 primary kernel: drbd kvmimages: Restarting receiver thread
Jan 16 10:23:34 primary kernel: drbd kvmimages: receiver (re)started
Jan 16 10:23:34 primary kernel: drbd kvmimages: conn( Unconnected -> WFConnection )
Jan 16 10:23:35 primary kernel: drbd kvmimages: Handshake successful: Agreed network protocol version 101
Jan 16 10:23:35 primary kernel: drbd kvmimages: Agreed to support TRIM on protocol level
Jan 16 10:23:35 primary kernel: drbd kvmimages: conn( WFConnection -> WFReportParams )
Jan 16 10:23:35 primary kernel: drbd kvmimages: Starting asender thread (from drbd_r_kvmimages [42146])
Jan 16 10:23:35 primary kernel: block drbd2: receiver updated UUIDs to effective data uuid: C7F11A2B6505F460
Jan 16 10:23:35 primary kernel: block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
Jan 16 10:23:35 primary kernel: block drbd2: Should have called drbd_al_complete_io(, 111041344, 4096), but my Disk seems to have failed :(
Jan 16 10:23:35 primary kernel: drbd kvmimages: susp( 1 -> 0 )

Helpers were called, the remote reconnected, and the suspend was lifted. At this point the primary is running Diskless.

Now here comes the fun part: the admin (me) notices this after a couple of hours (I was away) and re-attaches the disk on the primary node (as per the documentation: "If using internal meta data, it is sufficient to bind the DRBD device to the new hard disk."). Here is where everything goes south:

Jan 16 12:22:03 primary kernel: block drbd2: disk( Diskless -> Attaching )
Jan 16 12:22:03 primary kernel: block drbd2: recounting of set bits took additional 3 jiffies
Jan 16 12:22:03 primary kernel: block drbd2: 4948 MB (1266688 bits) marked out-of-sync by on disk bit-map.
Jan 16 12:22:03 primary kernel: block drbd2: disk( Attaching -> Negotiating )
Jan 16 12:22:03 primary kernel: block drbd2: attached to UUIDs C7F11A2B6505F461:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AB
Jan 16 12:22:03 primary kernel: block drbd2: drbd_sync_handshake:
Jan 16 12:22:03 primary kernel: block drbd2: self C7F11A2B6505F461:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AB bits:1266688 flags:0
Jan 16 12:22:03 primary kernel: block drbd2: peer C7F11A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA bits:53805 flags:0
Jan 16 12:22:03 primary kernel: block drbd2: uuid_compare()=1 by rule 40
Jan 16 12:22:03 primary kernel: block drbd2: conn( Connected -> WFBitMapS ) disk( Negotiating -> UpToDate ) pdsk( UpToDate -> Consistent )

Why is the primary's disk now UpToDate, when it was offline for two hours? It should be Inconsistent and resynced from the secondary.
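(In hindsight, the only way I can see to get that behaviour would have been to invalidate the local data by hand around the attach. A minimal sketch of what I mean, using the resource name from the config below - not something I have verified on 8.4:

# on the primary, which is still Diskless at this point
drbdadm attach kvmimages       # re-bind the backing device, as per the docs
drbdadm invalidate kvmimages   # explicitly mark the local disk Inconsistent,
                               # so this node resyncs as SyncTarget from the peer

But needing to do that by hand would defeat the documented attach procedure - more on that at the end.)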
To make matters worse, pull-ahead was triggered at the same time (remember, we are replicating over WAN), since 5 GB had been marked out-of-sync by declaring the disk on the primary UpToDate:

Jan 16 12:22:03 primary kernel: block drbd2: conn( Connected -> WFBitMapS ) disk( Negotiating -> UpToDate ) pdsk( UpToDate -> Consistent )
Jan 16 12:22:03 primary kernel: block drbd2: Congestion-fill threshold reached
Jan 16 12:22:03 primary kernel: block drbd2: conn( WFBitMapS -> Ahead )
Jan 16 12:22:03 primary kernel: block drbd2: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 4846(2), total 4846; compression: 99.9%

The disks now stay in this state; the sync does not actually start:

[root@primary]# cat /proc/drbd
version: 8.4.5 (api:1/proto:86-101)
GIT-hash: 1d360bde0e095d495786eaeb2a1ac76888e4db96 build by phil@Build64R6, 2014-10-28 10:32:53
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
    ns:48488816 nr:0 dw:106632540 dr:213443889 al:24231 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Ahead ro:Primary/Secondary ds:UpToDate/Consistent A r-----
    ns:0 nr:0 dw:564028 dr:163202 al:52275 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:5190228
 3: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
    ns:13699098 nr:0 dw:34361764 dr:248373080 al:3374 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

[root@secondary]# cat /proc/drbd
version: 8.4.5 (api:1/proto:86-101)
GIT-hash: 1d360bde0e095d495786eaeb2a1ac76888e4db96 build by phil@Build64R6, 2014-10-28 10:32:53
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
    ns:0 nr:49240716 dw:61104624 dr:176250476 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Behind ro:Secondary/Primary ds:Outdated/UpToDate A r-----
    ns:53230 nr:46356310 dw:48639878 dr:55173318 al:886 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:5241020
 3: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
    ns:0 nr:21494090 dw:21494090 dr:195651772 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

With both nodes certain that the primary is UpToDate!
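(With verify-alg sha1 configured - see the config at the end - an online verify would at least have shown how much data really differed between the two nodes at this point. A sketch of the invocation; I did not actually try it while the connection sat in Ahead/Behind, and I would expect it to require a Connected state:

drbdadm verify kvmimages    # peers compare block checksums (sha1, as configured);
                            # differing blocks are counted as oos: in /proc/drbd
                            # and reported through the out-of-sync handler
)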
Recognizing the problem, I started shutting down services and copying all the data (whatever state it was in) off the DRBD partitions. Somewhere along the copying - maybe the WAN link failed, or something else happened - DRBD started resyncing:

Jan 16 16:36:16 primary kernel: block drbd2: new current UUID A30F012BA25C9803:C7F11A2B6505F461:9C59495AB871DD6E:2DCB0BB166FC17AB
Jan 16 16:36:38 primary kernel: block drbd2: drbd_sync_handshake:
Jan 16 16:36:38 primary kernel: block drbd2: self A30F012BA25C9803:C7F11A2B6505F461:9C59495AB871DD6E:2DCB0BB166FC17AB bits:1299480 flags:0
Jan 16 16:36:38 primary kernel: block drbd2: peer C7F11A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA bits:1312123 flags:0
Jan 16 16:36:38 primary kernel: block drbd2: uuid_compare()=1 by rule 70
Jan 16 16:36:38 primary kernel: block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
Jan 16 16:36:38 primary kernel: block drbd2: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 6711(2), total 6711; compression: 99.8%
Jan 16 16:36:39 primary kernel: block drbd2: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 7015(2), total 7015; compression: 99.8%
Jan 16 16:36:39 primary kernel: block drbd2: helper command: /sbin/drbdadm before-resync-source minor-2
Jan 16 16:36:39 primary kernel: block drbd2: helper command: /sbin/drbdadm before-resync-source minor-2 exit code 0 (0x0)
Jan 16 16:36:39 primary kernel: block drbd2: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Jan 16 16:36:39 primary kernel: block drbd2: Began resync as SyncSource (will sync 5248492 KB [1312123 bits set]).
Jan 16 16:36:39 primary kernel: block drbd2: updated sync UUID A30F012BA25C9803:C7F21A2B6505F461:C7F11A2B6505F461:9C59495AB871DD6E
Jan 16 16:41:29 primary kernel: block drbd2: Resync done (total 289 sec; paused 0 sec; 18160 K/sec)
Jan 16 16:41:29 primary kernel: block drbd2: 69 % had equal checksums, eliminated: 3672368K; transferred 1576124K total 5248492K
Jan 16 16:41:29 primary kernel: block drbd2: updated UUIDs A30F012BA25C9803:0000000000000000:C7F21A2B6505F461:C7F11A2B6505F461
Jan 16 16:41:29 primary kernel: block drbd2: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )

...and overwrote approx. 1.5 GB of data on the secondary with (probably stale) data from the primary, plus whatever new data had been written to the primary in between. There is nothing in the secondary's logs to indicate any disk problems around 10:22:

Jan 16 10:22:33 secondary kernel: drbd kvmimages: sock was shut down by peer
Jan 16 10:22:33 secondary kernel: drbd kvmimages: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Jan 16 10:22:33 secondary kernel: drbd kvmimages: short read (expected size 16)
Jan 16 10:22:33 secondary kernel: drbd kvmimages: meta connection shut down by peer.
Jan 16 10:22:33 secondary kernel: drbd kvmimages: asender terminated
Jan 16 10:22:33 secondary kernel: drbd kvmimages: Terminating drbd_a_kvmimages
Jan 16 10:22:33 secondary kernel: drbd kvmimages: Connection closed
Jan 16 10:22:33 secondary kernel: drbd kvmimages: conn( BrokenPipe -> Unconnected )
Jan 16 10:22:33 secondary kernel: drbd kvmimages: receiver terminated
Jan 16 10:22:34 secondary kernel: drbd kvmimages: Restarting receiver thread
Jan 16 10:22:34 secondary kernel: drbd kvmimages: receiver (re)started
Jan 16 10:22:34 secondary kernel: drbd kvmimages: conn( Unconnected -> WFConnection )
Jan 16 10:23:35 secondary kernel: drbd kvmimages: Handshake successful: Agreed network protocol version 101
Jan 16 10:23:35 secondary kernel: drbd kvmimages: Agreed to support TRIM on protocol level
Jan 16 10:23:35 secondary kernel: drbd kvmimages: conn( WFConnection -> WFReportParams )
Jan 16 10:23:35 secondary kernel: drbd kvmimages: Starting asender thread (from drbd_r_kvmimages [11935])
Jan 16 10:23:35 secondary kernel: block drbd2: peer( Unknown -> Primary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Diskless )

As far as the secondary is concerned, the primary simply terminated the connection!? After the primary was reattached:

Jan 16 12:22:03 secondary kernel: block drbd2: real peer disk state = Consistent
Jan 16 12:22:03 secondary kernel: block drbd2: drbd_sync_handshake:
Jan 16 12:22:03 secondary kernel: block drbd2: self C7F11A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA bits:0 flags:0
Jan 16 12:22:03 secondary kernel: block drbd2: peer C7F11A2B6505F461:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AB bits:1266688 flags:2
Jan 16 12:22:03 secondary kernel: block drbd2: uuid_compare()=-1 by rule 40
Jan 16 12:22:03 secondary kernel: block drbd2: conn( Connected -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( Diskless -> UpToDate )
Jan 16 12:22:04 secondary kernel: block drbd2: conn( WFBitMapT -> Behind )
Jan 16 12:22:04 secondary kernel: block drbd2: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 4846(2), total 4846; compression: 99.9%
Jan 16 12:22:04 secondary kernel: block drbd2: unexpected cstate (Behind) in receive_bitmap

And then it happily overwrote its own data:

Jan 16 16:36:38 secondary kernel: block drbd2: drbd_sync_handshake:
Jan 16 16:36:38 secondary kernel: block drbd2: self C7F11A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA bits:1312123 flags:0
Jan 16 16:36:38 secondary kernel: block drbd2: peer A30F012BA25C9803:C7F11A2B6505F461:9C59495AB871DD6E:2DCB0BB166FC17AB bits:1299480 flags:2
Jan 16 16:36:38 secondary kernel: block drbd2: uuid_compare()=-1 by rule 50
Jan 16 16:36:38 secondary kernel: block drbd2: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Jan 16 16:36:39 secondary kernel: block drbd2: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 6711(2), total 6711; compression: 99.8%
Jan 16 16:36:39 secondary kernel: block drbd2: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 7015(2), total 7015; compression: 99.8%
Jan 16 16:36:39 secondary kernel: block drbd2: conn( WFBitMapT -> WFSyncUUID )
Jan 16 16:36:39 secondary kernel: block drbd2: updated sync uuid C7F21A2B6505F460:0000000000000000:9C59495AB871DD6E:2DCB0BB166FC17AA
Jan 16 16:36:39 secondary kernel: block drbd2: helper command: /sbin/drbdadm before-resync-target minor-2
Jan 16 16:36:39 secondary kernel: block drbd2: helper command: /sbin/drbdadm before-resync-target minor-2 exit code 0 (0x0)
Jan 16 16:36:39 secondary kernel: block drbd2: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
Jan 16 16:36:39 secondary kernel: block drbd2: Began resync as SyncTarget (will sync 5248492 KB [1312123 bits set]).
Jan 16 16:41:29 secondary kernel: block drbd2: Resync done (total 289 sec; paused 0 sec; 18160 K/sec)
Jan 16 16:41:29 secondary kernel: block drbd2: 69 % had equal checksums, eliminated: 3672368K; transferred 1576124K total 5248492K
Jan 16 16:41:29 secondary kernel: block drbd2: updated UUIDs A30F012BA25C9802:0000000000000000:C7F21A2B6505F460:C7F11A2B6505F461
Jan 16 16:41:29 secondary kernel: block drbd2: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Jan 16 16:41:29 secondary kernel: block drbd2: helper command: /sbin/drbdadm after-resync-target minor-2
Jan 16 16:41:29 secondary kernel: block drbd2: helper command: /sbin/drbdadm after-resync-target minor-2 exit code 0 (0x0)

Now, sure, the documentation states that disk-timeout is dangerous. I accept that - "buy better hardware" - but the point here is that there does not seem to be anything wrong with the hardware. And even if there were, DRBD should be able to recognize where the UpToDate data is - that is the whole point.

The most interesting part is that this was actually the second occurrence of a very similar problem. The first was on vastly different hardware (a PERC H710 with BBU); there is absolutely no way it would time out:

Jan 12 19:36:30 glinx kernel: block drbd1: Local backing device failed to meet the disk-timeout
Jan 12 19:36:30 glinx kernel: block drbd1: disk( UpToDate -> Failed )
Jan 12 19:36:30 glinx kernel: block drbd1: Local IO failed in request_timer_fn. Detaching...
Jan 12 19:36:30 glinx kernel: block drbd1: helper command: /sbin/drbdadm local-io-error minor-1
Jan 12 19:37:30 glinx kernel: block drbd1: Remote failed to finish a request within ko-count * timeout
Jan 12 19:37:30 glinx kernel: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
Jan 12 19:37:30 glinx kernel: drbd dbserver: susp( 0 -> 1 )
Jan 12 19:37:30 glinx kernel: drbd dbserver: asender terminated
Jan 12 19:37:30 glinx kernel: drbd dbserver: Terminating drbd_a_dbserver
Jan 12 19:38:31 glinx kernel: block drbd1: helper command: /sbin/drbdadm local-io-error minor-1 exit code 20 (0x1400)
Jan 12 19:38:31 glinx kernel: block drbd1: bitmap WRITE of 0 pages took 0 jiffies
Jan 12 19:38:31 glinx kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 12 19:38:31 glinx kernel: block drbd1: disk( Failed -> Diskless )
Jan 12 19:38:31 glinx kernel: block drbd1: helper command: /sbin/drbdadm pri-on-incon-degr minor-1
Jan 12 19:38:31 glinx kernel: block drbd1: helper command: /sbin/drbdadm pri-on-incon-degr minor-1 exit code 0 (0x0)
Jan 12 19:38:31 glinx kernel: drbd dbserver: Connection closed
Jan 12 19:38:31 glinx kernel: drbd dbserver: out of mem, failed to invoke fence-peer helper
Jan 12 19:38:31 glinx kernel: drbd dbserver: conn( Timeout -> Unconnected )
Jan 12 19:38:31 glinx kernel: drbd dbserver: receiver terminated
Jan 12 19:38:31 glinx kernel: drbd dbserver: Restarting receiver thread
Jan 12 19:38:31 glinx kernel: drbd dbserver: receiver (re)started
Jan 12 19:38:31 glinx kernel: drbd dbserver: conn( Unconnected -> WFConnection )
Jan 12 19:38:32 glinx kernel: drbd dbserver: Handshake successful: Agreed network protocol version 101
Jan 12 19:38:32 glinx kernel: drbd dbserver: Agreed to support TRIM on protocol level
Jan 12 19:38:32 glinx kernel: drbd dbserver: conn( WFConnection -> WFReportParams )
Jan 12 19:38:32 glinx kernel: drbd dbserver: Starting asender thread (from drbd_r_dbserver [32682])
Jan 12 19:38:32 glinx kernel: block drbd1: receiver updated UUIDs to effective data uuid: 43734559ED5920C0
Jan 12 19:38:32 glinx kernel: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
Jan 12 19:38:32 glinx kernel: block drbd1: Should have called drbd_al_complete_io(, 29734872, 4096), but my Disk seems to have failed :(
Jan 12 19:38:32 glinx kernel: drbd dbserver: susp( 1 -> 0 )

Again, the same sequence of events: the local disk times out, and one minute later the remote times out. After reattaching, the local device thinks it is UpToDate:

Jan 12 19:53:17 glinx kernel: block drbd1: disk( Diskless -> Attaching )
Jan 12 19:53:17 glinx kernel: block drbd1: recounting of set bits took additional 4 jiffies
Jan 12 19:53:17 glinx kernel: block drbd1: 4948 MB (1266688 bits) marked out-of-sync by on disk bit-map.
Jan 12 19:53:17 glinx kernel: block drbd1: disk( Attaching -> Negotiating )
Jan 12 19:53:17 glinx kernel: block drbd1: attached to UUIDs 43734559ED5920C1:0000000000000000:47C0D52F0FB8F92A:47BFD52F0FB8F92A
Jan 12 19:53:18 glinx kernel: block drbd1: drbd_sync_handshake:
Jan 12 19:53:18 glinx kernel: block drbd1: self 43734559ED5920C1:0000000000000000:47C0D52F0FB8F92A:47BFD52F0FB8F92A bits:1266688 flags:0
Jan 12 19:53:18 glinx kernel: block drbd1: peer 43734559ED5920C0:0000000000000000:47C0D52F0FB8F92A:47BFD52F0FB8F92A bits:235 flags:0
Jan 12 19:53:18 glinx kernel: block drbd1: uuid_compare()=1 by rule 40
Jan 12 19:53:18 glinx kernel: block drbd1: conn( Connected -> WFBitMapS ) disk( Negotiating -> UpToDate ) pdsk( UpToDate -> Consistent )
Jan 12 19:53:18 glinx kernel: block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 3752(1), total 3752; compression: 100.0%
Jan 12 19:53:18 glinx kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 3752(1), total 3752; compression: 100.0%
Jan 12 19:53:18 glinx kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1
Jan 12 19:53:18 glinx kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0)
Jan 12 19:53:18 glinx kernel: block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Jan 12 19:53:18 glinx kernel: block drbd1: Began resync as SyncSource (will sync 5066752 KB [1266688 bits set]).
SyncSource - how can it be SyncSource if the disk was not attached for 15 minutes? Note that this setup runs protocol B on local gigabit, with no pull-ahead. This is far too common to be a coincidence and, unless I am making a big error somewhere, it should not be happening.

Also note that all the helpers are left in some weird state (this was captured at 16:57):

root  103848  0.0  0.0  11336  612 ?  S  10:21  0:00 /bin/bash /usr/lib/drbd/notify-io-error.sh
root  103849  0.0  0.0   4060  616 ?  S  10:21  0:00 logger -t notify-io-error.sh[103847] -p local5.info
root  103931  0.0  0.0  11336  612 ?  S  10:23  0:00 /bin/bash /usr/lib/drbd/notify-pri-on-incon-degr.sh
root  103932  0.0  0.0   4060  616 ?  S  10:23  0:00 logger -t notify-pri-on-incon-degr.sh[103930] -p local5.info

Configuration from the first (WAN) cluster setup:

common {
    handlers {
        pri-on-incon-degr /usr/lib/drbd/notify-pri-on-incon-degr.sh;
        pri-lost-after-sb /usr/lib/drbd/notify-pri-lost-after-sb.sh;
        local-io-error    "/usr/lib/drbd/notify-io-error.sh; drbdadm detach $DRBD_RESOURCE";
        out-of-sync       "/etc/scripts/drbd-verify.sh out-of-sync";
        split-brain       "/usr/lib/drbd/notify-split-brain.sh root";
    }
    startup {
        degr-wfc-timeout     30;
        outdated-wfc-timeout 30;
        wfc-timeout          30;
    }
    options {
        on-no-data-accessible suspend-io;
    }
    disk {
        on-io-error   call-local-io-error;
        disk-timeout  900;    # in units of 0.1 s, i.e. 90 seconds
        c-plan-ahead  20;
        c-fill-target 3M;
        c-min-rate    1M;
        c-max-rate    25M;
    }
    net {
        protocol A;
        ko-count 5;           # peer declared dead after ko-count * timeout
        on-congestion      pull-ahead;
        congestion-fill    7000K;
        congestion-extents 127;
        use-rle    yes;
        verify-alg sha1;
        csums-alg  sha1;
    }
}

resource kvmimages {
    on primary {
        device    /dev/drbd2;
        disk      /dev/md22;
        address   172.16.10.1:7790;
        meta-disk internal;
    }
    on secondary {
        device    /dev/drbd2;
        disk      /dev/md6;
        address   172.16.10.6:7790;
        meta-disk internal;
    }
    net {
        sndbuf-size 10M;
        rcvbuf-size 512k;
    }
}

Since problems like this never happened on 8.3 (a local disk failing without any obvious reason), I cannot say how 8.3 would react in a similar situation (diskless on primary). But still: if DRBD is running diskless on the primary, it should not overwrite the secondary's data once its own disk is back online. Surely the reattached disk is not supposed to be invalidated manually before reattaching (as in the sketch above)?

Also, I have not really checked why the handlers are left running as processes, as I have simply reverted to 8.3.16 for now.

Regards,
Saso Slavicic