[DRBD-user] DRBD on top of mdraid troubles

Athanasios Chatziathanassiou tchatzi at arx.net
Wed Mar 15 14:16:20 CET 2023


Hello,

Long-time DRBD 8.4 user here; I thought I'd give DRBD 9.2 a try.
The setup is a typical active-passive two-node cluster, with an all-flash,
six-device mdraid RAID 10 array as the lower-level storage and 10Gb
ethernet between the nodes.
My problem is that DRBD appears to randomly detach the lower-level
storage on the secondary node.

Below is the kernel log from such a typical case (1).
After adding ``disable-write-same yes;'' to the resource, the initial
sync completed and I thought I was done, but eventually I still find the
secondary Diskless, this time without any serious warnings (2).
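
For reference, this is roughly where I put the option in the resource
file (resource name and device as in my setup; the md device path and
layout comments are just illustrative, and as far as I understand the
option belongs in the disk section):

---8<---
resource raid10_ssd {
    device    /dev/drbd1;
    disk      /dev/md0;           # six-device all-flash mdraid RAID 10
    meta-disk internal;

    disk {
        disable-write-same yes;   # workaround for the WRITE SAME failures
    }
    ...
}
---8<---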

It appears to happen only on the secondary, regardless of which node 
that is. The md array appears totally healthy the whole time (also 
checked with ``echo check > /sys/block/md0/md/sync_action'', which 
completed without issues).
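
To be concrete, the health checks I ran on the backing array were along
these lines (md0 as in my setup; the mismatch_cnt check is an extra step
beyond what I mentioned above):

---8<---
echo check > /sys/block/md0/md/sync_action   # start a full scrub
cat /proc/mdstat                             # watch check progress
cat /sys/block/md0/md/mismatch_cnt           # should stay 0 on an idle, healthy array
---8<---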

``drbdadm attach raid10_ssd''
always works, and the disk ends up UpToDate, until it becomes Diskless 
again later, apparently regardless of DRBD/device activity.
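
For completeness, this is the sort of monitoring I can leave running to
catch the exact moment of the detach (command names per drbd-utils 9;
the controller driver is mpt2sas for the H310 on my kernel, it may be
mpt3sas on newer ones):

---8<---
drbdsetup events2 --timestamps --statistics raid10_ssd \
    | tee /var/log/drbd-events.log &
dmesg --follow | grep -iE 'drbd|md0|mpt2sas' &
---8<---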

I've tried various kernels, currently stock 5.15.102 with the 
drbd-9.2.2 tarball from LINBIT's site, none of which made any real 
difference. The hardware also appears to have no issues when tested 
individually. The nodes are pretty standard Dell R420s with H310 Mini 
SAS controllers. The controllers did make me somewhat suspicious, so I 
flashed them with stock IT-mode LSI/Avago firmware, but that didn't 
make any difference either.

Any ideas where to look next?

Best Regards,
Thanos Chatziathanassiou

(1)
---8<---
node1 primary:
drbd raid10_ssd node2: conn( Unconnected -> Connecting )
drbd raid10_ssd node2: Handshake to peer 1 successful: Agreed network 
protocol version 121
drbd raid10_ssd node2: Feature flags enabled on protocol level: 0x3f 
TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
drbd raid10_ssd node2: Peer authenticated using 20 bytes HMAC
drbd raid10_ssd: Preparing cluster-wide state change 247878758 (0->1 
499/145)
drbd raid10_ssd/0 drbd1 node2: drbd_sync_handshake:
drbd raid10_ssd/0 drbd1 node2: self 
5E9660F0616814EC:5E0F3ECFB8B9124F:94F3C3D0280CC69C:3B9001DA9BF94284 
bits:1720597989 flags:120
drbd raid10_ssd/0 drbd1 node2: peer 
5E0F3ECFB8B9124E:0000000000000000:94F3C3D0280CC69C:5CBB72884848CF5C 
bits:1720607970 flags:1024
drbd raid10_ssd/0 drbd1 node2: uuid_compare()=source-use-bitmap by 
rule=bitmap-self
drbd raid10_ssd: State change 247878758: primary_nodes=1, 
weak_nodes=FFFFFFFFFFFFFFFC
drbd raid10_ssd: Committing cluster-wide state change 247878758 (128ms)
drbd raid10_ssd node2: conn( Connecting -> Connected ) peer( Unknown -> 
Secondary )
drbd raid10_ssd/0 drbd1 node2: pdsk( DUnknown -> Inconsistent ) repl( 
Off -> WFBitMapS )
drbd raid10_ssd/0 drbd1 node2: send bitmap stats [Bytes(packets)]: plain 
0(0), RLE 60(1), total 60; compression: 100.0%
drbd raid10_ssd/0 drbd1 node2: receive bitmap stats [Bytes(packets)]: 
plain 0(0), RLE 57(1), total 57; compression: 100.0%
drbd raid10_ssd/0 drbd1 node2: helper command: /sbin/drbdadm 
before-resync-source
drbd raid10_ssd/0 drbd1 node2: helper command: /sbin/drbdadm 
before-resync-source exit code 0
drbd raid10_ssd/0 drbd1 node2: repl( WFBitMapS -> SyncSource )
drbd raid10_ssd/0 drbd1 node2: Began resync as SyncSource (will sync 
6882431920 KB [1720607980 bits set]).
drbd raid10_ssd/0 drbd1 node2: ASSERTION __dec_rs_pending(peer_device) 
 >= 0 FAILED in got_BlockAck
drbd raid10_ssd/0 drbd1 node2: ASSERTION __dec_rs_pending(peer_device) 
 >= 0 FAILED in got_BlockAck

node2:
drbd raid10_ssd node1: Preparing remote state change 247878758
drbd raid10_ssd/0 drbd1 node1: drbd_sync_handshake:
drbd raid10_ssd/0 drbd1 node1: self 
5E0F3ECFB8B9124E:0000000000000000:94F3C3D0280CC69C:5CBB72884848CF5C 
bits:1720607970 flags:24
drbd raid10_ssd/0 drbd1 node1: peer 
5E9660F0616814EC:5E0F3ECFB8B9124F:94F3C3D0280CC69C:3B9001DA9BF94284 
bits:1720597989 flags:1120
drbd raid10_ssd/0 drbd1 node1: uuid_compare()=target-use-bitmap by 
rule=bitmap-peer
drbd raid10_ssd node1: Committing remote state change 247878758 
(primary_nodes=1)
drbd raid10_ssd node1: conn( Connecting -> Connected ) peer( Unknown -> 
Primary )
drbd raid10_ssd/0 drbd1 node1: pdsk( DUnknown -> UpToDate ) repl( Off -> 
WFBitMapT )
drbd raid10_ssd/0 drbd1 node1: receive bitmap stats [Bytes(packets)]: 
plain 0(0), RLE 60(1), total 60; compression: 100.0%
drbd raid10_ssd/0 drbd1 node1: send bitmap stats [Bytes(packets)]: plain 
0(0), RLE 57(1), total 57; compression: 100.0%
drbd raid10_ssd/0 drbd1 node1: helper command: /sbin/drbdadm 
before-resync-target
drbd raid10_ssd/0 drbd1 node1: helper command: /sbin/drbdadm 
before-resync-target exit code 0
drbd raid10_ssd/0 drbd1 node1: repl( WFBitMapT -> SyncTarget )
drbd raid10_ssd/0 drbd1 node1: Began resync as SyncTarget (will sync 
6882431920 KB [1720607980 bits set]).
drbd raid10_ssd/0 drbd1: disk( Inconsistent -> Failed )
drbd raid10_ssd/0 drbd1 node1: repl( SyncTarget -> Established )
drbd raid10_ssd/0 drbd1: Local IO failed in drbd_endio_write_sec_final. 
Detaching...
---8<---

(2)
---8<---
node1:
drbd raid10_ssd/0 drbd1: disk( Attaching -> Negotiating )
drbd raid10_ssd/0 drbd1: attached to current UUID: D425F4E26AD4E468
drbd raid10_ssd/0 drbd1 node2: drbd_sync_handshake:
drbd raid10_ssd/0 drbd1 node2: self 
D425F4E26AD4E468:0000000000000000:55F0052868F7EB1E:63CE2305E9B88A18 
bits:0 flags:0
drbd raid10_ssd/0 drbd1 node2: peer 
04699142243419D5:D425F4E26AD4E469:D38998C94E149B92:7BEE5673239AA93E 
bits:239639 flags:1120
drbd raid10_ssd/0 drbd1 node2: uuid_compare()=target-use-bitmap by 
rule=bitmap-peer
drbd raid10_ssd/0 drbd1: disk( Negotiating -> Inconsistent )
drbd raid10_ssd/0 drbd1 node2: repl( Established -> WFBitMapT )
drbd raid10_ssd/0 drbd1 node2: receive bitmap stats [Bytes(packets)]: 
plain 0(0), RLE 7602(2), total 7602; compression: 100.0%
drbd raid10_ssd/0 drbd1 node2: send bitmap stats [Bytes(packets)]: plain 
0(0), RLE 7611(2), total 7611; compression: 100.0%
drbd raid10_ssd/0 drbd1 node2: helper command: /sbin/drbdadm 
before-resync-target
drbd raid10_ssd/0 drbd1 node2: helper command: /sbin/drbdadm 
before-resync-target exit code 0
drbd raid10_ssd/0 drbd1 node2: repl( WFBitMapT -> SyncTarget )
drbd raid10_ssd/0 drbd1 node2: Began resync as SyncTarget (will sync 
970840 KB [242710 bits set]).
drbd raid10_ssd/0 drbd1 node2: Resync done (total 57 sec; paused 0 sec; 
17352 K/sec)
drbd raid10_ssd/0 drbd1 node2: updated UUIDs 
04699142243419D4:0000000000000000:55F0052868F7EB1E:63CE2305E9B88A18
drbd raid10_ssd/0 drbd1: disk( Inconsistent -> UpToDate )
drbd raid10_ssd/0 drbd1 node2: repl( SyncTarget -> Established )
drbd raid10_ssd/0 drbd1 node2: helper command: /sbin/drbdadm 
after-resync-target
drbd raid10_ssd/0 drbd1 node2: helper command: /sbin/drbdadm 
after-resync-target exit code 0
drbd raid10_ssd/0 drbd1: disk( UpToDate -> Failed )
drbd raid10_ssd/0 drbd1: Local IO failed in drbd_endio_write_sec_final. 
Detaching...
drbd raid10_ssd/0 drbd1: disk( Failed -> Diskless )
drbd raid10_ssd/0 drbd1: receiver updated UUIDs to exposed data uuid: 
59A2664E6A0DAC91

node2 primary:
drbd raid10_ssd node1: Preparing remote state change 171546960
drbd raid10_ssd node1: Committing remote state change 171546960 
(primary_nodes=2)
drbd raid10_ssd/0 drbd1 node1: pdsk( Diskless -> Negotiating )
drbd raid10_ssd/0 drbd1 node1: real peer disk state = Inconsistent
drbd raid10_ssd/0 drbd1 node1: drbd_sync_handshake:
drbd raid10_ssd/0 drbd1 node1: self 
04699142243419D5:D425F4E26AD4E469:D38998C94E149B92:7BEE5673239AA93E 
bits:239639 flags:120
drbd raid10_ssd/0 drbd1 node1: peer 
D425F4E26AD4E468:0000000000000000:55F0052868F7EB1E:63CE2305E9B88A18 
bits:3072 flags:1004
drbd raid10_ssd/0 drbd1 node1: uuid_compare()=source-use-bitmap by 
rule=bitmap-self
drbd raid10_ssd/0 drbd1 node1: pdsk( Negotiating -> Inconsistent ) repl( 
Established -> WFBitMapS )
drbd raid10_ssd/0 drbd1 node1: send bitmap stats [Bytes(packets)]: plain 
0(0), RLE 7602(2), total 7602; compression: 100.0%
drbd raid10_ssd/0 drbd1 node1: receive bitmap stats [Bytes(packets)]: 
plain 0(0), RLE 7611(2), total 7611; compression: 100.0%
drbd raid10_ssd/0 drbd1 node1: helper command: /sbin/drbdadm 
before-resync-source
drbd raid10_ssd/0 drbd1 node1: helper command: /sbin/drbdadm 
before-resync-source exit code 0
drbd raid10_ssd/0 drbd1 node1: repl( WFBitMapS -> SyncSource )
drbd raid10_ssd/0 drbd1 node1: Began resync as SyncSource (will sync 
989192 KB [247298 bits set]).
drbd raid10_ssd/0 drbd1 node1: updated UUIDs 
04699142243419D5:0000000000000000:D425F4E26AD4E468:D38998C94E149B92
drbd raid10_ssd/0 drbd1 node1: Resync done (total 58 sec; paused 0 sec; 
17052 K/sec)
drbd raid10_ssd/0 drbd1 node1: pdsk( Inconsistent -> UpToDate ) repl( 
SyncSource -> Established )
drbd raid10_ssd/0 drbd1 node1: pdsk( UpToDate -> Failed )
drbd raid10_ssd/0 drbd1: new current UUID: 59A2664E6A0DAC91 weak: 
FFFFFFFFFFFFFFFD
drbd raid10_ssd/0 drbd1 node1: pdsk( Failed -> Diskless )
---8<---
