[DRBD-user] Hang in dstate Negotiating after unrelated data abort

Thu Jul 3 08:52:07 CEST 2014

Hello,

I ran into some undesired behavior from DRBD earlier today -- I'm running 8.4.4 on CentOS 6.5 (kernel 2.6.32-431.17.1.el6.x86_64), with a fairly vanilla two-node, single-primary config replicating over a back-to-back 10GbE link.

The DRBD resource in question holds a backup volume that we rotate a disk from periodically to take offsite, replacing it with the previously-offsite disk and then resyncing to that.  On the node (let's call it node0, and the other node1) that I was swapping the drive into/out of, I ran 'drbdadm detach' on the resource, leaving it temporarily diskless, and then 'drbdadm attach' with the newly-swapped-in disk once it was up and ready.  Given that the freshly-attached backing device would be from an older state of the same resource, I was hoping that DRBD would recognize this and do a bitmap-based resync of only the data that had been modified since that drive was most recently rotated out (I later realized this part of it wasn't likely to work out as I'd hoped).

Instead what happened when I re-attached was that node1 (the peer, not the one the detach/attach was done on) reported observing unrelated data and aborted, while node0 remained stuck in dstate Negotiating, where it appeared happy to remain indefinitely.

Full kernel output from the relevant period of time on node0:

Jul  2 11:11:09 block drbd5: disk( UpToDate -> Failed ) 
Jul  2 11:11:09 block drbd5: bitmap WRITE of 0 pages took 1 jiffies
Jul  2 11:11:09 block drbd5: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jul  2 11:11:09 block drbd5: disk( Failed -> Diskless ) 
Jul  2 11:11:09 block drbd5: receiver updated UUIDs to effective data uuid: 7D883041AC11CDF4
Jul  2 11:11:10 sd 7: 0:6:0: [sdg] Synchronizing SCSI cache
Jul  2 11:11:10 sd 7: 0:6:0: [sdg] Stopping disk
Jul  2 11:12:20 node0 ata18.00: ATA-8: ST3000VX000-9YW166, CV13, max UDMA/133
Jul  2 11:12:20 node0 ata18.00: 5860533168 sectors, multi 0: LBA48 NCQ (depth 31/32)
Jul  2 11:12:20 node0 ata18.00: configured for UDMA/133
Jul  2 11:12:20 scsi 7: 0:7:0: Direct-Access     ATA      ST3000VX000-9YW1 CV13 PQ: 0 ANSI: 5
Jul  2 11:12:20 sd 7: 0:7:0: [sdg] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
Jul  2 11:12:20 sd 7: 0:7:0: Attached scsi generic sg6 type 0
Jul  2 11:12:20 sd 7: 0:7:0: [sdg] 4096-byte physical blocks
Jul  2 11:12:20 sd 7: 0:7:0: [sdg] Write Protect is off
Jul  2 11:12:20 sd 7: 0:7:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul  2 11:12:20 node0  sdg:
Jul  2 11:12:20 node0  unknown partition table
Jul  2 11:12:20 sd 7: 0:7:0: [sdg] Attached SCSI disk
Jul  2 11:12:24 block drbd5: disk( Diskless -> Attaching ) 
Jul  2 11:12:25 block drbd5: bitmap READ of 22356 pages took 980 jiffies
Jul  2 11:12:25 block drbd5: recounting of set bits took additional 56 jiffies
Jul  2 11:12:25 block drbd5: 13 GB (3470336 bits) marked out-of-sync by on disk bit-map.
Jul  2 11:12:25 block drbd5: disk( Attaching -> Negotiating ) 
Jul  2 11:12:25 block drbd5: attached to UUIDs F7D979074ACFA28E:0000000000000000:773EDC111473E72E:773DDC111473E72E
Jul  2 12:22:50 block drbd5: disk( Negotiating -> Failed ) 

'drbdadm detach' at 11:11:09, sdg deleted at 11:11:10, new disk recognized at 11:12:20, then 'drbdadm attach' at 11:12:24 -- note the ~70 minute gap between the last two lines; the Negotiating->Failed transition was me manually intervening.

And the same window of time on node1:

Jul  2 11:11:09 block drbd5: pdsk( UpToDate -> Failed ) 
Jul  2 11:11:09 block drbd5: new current UUID 7D883041AC11CDF4:3D2B593201711886:AB54E07C096FBAAC:AB53E07C096FBAAC
Jul  2 11:11:09 block drbd5: pdsk( Failed -> Diskless ) 
Jul  2 11:12:25 block drbd5: real peer disk state = Consistent
Jul  2 11:12:25 block drbd5: drbd_sync_handshake:
Jul  2 11:12:25 block drbd5: self 7D883041AC11CDF4:3D2B593201711886:AB54E07C096FBAAC:AB53E07C096FBAAC bits:0 flags:0
Jul  2 11:12:25 block drbd5: peer F7D979074ACFA28F:0000000000000000:773EDC111473E72E:773DDC111473E72E bits:3470336 flags:2
Jul  2 11:12:25 block drbd5: uuid_compare()=-1000 by rule 100
Jul  2 11:12:25 block drbd5: Unrelated data, aborting!
Jul  2 11:12:25 block drbd5: Disk attach process on the peer node was aborted.
Jul  2 12:22:50 block drbd5: pdsk( Diskless -> Failed ) 

Between 11:12:25 and 12:22:50, /proc/drbd looks like this on node0:

version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 599f286440bd633d15d5ff985204aff4bccffadd build by phil at Build64R6, 2013-10-14 15:33:06
...
 5: cs:Connected ro:Primary/Secondary ds:Negotiating/UpToDate C r---d-
    ns:0 nr:0 dw:0 dr:0 al:108344 bm:182159 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:13881344

And like this on node1:

version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 599f286440bd633d15d5ff985204aff4bccffadd build by phil at Build64R6, 2013-10-14 15:33:06
...
 5: cs:Connected ro:Secondary/Primary ds:UpToDate/Diskless C r-----
    ns:824333588 nr:1053169504 dw:1053169504 dr:824334636 al:9 bm:366654 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

I tried detaching and re-attaching again (without any disk-swapping) thinking perhaps I'd just hit some narrow race window, but got the same behavior on both sides.  At that point I gave up and just did a 'create-md' on node0 and resynced from scratch, which it (thankfully) did without fuss.

So overall, two things:

1) Getting stuck in dstate=Negotiating definitely looks like a DRBD bug to me; is this a known issue, and perhaps something fixed in 8.4.5?  I scanned the changelog summary in the announcement email but didn't see anything that looked (to me) obviously related.

2) Reading (and perhaps more importantly *thinking*) a bit about DRBD's metadata, I guess my hopes of magically bitmap-optimized resync were a bit unrealistic, given the somewhat nontrivial orchestration of what bitmaps would have to get written to which disks and when, and the fact that I don't think DRBD is really designed to support this use case efficiently.  Just in case I'm wrong about that, is there any simple way to leverage DRBD's bitmaps for more efficient resyncs in this scenario, or am I going to have to just do a full-device sync on each disk-swap?

Thanks,
Zev Weiss