Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,

I was wondering if someone could eyeball the logs further below from a resource that completely failed over yesterday and today, and tell me whether it looks like a "normal" failure of the underlying storage, or whether anything looks strange. I ask because two things are odd:

1. On the primary node, drbd reports that the underlying storage fails for the resource (1 out of 27, the rest all fine and healthy), yet there are NO reports of failure from the underlying storage, which happens to be a block device used by other (still healthy) resources (the drbd backing devices are all logical volumes). The resource goes Diskless on the primary, but service continues because the secondary is still fine.

2. Eleven hours later, the same thing happens on the secondary node (different machine, different physical storage): drbd reports a read failure from the local storage there (also LVs over a shared block device, the other resources also fine), yet again no reports of failure from the underlying storage. This is of course the nail in the coffin for the resource, as both sides are now Diskless. Again, all other resources that share the same block device are still fine and 100% healthy, and there are no signs of any other issues on either node.

Both nodes run drbd-9.0.9-1 from drbd.org on a vanilla kernel.org kernel 4.9.58. The failed resource had existed without any problems for many weeks, but was originally created with drbd-9.0.8-1 on a vanilla kernel 4.4.77. Both nodes were upgraded to drbd-9.0.9-1/4.9.58 a few days ago; I don't know if this is significant in any way.

Lastly, the failed resource is still there, both sides in the Diskless state. Is there anything I can poke, maybe in /sys/kernel/debug, that might give further info about what happened? (Two rough sketches of what I have in mind are after the logs below.)

Thanks,
Eddie

Here is the log from the primary node when the first failure happened:

drbd RES7H10E/0 drbd42: local READ IO error sector 6192640+16 on dm-43
drbd RES7H10E/0 drbd42: disk( UpToDate -> Failed )
drbd RES7H10E/0 drbd42: Local IO failed in __req_mod. Detaching...
drbd RES7H10E/0 drbd42: sending new current UUID: 2A150CB88CD794F6
drbd RES7H10E/0 drbd42: disk( Failed -> Diskless )
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486856, 4096), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486888, 20480), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486928, 57344), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1478264, 4096), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486704, 77824), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1486864, 12288), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1487048, 286720), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1487608, 262144), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 41556976, 61440), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 1479376, 4096), but my Disk seems to have failed :(

And the log from the secondary node failure exactly 11 hours later:

drbd RES7H10E/0 drbd42: read: error=10 s=29090424s
drbd RES7H10E/0 drbd42: disk( UpToDate -> Failed )
drbd RES7H10E/0 drbd42: Local IO failed in drbd_endio_read_sec_final. Detaching...
drbd RES7H10E/0 drbd42 node1.mydomain: Sending NegDReply. sector=29090424s.
drbd RES7H10E/0 drbd42: disk( Failed -> Diskless )
drbd RES7H10E node1.mydomain: Wrong magic value 0x0090d574 in protocol version 112
drbd RES7H10E node1.mydomain: conn( Connected -> ProtocolError ) peer( Primary -> Unknown )
drbd RES7H10E/0 drbd42 node1.mydomain: pdsk( Diskless -> DUnknown ) repl( Established -> Off )
drbd RES7H10E node1.mydomain: ack_receiver terminated
drbd RES7H10E node1.mydomain: Terminating ack_recv thread
drbd RES7H10E node1.mydomain: Connection closed
drbd RES7H10E node1.mydomain: conn( ProtocolError -> Unconnected )
drbd RES7H10E node1.mydomain: Restarting receiver thread
drbd RES7H10E node1.mydomain: conn( Unconnected -> Connecting )
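
P.S. Here is roughly what I mean by poking around in /sys/kernel/debug, a minimal sketch that just walks whatever drbd 9 exposes under /sys/kernel/debug/drbd for the failed resource and dumps it. I am assuming debugfs is mounted at /sys/kernel/debug and that the per-resource entries live somewhere under that directory; no particular file names are assumed.

#!/usr/bin/env python3
# Sketch: dump everything drbd exposes in debugfs for the failed resource.
# Assumes debugfs is mounted at /sys/kernel/debug and that drbd 9 keeps its
# entries somewhere under /sys/kernel/debug/drbd; the walk makes no
# assumptions about individual file names.  Needs root (debugfs is root-only).
import os

RESOURCE = "RES7H10E"
BASE = "/sys/kernel/debug/drbd"

for dirpath, dirnames, filenames in os.walk(BASE):
    # Only dump files from directories that mention the failed resource,
    # plus whatever sits directly in the top-level drbd directory.
    if RESOURCE not in dirpath and dirpath != BASE:
        continue
    for name in sorted(filenames):
        path = os.path.join(dirpath, name)
        print("==== %s ====" % path)
        try:
            with open(path) as f:
                print(f.read().rstrip())
        except OSError as e:
            # Some debugfs files may be write-only or refuse to be read.
            print("<could not read: %s>" % e)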
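
And, since the odd part is that the underlying storage itself reports no errors, the second sketch is how I would double-check whether the backing LV on the primary still fails at the exact sectors from the first log line. It assumes dm-43 is still the LV backing drbd42 on the primary, 512-byte sectors, and that nothing has reused the device since; O_DIRECT is used so the read cannot be satisfied from the page cache.

#!/usr/bin/env python3
# Sketch: re-read the sectors from "local READ IO error sector 6192640+16 on
# dm-43" directly from the backing device, bypassing the page cache.
# Assumptions: dm-43 is still the backing LV for drbd42 on the primary and
# sectors are 512 bytes.  Run as root; a bad region should give EIO here too.
import mmap
import os

DEV = "/dev/dm-43"   # device named in the primary's error message
SECTOR = 6192640     # failing sector from the log
NSECTORS = 16        # "+16" in the log, i.e. 16 sectors of 512 bytes
SECTOR_SIZE = 512

fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)
f = os.fdopen(fd, "rb", buffering=0)
buf = mmap.mmap(-1, NSECTORS * SECTOR_SIZE)  # page-aligned, as O_DIRECT requires
try:
    f.seek(SECTOR * SECTOR_SIZE)
    n = f.readinto(buf)
    print("read %d bytes at sector %d without error" % (n, SECTOR))
except OSError as e:
    print("read at sector %d failed: %s" % (SECTOR, e))
finally:
    f.close()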