[DRBD-user] Forcibly disconnect if secondary is responding too slow?

Thu Nov 6 19:29:03 CET 2014

Hello!

I have just noticed some strange behaviour with a DRBD setup.

After rebooting a node, the secondary reconnected and started a resync. 
After a short time a disk on the secondary started to throw command 
timeouts. I don't know why the RAID controller did not remove the disk, 
but I got a lot of error messages and the I/O on the VD was stuck for a 
long time. Then it recovered and froze again.

I observed the behaviour and decided to disconnect the secondary, as I/O 
on the primary was frozen too. But disconnect worked neither on the 
primary nor on the secondary for the problematic DRBD device, so I had 
to reset the secondary to resolve the problem. The primary continued 
without a problem after that. Unfortunately the I/O was stalled for at 
least 3-4 minutes, I assume, so I/O errors were thrown in different VMs.

So I wondered whether it is possible to configure a forced 
disconnect for unresponsive resources. I can see the disk-timeout option, 
which is described as dangerous and would detach the DRBD device, and I 
can see the timeout option, which would disconnect if no packet was 
received from the peer, but it seems that DRBD either decided NOT to 
disconnect or could not disconnect for some reason.

I can see such messages in the log on the primary:

Nov  6 18:06:08 node-a kernel: [4315372.080011] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967256
Nov  6 18:06:14 node-a kernel: [4315378.080014] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967255
Nov  6 18:06:20 node-a kernel: [4315384.080011] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967254
Nov  6 18:06:26 node-a kernel: [4315390.080009] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967253
Nov  6 18:06:32 node-a kernel: [4315396.080009] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967252
Nov  6 18:06:38 node-a kernel: [4315402.080010] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967251
Nov  6 18:06:44 node-a kernel: [4315408.080013] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967250
Nov  6 18:06:50 node-a kernel: [4315414.080011] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967249
Nov  6 18:06:56 node-a kernel: [4315420.080012] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967248
Nov  6 18:07:02 node-a kernel: [4315426.080011] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967247
Nov  6 18:10:40 node-a kernel: [4315644.236018] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967295
Nov  6 18:11:18 node-a kernel: [4315682.088012] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967295
Nov  6 18:11:24 node-a kernel: [4315688.088009] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967294
Nov  6 18:11:30 node-a kernel: [4315694.088009] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967293
Nov  6 18:11:36 node-a kernel: [4315700.088009] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967292
Nov  6 18:11:42 node-a kernel: [4315706.088010] block drbd1: 
[drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967291
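
A side note on reading these numbers: the ko values look like a counter 
that has wrapped below zero. The small Python sketch below reinterprets 
them as signed 32-bit integers and prints the gap between messages. It 
uses four of the lines above with the mail wrapping undone. That ko is a 
32-bit counter is my assumption from the size of the values, and the 
helper names (as_signed32, parse) are made up for this example.

```python
import re

# Four of the kernel log lines quoted above, with the mail line wrap undone.
LOG = """\
Nov  6 18:06:08 node-a kernel: [4315372.080011] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967256
Nov  6 18:06:14 node-a kernel: [4315378.080014] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967255
Nov  6 18:07:02 node-a kernel: [4315426.080011] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967247
Nov  6 18:10:40 node-a kernel: [4315644.236018] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967295
"""

# Captures the kernel timestamp, the device name and the ko value.
PATTERN = re.compile(r"\[\s*(\d+\.\d+)\] block (drbd\d+): .*ko = (\d+)")


def as_signed32(value):
    """Reinterpret an unsigned 32-bit number as signed (assumes ko is 32 bits wide)."""
    return value - 2**32 if value >= 2**31 else value


def parse(log):
    """Return (kernel_timestamp_seconds, device, signed_ko) for each matching line."""
    rows = []
    for line in log.splitlines():
        m = PATTERN.search(line)
        if m:
            rows.append((float(m.group(1)), m.group(2), as_signed32(int(m.group(3)))))
    return rows


rows = parse(LOG)
previous = None
for t, dev, ko in rows:
    gap = "" if previous is None else f"  (+{t - previous:.0f} s)"
    print(f"{dev} t={t:.0f} ko={ko}{gap}")
    previous = t
```

Read this way, the counter runs from -40 down to -49 and later restarts 
at -1, and consecutive messages in a burst are 6 seconds apart. The 
6 seconds would fit a timeout of 60 (the unit is tenths of a second). 
Whether the negative counter means the ko-count limit is effectively 
switched off here is only a guess on my part.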

But the node remained in the connected / Uptodate/Inconsistent state until I 
reset the peer. How can such behaviour be avoided?

Thank you all in advance,
regards, Felix