Hello!

I have just noticed strange behaviour with a DRBD setup. After a node was rebooted, the secondary reconnected and started to resync. A short time later a disk on the secondary started to throw command timeouts. I don't know why the RAID controller did not remove the disk, but I got a lot of error messages and the I/O on the VD was stuck for a long time. Then it recovered and froze again.

I watched this for a while and decided to disconnect the secondary, because I/O on the primary was frozen too. But the disconnect worked neither on the primary nor on the secondary for the problematic DRBD device, so I had to reset the secondary to resolve the problem. The primary continued without a problem after that. Unfortunately the I/O was stalled for at least 3-4 minutes, I assume, so that I/O errors were thrown in different VMs.

So I wondered whether it is possible to configure a forced disconnect for unresponsive resources. I can see the disk-timeout option, which is described as dangerous and would detach the DRBD device, and I can see the timeout option, which would disconnect if no packet was received from the peer. But it seems that DRBD either decided NOT to disconnect or could not disconnect for some reason.

I can see messages like these in the log on the primary:

Nov 6 18:06:08 node-a kernel: [4315372.080011] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967256
Nov 6 18:06:14 node-a kernel: [4315378.080014] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967255
Nov 6 18:06:20 node-a kernel: [4315384.080011] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967254
Nov 6 18:06:26 node-a kernel: [4315390.080009] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967253
Nov 6 18:06:32 node-a kernel: [4315396.080009] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967252
Nov 6 18:06:38 node-a kernel: [4315402.080010] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967251
Nov 6 18:06:44 node-a kernel: [4315408.080013] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967250
Nov 6 18:06:50 node-a kernel: [4315414.080011] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967249
Nov 6 18:06:56 node-a kernel: [4315420.080012] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967248
Nov 6 18:07:02 node-a kernel: [4315426.080011] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967247
Nov 6 18:10:40 node-a kernel: [4315644.236018] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967295
Nov 6 18:11:18 node-a kernel: [4315682.088012] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967295
Nov 6 18:11:24 node-a kernel: [4315688.088009] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967294
Nov 6 18:11:30 node-a kernel: [4315694.088009] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967293
Nov 6 18:11:36 node-a kernel: [4315700.088009] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967292
Nov 6 18:11:42 node-a kernel: [4315706.088010] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967291

But the node remained in the Connected, UpToDate/Inconsistent state until I reset the peer.

How can such behaviour be avoided?

Thank you all in advance,
regards,
Felix
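
P.S. In case it helps with reading the log: the ko values sit just below 2^32 and go down by one per message, which looks to me like a small counter that was decremented past zero and printed as unsigned. That is only my assumption, I have not checked the DRBD source. Below is a minimal Python sketch that reinterprets the values as signed 32-bit numbers and prints the gap between consecutive messages. The names PATTERN, to_signed32 and parse are made up for this illustration, and LOG holds three lines copied from the excerpt above.

```python
import re

# Three lines copied from the log excerpt; paste the full log here if needed.
LOG = """\
Nov 6 18:06:08 node-a kernel: [4315372.080011] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967256
Nov 6 18:06:14 node-a kernel: [4315378.080014] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967255
Nov 6 18:10:40 node-a kernel: [4315644.236018] block drbd1: [drbd1_worker/5623] sock_sendmsg time expired, ko = 4294967295
"""

# Captures the kernel timestamp, the device name and the printed ko value.
PATTERN = re.compile(
    r"\[\s*(\d+\.\d+)\] block (drbd\d+): .*sock_sendmsg time expired, ko = (\d+)"
)


def to_signed32(value):
    """Reinterpret an unsigned 32-bit number as signed (assumption: 32-bit wrap)."""
    return value - 2**32 if value >= 2**31 else value


def parse(log):
    """Return (kernel_timestamp, device, signed_ko) for every matching line."""
    rows = []
    for line in log.splitlines():
        match = PATTERN.search(line)
        if match:
            rows.append(
                (float(match.group(1)), match.group(2), to_signed32(int(match.group(3))))
            )
    return rows


if __name__ == "__main__":
    previous = None
    for stamp, device, ko in parse(LOG):
        gap = "" if previous is None else f"  (+{stamp - previous:.1f}s)"
        print(f"{device}  t={stamp:.6f}  ko={ko}{gap}")
        previous = stamp
```

Read this way, the first message in the excerpt has ko = -40 and the messages within each burst are about 6 seconds apart. If the wrap assumption holds, the counter was already far below zero when these lines were logged, so it never reached a point where it would have triggered anything.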