[DRBD-user] Resource hangs with "time expired" errors

AZ 9901 az9901 at gmail.com
Thu Jun 20 22:47:51 CEST 2013



Hello,

I already faced this issue sporadically a few months ago, and it occurred again last night.
Here is what happens.



Online verification is running (on a weekly basis):

root at srv2-1:~# cat /proc/drbd 
version: 8.3.15 (api:88/proto:86-97)
GIT-hash: 0ce4d235fc02b5c53c1c52c53433d11a694eab8c build by root at srv2-1, 2013-05-20 13:24:15
 1: cs:VerifyT ro:Primary/Secondary ds:UpToDate/UpToDate C r---d-
    ns:1717212 nr:0 dw:1720216 dr:844934114 al:7737 bm:0 lo:1 pe:4238 ua:2048 ap:2049 ep:1 wo:b oos:0
    [=======>............] verified: 41.9% (1105704/1902544)M
    finish: 10095:28:53 speed: 28 (9,648) K/sec

With the following settings :
syncer {
  rate 10M;
  verify-alg crc32c;
}

During this verification, the primary's network input rate is about 3 Mbps and its output rate is about 1 Mbps (out of 100 Mbps).
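
For anyone who wants to watch for this state from a script, here is a minimal sketch of pulling the counters out of the /proc/drbd output shown above. The function name parse_drbd_status and the SAMPLE constant are illustrative only (they are not part of DRBD), and the parsing assumes the 8.3-style "key:value" layout seen in this post.

```python
import re

# Sample copied from the /proc/drbd output above (device, counter and progress lines).
SAMPLE = """\
 1: cs:VerifyT ro:Primary/Secondary ds:UpToDate/UpToDate C r---d-
    ns:1717212 nr:0 dw:1720216 dr:844934114 al:7737 bm:0 lo:1 pe:4238 ua:2048 ap:2049 ep:1 wo:b oos:0
    [=======>............] verified: 41.9% (1105704/1902544)M
"""


def parse_drbd_status(text):
    """Return the two- or three-letter key:value fields as a dict of strings,
    plus 'verified_pct' (float, or None when no verify progress line exists)."""
    fields = dict(re.findall(r"\b([a-z]{2,3}):(\S+)", text))
    match = re.search(r"verified:\s*([\d.]+)%", text)
    fields["verified_pct"] = float(match.group(1)) if match else None
    return fields


status = parse_drbd_status(SAMPLE)
# Connection state, pending requests and verify progress from the sample.
print(status["cs"], status["pe"], status["verified_pct"])
```

On a live node the same function could be fed the contents of /proc/drbd read from disk; what value of "pe" should count as abnormal is left open here.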



Some activity starts on the resource, taking the network rate to between 4 Mbps and 10 Mbps (out of 100 Mbps).
After about one hour, the resource hangs completely: reads and writes are impossible, and even a simple "ls" hangs.
Many errors like the following one appear in the syslog:
Jun 19 21:08:10 srv2-1 kernel: block drbd1: [drbd1_worker/26788] sock_sendmsg time expired, ko = 4294967295
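
A side note on the number in that message, as arithmetic only: 4294967295 equals 2^32 - 1, which is how -1 reads when printed as an unsigned 32-bit integer. Whether this means the "ko" counter wrapped around below zero is an assumption, not something verified against the DRBD source. The helper as_signed_32 below is illustrative.

```python
KO_FROM_LOG = 4294967295  # value copied from the syslog line above


def as_signed_32(value):
    """Reinterpret an unsigned 32-bit integer as a signed 32-bit integer."""
    return value - 2**32 if value >= 2**31 else value


print(KO_FROM_LOG == 2**32 - 1)   # True
print(as_signed_32(KO_FROM_LOG))  # -1
```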



At this point, the only way I found to bring the resource back into production is to cut network communication between the two nodes (using netfilter/iptables).
I did not think of testing "drbdadm disconnect".
I first tried "/etc/init.d/drbd stop" on the secondary node, but it hung until network communication was cut.



Questions :

1 - Is there a bug that makes DRBD / online verification behave as if it were in an infinite loop, producing "sock_sendmsg time expired" messages?
2 - Would it be possible for the DRBD team to investigate this?
3 - As a workaround, is there any DRBD configuration that would, for example, make the primary go StandAlone (disconnect) when this error occurs?



Of course, thank you very much for your support !

Best regards,

Ben



