[DRBD-user] drbd-0.7.15 sock_sendmsg time expired for no apparent reason during fsck journal recovery

Maurice Volaski mvolaski at aecom.yu.edu
Mon Feb 6 23:16:49 CET 2006



On the second go around, drbd successfully resynced all 5600 GB 
without crashing. =D>

I ran a quick fsck to recover the journals from the earlier crash, 
and it worked OK for every resource but one. For that one, it just 
sat there. I looked at the log and saw numerous messages like so:

Feb  6 15:18:32 [kernel] drbd6: [fsck.ext3/4455] sock_sendmsg time expired, ko = 4294967295
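
For what it's worth, the number in that message is 2^32 - 1, which is what an unsigned 32-bit counter shows after being decremented from 0. Here is a minimal sketch of that arithmetic. Whether drbd's ko counter really wraps this way (say, because ko-count is not set in the config) is my assumption; the log itself doesn't confirm it. The helper name dec_u32 is made up for illustration.

```python
# Arithmetic check only. The link to drbd's internal ko counter is an
# assumption, not something taken from the drbd source or the log.
MASK32 = 0xFFFFFFFF  # largest value an unsigned 32-bit integer can hold


def dec_u32(value):
    """Decrement like an unsigned 32-bit integer, wrapping around below 0."""
    return (value - 1) & MASK32


ko_after = dec_u32(0)
print(ko_after)             # 4294967295
print(ko_after == 2**32 - 1)  # True
```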

Both computers were up, reachable over ssh, and otherwise working 
fine; the only symptom was this message repeating in the log. Oddly, 
/proc/drbd on both computers showed nothing to hint that anything 
was wrong.

The fsck process on the primary hung for about 16 minutes, and then 
the log suddenly showed:

Feb  6 15:34:07 [kernel] drbd6: sock_recvmsg returned -110
Feb  6 15:34:07 [kernel] drbd6: Connection lost.
Feb  6 15:34:07 [kernel] drbd6: drbd6_receiver [6908]: cstate WFConnection --> WFReportParams
Feb  6 15:34:07 [kernel] drbd6: Handshake successful: DRBD Network Protocol version 74
Feb  6 15:34:07 [kernel] drbd6: Connection established.
Feb  6 15:34:07 [kernel] drbd6: I am(P): 1:00000002:00000001:00000003:00000002:10
Feb  6 15:34:07 [kernel] drbd6: Primary/Unknown --> Primary/Secondary
Feb  6 15:34:07 [kernel] drbd6: drbd6_receiver [6908]: cstate WFBitMapS --> SyncSource
Feb  6 15:34:08 [kernel] drbd6: Resync done (total 1 sec; paused 0 sec; 1800 K/sec)
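
To read the "sock_recvmsg returned -110" line: kernel functions report errors as negative errno values, and on Linux errno 110 is ETIMEDOUT. A small sketch of the lookup follows. The table is typed in by hand from the generic Linux errno values so it doesn't depend on the machine running it, and decode_kernel_return is just an illustrative name.

```python
# Generic Linux errno values, listed by hand so the lookup does not depend
# on the platform this script runs on.
LINUX_ERRNO = {
    11: "EAGAIN",
    32: "EPIPE",
    104: "ECONNRESET",
    110: "ETIMEDOUT",
}


def decode_kernel_return(ret):
    """Translate a kernel-style return value (negative errno) to a name."""
    if ret >= 0:
        return "OK"
    return LINUX_ERRNO.get(-ret, "errno %d" % -ret)


print(decode_kernel_return(-110))  # ETIMEDOUT
```

So the receive side gave up with a timeout at 15:34:07, which is what finally triggered "Connection lost" and the reconnect.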

Here's an excerpt from drbd.conf:
protocol C;
incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
startup { wfc-timeout 0; degr-wfc-timeout 120; }
disk    { on-io-error detach; }
net     { timeout 60; connect-int 10; ping-int 10;
max-buffers 2048; max-epoch-size 2048; }
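
For reference, here is what those net settings come to in seconds. This is a sketch that assumes the units given in the drbd.conf man page as I understand it: timeout in tenths of a second, connect-int and ping-int in whole seconds. The dictionary and function names are only for illustration.

```python
# Values copied from the net section above.
NET = {"timeout": 60, "connect-int": 10, "ping-int": 10}

# Assumed units: how many config units make up one second.
UNITS_PER_SECOND = {"timeout": 10, "connect-int": 1, "ping-int": 1}


def to_seconds(net):
    """Convert each net setting to seconds using the assumed units."""
    return {name: value / UNITS_PER_SECOND[name] for name, value in net.items()}


for name, seconds in to_seconds(NET).items():
    print("%-12s %5.1f s" % (name, seconds))
```

Under those assumptions the network timeout is 6 seconds, far shorter than the roughly 16 minutes the connection stayed stuck.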


Any ideas what could cause drbd to suddenly lose access to the 
secondary in this fashion? It came back after 16 minutes. Shouldn't 
it have taken some action during that time?
-- 

Maurice Volaski, mvolaski at aecom.yu.edu
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University


