[DRBD-user] drbd-0.7.15 sock_sendmsg time expired for no apparent reason during fsck journal recovery (repost)

Maurice Volaski mvolaski at aecom.yu.edu
Sun Feb 19 20:21:03 CET 2006


First posted on 2/7 with no reply. It hasn't happened since, but once
is enough to suggest that something is wrong somewhere.


On the second go around, drbd successfully resynced all 5600 GB
without crashing. =D>

I ran a quick fsck to recover the journals from the earlier crash,
and it worked OK for every resource but one. For that one, it just
sat there. I looked at the log and saw numerous messages like so:

Feb  6 15:18:32 [kernel] drbd6: [fsck.ext3/4455] sock_sendmsg time expired, ko = 4294967295
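
A side note on the number in that message: 4294967295 is 2^32 - 1, which is the value an unsigned 32-bit counter holds after being decremented from 0. Whether that is how drbd arrives at the "ko" value is an assumption here and has not been checked against the drbd source. The helper name decrement_u32 below is hypothetical; the sketch only demonstrates the arithmetic.

```python
# Minimal sketch of unsigned 32-bit wraparound.
# ASSUMPTION: "ko" is an unsigned 32-bit counter that gets decremented;
# this is a guess based on the logged value, not taken from the drbd source.

def decrement_u32(value):
    """Decrement a value the way an unsigned 32-bit integer would."""
    return (value - 1) & 0xFFFFFFFF

ko = decrement_u32(0)
print(ko)  # 4294967295, the value seen in the log message
```

If that reading is right, the counter started at 0 and is nowhere near counting down to anything.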

Both computers were up and responding normally over ssh; the only
symptom was this message repeating in the log. Oddly, /proc/drbd
showed nothing on either computer to hint that anything was wrong.

The fsck process on the primary hung for about 16 minutes, and
then the log suddenly showed:

Feb  6 15:34:07 [kernel] drbd6: sock_recvmsg returned -110
Feb  6 15:34:07 [kernel] drbd6: Connection lost.
Feb  6 15:34:07 [kernel] drbd6: drbd6_receiver [6908]: cstate WFConnection --> WFReportParams
Feb  6 15:34:07 [kernel] drbd6: Handshake successful: DRBD Network Protocol version 74
Feb  6 15:34:07 [kernel] drbd6: Connection established.
Feb  6 15:34:07 [kernel] drbd6: I am(P): 1:00000002:00000001:00000003:00000002:10
Feb  6 15:34:07 [kernel] drbd6: Primary/Unknown --> Primary/Secondary
Feb  6 15:34:07 [kernel] drbd6: drbd6_receiver [6908]: cstate WFBitMapS --> SyncSource
Feb  6 15:34:08 [kernel] drbd6: Resync done (total 1 sec; paused 0 sec; 1800 K/sec)
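
To put a number on "about 16 minutes", the gap between the one quoted "time expired" message and the "Connection lost" line can be computed from the two timestamps. This is only a rough sketch: the quoted message was one of many and may not have been the first, so the real hang could be somewhat longer. The year 2006 is assumed from the date of this post, since syslog stamps carry none, and the function name parse_syslog_stamp is made up for the example.

```python
from datetime import datetime

# Timestamps copied from the two log excerpts in this post.
quoted_expiry = "Feb  6 15:18:32"
connection_lost = "Feb  6 15:34:07"

def parse_syslog_stamp(stamp, year=2006):
    """Parse a syslog-style stamp; the year is an assumption (not in the log)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

elapsed = parse_syslog_stamp(connection_lost) - parse_syslog_stamp(quoted_expiry)
minutes, seconds = divmod(int(elapsed.total_seconds()), 60)
print(f"{minutes} min {seconds} s")  # 15 min 35 s
```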

Here's an excerpt from drbd.conf:

protocol C;
incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
startup { wfc-timeout 0; degr-wfc-timeout 120; }
disk    { on-io-error detach; }
net     { timeout 60; connect-int 10; ping-int 10;
          max-buffers 2048; max-epoch-size 2048; }
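
For anyone comparing these net settings against the length of the hang, here is a small sketch that converts them to seconds. The units are an assumption based on my reading of the drbd.conf man page (timeout in tenths of a second, connect-int and ping-int in whole seconds); please check them against the man page for your version. The names UNITS_PER_SECOND and to_seconds are just for illustration.

```python
# Values copied from the net section of the drbd.conf excerpt.
net = {"timeout": 60, "connect-int": 10, "ping-int": 10}

# ASSUMPTION (from the drbd.conf man page as I understand it, not from this post):
# timeout counts tenths of a second; connect-int and ping-int count seconds.
UNITS_PER_SECOND = {"timeout": 10, "connect-int": 1, "ping-int": 1}

def to_seconds(settings):
    """Convert each setting to seconds using the assumed units."""
    return {name: value / UNITS_PER_SECOND[name] for name, value in settings.items()}

seconds = to_seconds(net)
for name, value in seconds.items():
    print(f"{name}: {value:g} s")

# The timeout should stay below both intervals for the settings to be consistent.
timeout_is_lowest = seconds["timeout"] < min(seconds["connect-int"], seconds["ping-int"])
print("timeout below connect-int and ping-int:", timeout_is_lowest)
```

Under those assumed units the timeout works out to 6 seconds, which is far shorter than the hang described above.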


Any ideas what could cause drbd to suddenly lose access to the
secondary in this fashion? It came back after 16 minutes. Shouldn't
it have taken some action during that time?
-- 

Maurice Volaski, mvolaski at aecom.yu.edu
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University


