[DRBD-user] can drbd be made to detect that it has failed to write to the underlying device in a 'long time'?

Todd Denniston Todd.Denniston at ssa.crane.navy.mil
Wed Apr 14 22:09:10 CEST 2004



Lars Ellenberg wrote:
> 
> / 2004-04-13 16:48:09 -0500
> \ Todd Denniston:
> > Darn, I was not doing anything using the disk at the time it had a problem
> > today, and missed the slow spot.
> >
> > On a positive note, about 22 seconds into the secondary's Card dump (I think
> > this is the time that the lockup starts) I got the following on the active
> > primary:
> > Apr 13 15:36:23 foo kernel: drbd1: sock_sendmsg time expired on sock
> > Apr 13 15:36:23 foo kernel: drbd1: no data sent since 10 ping intervals, peer seems knocked out: going to StandAlone.
> > Apr 13 15:36:23 foo kernel: drbd1: Connection lost.
> 
> That's what I expect.
> 
I do wonder why I only get that on one of the 7 drbd devices I have, though.
For the rest I just get 'Connection lost.' messages spread over the next
couple of minutes, instead of all at once.

Apr 14 13:38:24 foo kernel: drbd1: sock_sendmsg time expired on sock
Apr 14 13:38:24 foo kernel: drbd1: no data sent since 10 ping intervals, peer seems knocked out: going to StandAlone.
Apr 14 13:38:24 foo kernel: drbd1: Connection lost.
Apr 14 13:40:04 foo kernel: drbd5: Connection lost.
Apr 14 13:40:05 foo kernel: drbd2: Connection lost.
Apr 14 13:40:05 foo kernel: drbd4: Connection lost.
Apr 14 13:40:10 foo kernel: drbd6: Connection lost.
Apr 14 13:40:10 foo kernel: drbd0: Connection lost.
Apr 14 13:40:10 foo kernel: drbd3: Connection lost.
Apr 14 13:41:57 foo ipfail[4144]: info: Status update: Node bar now has status dead


> > So were the 'ko count down' messages something that you might have added or
> > fixed in 0.6.1[12]?
> 
> Probably. Wait a minute ...
> yes. made it more verbose and improved it later than 0.6.10
> so what about an update to 0.6.12 ...
>
As I had tested 0.6.10 and, until now, was confident in it, I hesitate to
update. But stability looks like something I don't have now anyway, so by
Friday I will update it.
If I do not change the drbd.config file, I should not have any problem
bringing the machine down running a kernel with 0.6.10 and bringing it back
up with 0.6.12, correct?


BTW, when drbd sends the kernel on the secondary machine into a kernel panic
and you bring the secondary machine back up, is a quick sync really enough?
Or should I bring drbd down, run `rm /var/lib/drbd/* -f`, and then bring drbd
back up, forcing a full sync? I have been doing the rm (mostly, I think, to
keep my sanity).


While bar was secondary and going down, I captured this:
[root@bar root]# cat /proc/drbd 
version: 0.6.10 (api:64/proto:62)

0: cs:Connected st:Secondary/Primary ns:0 nr:3908560 dw:3908560 dr:0 pe:0 ua:0
1: cs:Connected st:Secondary/Primary ns:0 nr:243294580 dw:243294580 dr:0 pe:0 ua:103
2: cs:Connected st:Secondary/Primary ns:0 nr:159564604 dw:159564604 dr:0 pe:0 ua:0
3: cs:Connected st:Secondary/Primary ns:0 nr:160011588 dw:160011588 dr:0 pe:0 ua:0
4: cs:Connected st:Secondary/Primary ns:0 nr:81971464 dw:81971464 dr:0 pe:0 ua:0
5: cs:Connected st:Secondary/Primary ns:0 nr:77534192 dw:77534192 dr:0 pe:0 ua:0
6: cs:Connected st:Secondary/Primary ns:0 nr:79765260 dw:79765260 dr:0 pe:0 ua:0
#about 20 seconds later
[root@bar root]# cat /proc/drbd 
version: 0.6.10 (api:64/proto:62)

0: cs:Connected st:Secondary/Primary ns:0 nr:3908560 dw:3908560 dr:0 pe:0 ua:0
1: cs:Connected st:Secondary/Primary ns:0 nr:243294580 dw:243294580 dr:0 pe:0 ua:103
2: cs:Connected st:Secondary/Primary ns:0 nr:159564604 dw:159564604 dr:0 pe:0 ua:0
3: cs:Connected st:Secondary/Primary ns:0 nr:160011588 dw:160011588 dr:0 pe:0 ua:0
4: cs:Connected st:Secondary/Primary ns:0 nr:81971464 dw:81971464 dr:0 pe:0 ua:0
5: cs:Connected st:Secondary/Primary ns:0 nr:77534192 dw:77534192 dr:0 pe:0 ua:0
6: cs:Connected st:Secondary/Primary ns:0 nr:79765260 dw:79765260 dr:0 pe:0 ua:0
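
In both captures device 1 sits at ua:103 with nr and dw unchanged, which is the
kind of "no progress on the lower device" condition the subject line asks
about. A rough sketch of checking for that from two saved snapshots follows.
It is only a heuristic of my own, not something drbd does: it assumes ua
counts requests received but not yet acknowledged and dw counts data written
to the local disk, and the function names drbd_counters and drbd_stalled are
made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "<device> <dw> <ua>" for each device found in a
# saved /proc/drbd snapshot (0.6-style format, as shown in this mail).
# Lines that were wrapped by a mailer or terminal are re-joined first.
drbd_counters() {
    awk '
        function emit(   i, n, f, dev, dw, ua) {
            if (rec == "") return
            n = split(rec, f, " ")
            dev = f[1]; sub(/:$/, "", dev)
            for (i = 2; i <= n; i++) {
                if (f[i] ~ /^dw:/) dw = substr(f[i], 4)
                if (f[i] ~ /^ua:/) ua = substr(f[i], 4)
            }
            print dev, dw, ua
            rec = ""
        }
        /^ *[0-9]+: /   { emit(); rec = $0; next }               # new device line
        /^[a-z][a-z]:/  { if (rec != "") rec = rec " " $0; next } # wrapped tail
                        { emit() }                                # anything else
        END             { emit() }
    ' "$1"
}

# Hypothetical heuristic: a device is reported as stalled when it has
# unacknowledged requests (ua > 0) in both snapshots and dw did not move.
# $1 = older snapshot file, $2 = newer snapshot file.
drbd_stalled() {
    { drbd_counters "$1"; echo "--"; drbd_counters "$2"; } | awk '
        $1 == "--" { second = 1; next }
        !second    { dw[$1] = $2; ua[$1] = $3; next }
        ($1 in dw) && $3 + 0 > 0 && ua[$1] + 0 > 0 && $2 == dw[$1] { print $1 }
    '
}
```

With the two captures from bar saved to files, `drbd_stalled first second`
would be expected to name device 1 only. How long the interval between
snapshots has to be before this means anything is a judgment call.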


And at about the same time on foo (primary at the time):
version: 0.6.10 (api:64/proto:62)

0: cs:Connected st:Primary/Secondary ns:7827768 nr:0 dw:107504 dr:7726281 pe:0 ua:0
1: cs:StandAlone st:Primary/Unknown ns:674185748 nr:0 dw:10200048 dr:837548481 pe:0 ua:0
2: cs:Connected st:Primary/Secondary ns:318957224 nr:0 dw:664112 dr:334533553 pe:0 ua:0
3: cs:Connected st:Primary/Secondary ns:367069060 nr:0 dw:48766152 dr:396337957 pe:0 ua:0
4: cs:Connected st:Primary/Secondary ns:159589372 nr:0 dw:4532896 dr:155788905 pe:0 ua:0
5: cs:Connected st:Primary/Secondary ns:155072972 nr:0 dw:17936 dr:155074341 pe:0 ua:0
6: cs:Connected st:Primary/Secondary ns:159535520 nr:0 dw:218360 dr:177417773 pe:0 ua:0
#about 70 seconds later
version: 0.6.10 (api:64/proto:62)


0: cs:Connected st:Primary/Secondary ns:7827768 nr:0 dw:107504 dr:7726281 pe:0 ua:0
1: cs:StandAlone st:Primary/Unknown ns:674185748 nr:0 dw:10205984 dr:837548489 pe:0 ua:0
2: cs:WFConnection st:Primary/Unknown ns:318957224 nr:0 dw:664112 dr:334533553 pe:0 ua:0
3: cs:Connected st:Primary/Secondary ns:367069060 nr:0 dw:48766152 dr:396337957 pe:0 ua:0
4: cs:WFConnection st:Primary/Unknown ns:159589372 nr:0 dw:4532896 dr:155788905 pe:0 ua:0
5: cs:WFConnection st:Primary/Unknown ns:155072972 nr:0 dw:17936 dr:155074341 pe:0 ua:0
6: cs:Connected st:Primary/Secondary ns:159535520 nr:0 dw:218360 dr:177417773 pe:0 ua:0


I now have a script running
 while true ; do sleep 10; date >> /root/drbdwatch ; \
   cat /proc/drbd >> /root/drbdwatch ; done&
to get better data on both systems.
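
A variant of that loop which logs only the devices that have left
cs:Connected might make the interesting moments easier to find in the log.
This is a sketch, not something I am running: the function name
drbd_not_connected is made up, and it assumes the 0.6-style /proc/drbd layout
shown above, where cs: always appears on the first line of a device entry.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/drbd-style text on stdin and print
# "<device> <connection state>" for every device that is not cs:Connected.
drbd_not_connected() {
    awk '
        /^ *[0-9]+: / {
            dev = $1; sub(/:$/, "", dev)
            for (i = 2; i <= NF; i++)
                if ($i ~ /^cs:/) {
                    cs = substr($i, 4)
                    if (cs != "Connected") print dev, cs
                }
        }
    '
}

# Possible use in the watch loop (left commented out, since it never exits):
# while true ; do sleep 10; date ; drbd_not_connected < /proc/drbd ; \
#   done >> /root/drbdwatch &
```

Against the second capture from foo above this would list devices 1, 2, 4 and
5 with their states. The full `cat /proc/drbd` output is still worth keeping
alongside it, since the counters are what show whether data is moving.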
-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane) 
Harnessing the Power of Technology for the Warfighter


