[DRBD-user] can drbd be made to detect that it has failed to write to the underlying device in a 'long time'?

Thu Apr 15 07:02:29 CEST 2004

/ 2004-04-14 15:09:10 -0500
\ Todd Denniston:
> > > on a positive note, about 22 seconds into the secondaries Card dump (I think
> > > this is the time that the lockup starts) I got the following on the active
> > > primary:
> > > Apr 13 15:36:23 foo kernel: drbd1: sock_sendmsg time expired on sock
> > > Apr 13 15:36:23 foo kernel: drbd1: no data sent since 10 ping intervals, peer
> > > seems knocked out: going to StandAlone.
> > > Apr 13 15:36:23 foo kernel: drbd1: Connection lost.
> > 
> > That's what I expect.
> > 
> I kind of wonder why I only get that on one of the 7 drbd*'s I have, though.
> For the rest I just get 'Connection lost.' messages over the next couple of
> minutes, instead of all at once.
> 
> Apr 14 13:38:24 foo kernel: drbd1: sock_sendmsg time expired on sock
> Apr 14 13:38:24 foo kernel: drbd1: no data sent since 10 ping intervals, peer
> seems knocked out: going to StandAlone.
> Apr 14 13:38:24 foo kernel: drbd1: Connection lost.
> Apr 14 13:40:04 foo kernel: drbd5: Connection lost.
> Apr 14 13:40:05 foo kernel: drbd2: Connection lost.
> Apr 14 13:40:05 foo kernel: drbd4: Connection lost.
> Apr 14 13:40:10 foo kernel: drbd6: Connection lost.
> Apr 14 13:40:10 foo kernel: drbd0: Connection lost.
> Apr 14 13:40:10 foo kernel: drbd3: Connection lost.
> Apr 14 13:41:57 foo ipfail[4144]: info: Status update: Node bar now
>					has status dead

Maybe they did not transfer any data blocks at that time?
Or so few that they still fit in the local TCP send buffer?
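
To illustrate the send-buffer point, here is a minimal sketch in Python. It is
not DRBD code: a local socket pair stands in for the replication link, and the
function name fill_send_buffer is made up for this example. The peer never
reads, yet send() keeps succeeding until the kernel buffers are full, so a
mostly idle device would not notice a dead peer for a while.

```python
import socket


def fill_send_buffer(chunk_size=1024, limit=64 * 1024 * 1024):
    """Send to a peer that never reads; return (bytes_accepted, blocked).

    A connected local socket pair is used as a stand-in for the real
    network connection (assumption made for this illustration only).
    """
    sender, silent_peer = socket.socketpair()
    sender.setblocking(False)  # fail fast instead of hanging when full
    accepted = 0
    blocked = False
    try:
        while accepted < limit:
            try:
                # send() only queues the data in kernel buffers; it does
                # not wait for the peer to read or process it.
                accepted += sender.send(b"\0" * chunk_size)
            except BlockingIOError:
                # The buffers are full: only now does the sender see that
                # the peer is not taking any data.
                blocked = True
                break
    finally:
        sender.close()
        silent_peer.close()
    return accepted, blocked


if __name__ == "__main__":
    accepted, blocked = fill_send_buffer()
    print("bytes accepted before blocking:", accepted, "blocked:", blocked)
```

The exact number of bytes accepted depends on the kernel's buffer sizes, so no
particular value should be expected.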

> If I do not change the drbd.config file, I should not have any problem
> bringing the machine down running a kernel with 0.6.10, and bringing
> it back up with 0.6.12, correct?

right.

> BTW When drbd puts the kernel, on the secondary machine, into
> kernel-panic and you bring the secondary machine back up...  is a
> quick sync really enough???

Good question.
Probably not. The failure should be written into the meta data first, so
that it ensures a full sync. Maybe DRBD should even refuse to do anything
until the operator has confirmed "yes, hardware *is* ok again", and only
then do the full sync.

Philipp and others, please comment.

> or should I bring drbd down and `rm /var/lib/drbd/* -f`, and then bring drbd
> back up forcing a full sync? I have been doing the rm. (for what I think is to
> keep my sanity).

you can also request a full sync by "drbdsetup /dev/nbX replicate"
...

> bar was in secondary and going down, I captured this:
> [root@bar root]# cat /proc/drbd
<snip/>

At first glance it looks like the other devices were indeed idle at
that time, so they don't break. Some time later the first device sees
the io-error and panics the box, and that is when the other devices
lose connection, too.
As long as nothing is written, we won't notice broken hardware, would we?
Of course, we could *probe* the hardware at intervals, but... naaaa,
not our business, is it?

One could improve the situation if drbd "knows" which lo devs
(lower-level devices) are basically the same hardware: then if one of
them fails, all others fail at the same time, idle or not...

No, I won't go and try to write some autodetection mechanism.
But this may be an interesting additional config option for drbd 0.7.
Similar to "sync-group", we could have a "lo-dev-group".
This makes sense, since DRBD 0.7 does not panic by default, but
goes into "diskless" mode. If several devices share the same lo-dev,
each of them in turn would need to wait for the hw-driver retry timeout
cycle to recognize what we could have known already...
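
A rough sketch of what such a "lo-dev-group" could do, written in Python
purely as an illustration. Nothing like this exists in drbd; the class
LoDevGroups, its methods, and the group names are all hypothetical. The idea:
the first io-error on any member marks the whole group diskless, so the idle
members do not each wait for their own retry timeout.

```python
from collections import defaultdict


class LoDevGroups:
    """Hypothetical bookkeeping for devices sharing one lower-level device."""

    def __init__(self):
        self._group_of = {}               # device name -> group name
        self._members = defaultdict(set)  # group name -> device names
        self.diskless = set()             # devices already marked diskless

    def add(self, device, group):
        """Declare that `device` sits on the hardware named `group`."""
        self._group_of[device] = group
        self._members[group].add(device)

    def report_io_error(self, device):
        """Mark `device` and all members of its group as diskless.

        Returns the sorted list of devices newly marked by this call.
        """
        group = self._group_of.get(device)
        affected = {device}
        if group is not None:
            affected |= self._members[group]
        newly_marked = affected - self.diskless
        self.diskless |= affected
        return sorted(newly_marked)


if __name__ == "__main__":
    groups = LoDevGroups()
    for name in ("drbd0", "drbd1", "drbd2"):
        groups.add(name, "card0")   # made-up group name
    groups.add("drbd3", "card1")    # different hardware, unaffected
    print(groups.report_io_error("drbd1"))
```

Devices on other hardware (drbd3 in the usage above) stay untouched, which is
the reason to configure the grouping explicitly instead of guessing it.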

	Lars Ellenberg