/ 2004-04-14 15:09:10 -0500
\ Todd Denniston:
> > > on a positive note, about 22 seconds into the secondaries Card dump (I think
> > > this is the time that the lockup starts) I got the following on the active
> > > primary:
> > > Apr 13 15:36:23 foo kernel: drbd1: sock_sendmsg time expired on sock
> > > Apr 13 15:36:23 foo kernel: drbd1: no data sent since 10 ping intervals, peer
> > > seems knocked out: going to StandAlone.
> > > Apr 13 15:36:23 foo kernel: drbd1: Connection lost.
> >
> > Thats what I expect.
> >
> I kind of wonder why I only get that on one of the 7 drbd*'s I have though.
> the rest I just get 'Connection lost.' messages over the next couple of
> minutes, instead of all at once.
>
> Apr 14 13:38:24 foo kernel: drbd1: sock_sendmsg time expired on sock
> Apr 14 13:38:24 foo kernel: drbd1: no data sent since 10 ping intervals, peer
> seems knocked out: going to StandAlone.
> Apr 14 13:38:24 foo kernel: drbd1: Connection lost.
> Apr 14 13:40:04 foo kernel: drbd5: Connection lost.
> Apr 14 13:40:05 foo kernel: drbd2: Connection lost.
> Apr 14 13:40:05 foo kernel: drbd4: Connection lost.
> Apr 14 13:40:10 foo kernel: drbd6: Connection lost.
> Apr 14 13:40:10 foo kernel: drbd0: Connection lost.
> Apr 14 13:40:10 foo kernel: drbd3: Connection lost.
> Apr 14 13:41:57 foo ipfail[4144]: info: Status update: Node bar now
> has status dead

maybe they did not transfer any data blocks at that time?
or not enough so they still fit in the local tcp send buffer?

> If I do not change the drbd.config file, I should not have any problem
> bringing the machine down running a kernel with 0.6.10, and bringing
> it backup with 0.6.12, correct?

right.

> BTW When drbd puts the kernel, on the secondary machine, into
> kernel-panic and you bring the secondary machine back up... is a
> quick sync really enough???

good question.
probably not, and it should be written into the meta data first, so it
will ensure a full sync, or maybe even refuse to do anything until
the operator confirmed: "yes, hardware *is* ok again.", and only then do
the full sync.

Philipp and others, please comment.

> or should I bring drbd down and `rm /var/lib/drbd/* -f`, and then bring drbd
> back up forcing a full sync? I have been doing the rm. (for what I think is to
> keep my sanity).

you can also request a full sync by
"drbdsetup /dev/nbX replicate" ...

> bar was in secondary and going down, I captured this:
> [root@bar root]# cat /proc/drbd

<snip/>

at first glance it looks like the other devices indeed were idle at
that time, so they won't break it.
some time later the first device sees the io-error and panics the box,
and that is when the other devices lose connection, too.

as long as nothing is written, we won't notice broken hardware, would we?
of course, we could *probe* the hardware at intervals, but... naaaa
not our business, is it?

one could improve the situation, if drbd "knows" which lo devs are
basically the same hardware, and then if one of them fails, all others
fail at the same time, idle or not...

No, I won't go and try to write some autodetection mechanism.
But this may be an interesting additional config thing for drbd 0.7.
Similar to "sync-group", we can have a "lo-dev-group":

This makes sense, since DRBD 0.7 does not panic by default, but goes
into "diskless" mode. If several devices share the same lo-dev, each of
them in turn would need to wait for the hw-driver retry timeout cycle to
recognize what we could have known already...

	Lars Ellenberg
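
To illustrate the "lo-dev-group" idea from the mail above, here is a
minimal sketch in Python. It is not DRBD code and not actual drbd.conf
syntax: the Device class, the lo_dev_group field, the state names and
fail_group() are all hypothetical, made up for this example. It only
shows the intended behaviour: once one device sees an io-error, every
device in the same group is put into "diskless" at once, idle or not,
instead of each one waiting for its own hw-driver retry timeout.

```python
from dataclasses import dataclass


@dataclass
class Device:
    # All names here are hypothetical; DRBD itself has no such structure.
    name: str
    lo_dev_group: int         # devices with the same number share hardware
    state: str = "attached"   # "attached" or "diskless"


def fail_group(devices, failed_name):
    """Mark every device in the failed device's group as diskless.

    Returns the names of the devices whose state was changed.
    """
    by_name = {d.name: d for d in devices}
    group = by_name[failed_name].lo_dev_group
    changed = []
    for d in devices:
        if d.lo_dev_group == group and d.state != "diskless":
            d.state = "diskless"
            changed.append(d.name)
    return changed


# drbd0 and drbd1 sit on the same (assumed) controller, drbd2 does not.
devices = [Device("drbd0", 0), Device("drbd1", 0), Device("drbd2", 1)]
changed = fail_group(devices, "drbd1")
print(changed)
```

With the grouping assumed here, an io-error on drbd1 also takes the idle
drbd0 to "diskless" immediately, while drbd2 on other hardware is left
alone.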