Lars Ellenberg wrote:
>
> / 2004-04-13 16:48:09 -0500
> \ Todd Denniston:
> > Darn, I was not doing anything using the disk at the time it had a problem
> > today, and missed the slow spot.
> >
> > on a positive note, about 22 seconds into the secondaries Card dump (I think
> > this is the time that the lockup starts) I got the following on the active
> > primary:
> > Apr 13 15:36:23 foo kernel: drbd1: sock_sendmsg time expired on sock
> > Apr 13 15:36:23 foo kernel: drbd1: no data sent since 10 ping intervals, peer
> > seems knocked out: going to StandAlone.
> > Apr 13 15:36:23 foo kernel: drbd1: Connection lost.
>
> Thats what I expect.
>
I kind of wonder why I only get that on one of the 7 drbd*'s I have though.
the rest I just get 'Connection lost.' messages over the next couple of minutes,
instead of all at once.

Apr 14 13:38:24 foo kernel: drbd1: sock_sendmsg time expired on sock
Apr 14 13:38:24 foo kernel: drbd1: no data sent since 10 ping intervals, peer seems knocked out: going to StandAlone.
Apr 14 13:38:24 foo kernel: drbd1: Connection lost.
Apr 14 13:40:04 foo kernel: drbd5: Connection lost.
Apr 14 13:40:05 foo kernel: drbd2: Connection lost.
Apr 14 13:40:05 foo kernel: drbd4: Connection lost.
Apr 14 13:40:10 foo kernel: drbd6: Connection lost.
Apr 14 13:40:10 foo kernel: drbd0: Connection lost.
Apr 14 13:40:10 foo kernel: drbd3: Connection lost.
Apr 14 13:41:57 foo ipfail[4144]: info: Status update: Node bar now has status dead

> > so was the 'ko count down' messages something that you might have added or
> > fixed in 0.6.1[12]?
>
> Probably. Wait a minute ...
> yes. made it more verbose and improved it later than 0.6.10
> so what about an update to 0.6.12 ...
>
As I had tested, and until now, was confident with 0.6.10 I hesitate to update
it. But stability looks like something I don't have now any way, so by friday I
will update it.
If I do not change the drbd.config file, I should not have any problem bringing
the machine down running a kernel with 0.6.10, and bringing it backup with
0.6.12, correct?

BTW
When drbd puts the kernel, on the secondary machine, into kernel-panic and you
bring the secondary machine back up... is a quick sync really enough??? or
should I bring drbd down and `rm /var/lib/drbd/* -f`, and then bring drbd back
up forcing a full sync?
I have been doing the rm. (for what I think is to keep my sanity).

bar was in secondary and going down, I captured this:
[root@bar root]# cat /proc/drbd
version: 0.6.10 (api:64/proto:62)

0: cs:Connected st:Secondary/Primary ns:0 nr:3908560 dw:3908560 dr:0 pe:0 ua:0
1: cs:Connected st:Secondary/Primary ns:0 nr:243294580 dw:243294580 dr:0 pe:0 ua:103
2: cs:Connected st:Secondary/Primary ns:0 nr:159564604 dw:159564604 dr:0 pe:0 ua:0
3: cs:Connected st:Secondary/Primary ns:0 nr:160011588 dw:160011588 dr:0 pe:0 ua:0
4: cs:Connected st:Secondary/Primary ns:0 nr:81971464 dw:81971464 dr:0 pe:0 ua:0
5: cs:Connected st:Secondary/Primary ns:0 nr:77534192 dw:77534192 dr:0 pe:0 ua:0
6: cs:Connected st:Secondary/Primary ns:0 nr:79765260 dw:79765260 dr:0 pe:0 ua:0

#about 20 seconds later
[root@bar root]# cat /proc/drbd
version: 0.6.10 (api:64/proto:62)

0: cs:Connected st:Secondary/Primary ns:0 nr:3908560 dw:3908560 dr:0 pe:0 ua:0
1: cs:Connected st:Secondary/Primary ns:0 nr:243294580 dw:243294580 dr:0 pe:0 ua:103
2: cs:Connected st:Secondary/Primary ns:0 nr:159564604 dw:159564604 dr:0 pe:0 ua:0
3: cs:Connected st:Secondary/Primary ns:0 nr:160011588 dw:160011588 dr:0 pe:0 ua:0
4: cs:Connected st:Secondary/Primary ns:0 nr:81971464 dw:81971464 dr:0 pe:0 ua:0
5: cs:Connected st:Secondary/Primary ns:0 nr:77534192 dw:77534192 dr:0 pe:0 ua:0
6: cs:Connected st:Secondary/Primary ns:0 nr:79765260 dw:79765260 dr:0 pe:0 ua:0

and at about the same time on foo (primary at the time)
version: 0.6.10 (api:64/proto:62)

0: cs:Connected st:Primary/Secondary ns:7827768 nr:0 dw:107504 dr:7726281 pe:0 ua:0
1: cs:StandAlone st:Primary/Unknown ns:674185748 nr:0 dw:10200048 dr:837548481 pe:0 ua:0
2: cs:Connected st:Primary/Secondary ns:318957224 nr:0 dw:664112 dr:334533553 pe:0 ua:0
3: cs:Connected st:Primary/Secondary ns:367069060 nr:0 dw:48766152 dr:396337957 pe:0 ua:0
4: cs:Connected st:Primary/Secondary ns:159589372 nr:0 dw:4532896 dr:155788905 pe:0 ua:0
5: cs:Connected st:Primary/Secondary ns:155072972 nr:0 dw:17936 dr:155074341 pe:0 ua:0
6: cs:Connected st:Primary/Secondary ns:159535520 nr:0 dw:218360 dr:177417773 pe:0 ua:0

#about 70 seconds later
version: 0.6.10 (api:64/proto:62)

0: cs:Connected st:Primary/Secondary ns:7827768 nr:0 dw:107504 dr:7726281 pe:0 ua:0
1: cs:StandAlone st:Primary/Unknown ns:674185748 nr:0 dw:10205984 dr:837548489 pe:0 ua:0
2: cs:WFConnection st:Primary/Unknown ns:318957224 nr:0 dw:664112 dr:334533553 pe:0 ua:0
3: cs:Connected st:Primary/Secondary ns:367069060 nr:0 dw:48766152 dr:396337957 pe:0 ua:0
4: cs:WFConnection st:Primary/Unknown ns:159589372 nr:0 dw:4532896 dr:155788905 pe:0 ua:0
5: cs:WFConnection st:Primary/Unknown ns:155072972 nr:0 dw:17936 dr:155074341 pe:0 ua:0
6: cs:Connected st:Primary/Secondary ns:159535520 nr:0 dw:218360 dr:177417773 pe:0 ua:0

I now have script doing
while true ; do sleep 10; date >> /root/drbdwatch ; \
cat /proc/drbd >> /root/drbdwatch ; done&
to get better data on both systems.

-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane)
Harnessing the Power of Technology for the Warfighter
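
A possible companion to the watch loop above, offered only as a sketch and not as anything posted in this thread: a small filter that reads a /proc/drbd snapshot and prints just the devices whose cs: field is not Connected, which may make it easier to see the order in which devices drop. The function name drbd_not_connected is hypothetical, and the parsing assumes the 0.6.x line layout shown in the snapshots above (device number, then cs:, one device per line).

```shell
#!/bin/sh
# Hypothetical helper: reads /proc/drbd-style text on standard input and
# prints "drbdN <state>" for every device whose cs: field is not "Connected".
# Assumes the 0.6.x layout pasted in this message: "N: cs:<state> st:... ns:...".
drbd_not_connected() {
    awk '
        # Match device lines such as "1: cs:StandAlone st:Primary/Unknown ..."
        /^[ \t]*[0-9]+:[ \t]+cs:/ {
            dev = $1
            sub(/:$/, "", dev)        # "1:" becomes "1"
            state = $2
            sub(/^cs:/, "", state)    # "cs:StandAlone" becomes "StandAlone"
            if (state != "Connected")
                print "drbd" dev, state
        }
    '
}

# Example use on a live system (not run here):
#   drbd_not_connected < /proc/drbd
```

Run against the second snapshot taken on foo, this would list drbd1 as StandAlone and drbd2, drbd4 and drbd5 as WFConnection. Inside the existing loop it could be called as `date; drbd_not_connected < /proc/drbd`, alongside the full dump, not replacing it.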