On Thu, Jul 19, 2007 at 01:58:38AM -0700, Alex Dean wrote:
> Thanks for the reply. Answers to your questions below.
>
> Lars Ellenberg wrote:
>
> >On Wed, Jul 18, 2007 at 04:19:02PM -0500, alex at crackpot.org wrote:
> >>My 2 drbd boxen are called 42 and 43.
> >>drbd version: 0.7.16 (api:77/proto:74)
> >>
> >>* Today, 42 was primary.
> >>* A co-worker noticed that it was not connected to 43. (42 =
> >>'st:Primary/Unknown ld:Consistent', 43 = 'st:Secondary/Unknown
> >>ld:Consistent')
> >>* I saw that 43 said 'cs:WFConnection'. Co-worker did 'drbdadm
> >>connect' on 42, and it kernel paniced.
> >
> >what cs: was 42 in, before the "drbdadm connect" ?
>
> Looks like it was in 'WFReportParams'. This is the last 'drbd' notices
> in /var/log/messages before yesterday.
>
> Jul 12 05:37:42 dellpe2850-42 kernel: drbd0: [kjournald/5686]
> sock_sendmsg time expired, ko = 4294967295
> Jul 12 05:38:03 dellpe2850-42 kernel: drbd0: [kjournald/5686]
> sock_sendmsg time expired, ko = 4294967295
> Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: PingAck did not arrive in time.
> Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_asender [13219]:
> cstate Connected --> NetworkFailure
> Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: asender terminated
> Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]:
> cstate NetworkFailure --> BrokenPipe
> Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: short read expecting header
> on sock: r=-512
> Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: short sent UnplugRemote
> size=8 sent=-1001
> Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: worker terminated
> Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]:
> cstate BrokenPipe --> Unconnected
> Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: Connection lost.
> Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]:
> cstate Unconnected --> WFConnection
> Jul 12 05:38:26 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]:
> cstate WFConnection --> WFReportParams
>
> >what is in the kernel logs,
>
> Jul 18 12:51:19 dellpe2850-42 kernel: drbd0: interrupted during initial
> handshake
> Jul 18 12:51:19 dellpe2850-42 kernel: drbd0: worker terminated
> Jul 18 12:51:19 dellpe2850-42 kernel: Unable to handle kernel NULL
> pointer dereference at 000000000000080c RIP:
>
> This is the last entry in /var/log/messages before reboot.
>
> >what lead to them being disconnected in the first place?
>
> Most likely temporary network failure.
>
> >
> >what does the panic/oops look like?
>
> I have only a screen-shot, so I can't paste in the full panic message.
> I will transcribe it in full if you'd like. There's a call trace
> containing (among other things) : 'force_sig_info+35',

ok...
interrupted during initial handshake,
then NULL pointer dereference in force_sig_info...

this appears to be because of a race-condition bug I remember vaguely.
I'm not sure exactly which changelog item of which 0.7.x this
corresponds to, but it should be fixed in the newest 0.7.

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
__
please use the "List-Reply" function of your email client.