Thanks for the reply. Answers to your questions below.

Lars Ellenberg wrote:
> On Wed, Jul 18, 2007 at 04:19:02PM -0500, alex at crackpot.org wrote:
>> My 2 drbd boxen are called 42 and 43.
>> drbd version: 0.7.16 (api:77/proto:74)
>>
>> * Today, 42 was primary.
>> * A co-worker noticed that it was not connected to 43. (42 =
>> 'st:Primary/Unknown ld:Consistent', 43 = 'st:Secondary/Unknown
>> ld:Consistent')
>> * I saw that 43 said 'cs:WFConnection'. Co-worker did 'drbdadm
>> connect' on 42, and it kernel paniced.
>
> what cs: was 42 in, before the "drbdadm connect" ?

Looks like it was in 'WFReportParams'. These are the last 'drbd' notices in
/var/log/messages before yesterday.

Jul 12 05:37:42 dellpe2850-42 kernel: drbd0: [kjournald/5686] sock_sendmsg time expired, ko = 4294967295
Jul 12 05:38:03 dellpe2850-42 kernel: drbd0: [kjournald/5686] sock_sendmsg time expired, ko = 4294967295
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: PingAck did not arrive in time.
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_asender [13219]: cstate Connected --> NetworkFailure
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: asender terminated
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]: cstate NetworkFailure --> BrokenPipe
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: short read expecting header on sock: r=-512
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: short sent UnplugRemote size=8 sent=-1001
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: worker terminated
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]: cstate BrokenPipe --> Unconnected
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: Connection lost.
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]: cstate Unconnected --> WFConnection
Jul 12 05:38:26 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]: cstate WFConnection --> WFReportParams

> what is in the kernel logs,

Jul 18 12:51:19 dellpe2850-42 kernel: drbd0: interrupted during initial handshake
Jul 18 12:51:19 dellpe2850-42 kernel: drbd0: worker terminated
Jul 18 12:51:19 dellpe2850-42 kernel: Unable to handle kernel NULL pointer dereference at 000000000000080c RIP:

This is the last entry in /var/log/messages before reboot.

> what lead to them being disconnected in the first place?

Most likely temporary network failure.

>
> what does the panic/oops look like?

I have only a screen-shot, so I can't paste in the full panic message.
I will transcribe it in full if you'd like. There's a call trace
containing (among other things) : 'force_sig_info+35',
':drbd:drbd_disconnect+221', ':drbd:drbd_connect+801',
':drbd:drbd_thread_setup'. Final lines are :

Code: 81 7f 04 ad 4e ad de 74 1f 48 8b 74 24 18 48 c7 c7 d2 d5 31
RIP <ffffffff80303d8c>{_spin_lock_irqsave+12} RSP <0000010049f6fe28>
CR2: 000000000000080c
<0>Kernel panic - not syncing: Oops

>
> did it panic in drbd or somewhere else?

drbd

> was it an "intentional" panic?

I'm not sure how to answer that.

>
>> * 43 took over as primary as it should.
> (with out-of-date data)
>
>> * When 42 was rebooted, it entered Secondary status and performed a
>> sync of data from 43. Since the 2 boxes had been disconnected for
>> several days, the data on 43 was old, and the newer data from 42 was
>> overwritten.
>
> yes.
>
>> We're getting backup restores from tape. We've added better
>> monitoring to catch when drbd disconnects in the future.
>>
>> I am writing because up to this point I thought that a 'drbdadm
>> connect' was a fairly safe command to issue. Are there circumstances
>> under which it should not be done, or which may cause a panic as we
>> saw today?
>
> those would be bugs.
> some of them might be fixed already,
> you are 0.7.16, we are 0.7.24?
>
>> Would doing 'drbdadm disconnect' before 'drbdadm connect'
>> have made a difference?
>
> hard to say. maybe. probably not.
>
>> If the 2 boxes disconnect in the future (for network failure or
>> whatever other reason), what is the safe way to get them talking again?
>
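
For anyone wanting a similar disconnect check, here is a rough sketch of the kind of monitoring mentioned above. It is not the script we actually run. It assumes that each device line in /proc/drbd carries the same 'cs:', 'st:' and 'ld:' fields quoted earlier in this message; the function name find_problems and the regular expression are made up for illustration.

```python
import re

# Assumed shape of a DRBD 0.7 device line in /proc/drbd, for example:
#   " 0: cs:WFConnection st:Primary/Unknown ld:Consistent"
# The field names come from the status strings quoted in this thread;
# the exact layout is an assumption and may differ between versions.
STATUS_RE = re.compile(
    r"^\s*(\d+):\s+cs:(\S+)\s+st:([^/\s]+)/([^/\s]+)\s+ld:(\S+)"
)


def find_problems(proc_drbd_text):
    """Return one dict per device that is not Connected or not Consistent."""
    problems = []
    for line in proc_drbd_text.splitlines():
        match = STATUS_RE.match(line)
        if match is None:
            continue  # version banner, counters, blank lines
        minor, cs, local_role, peer_role, ld = match.groups()
        if cs != "Connected" or ld != "Consistent":
            problems.append({
                "minor": int(minor),
                "cs": cs,
                "local_role": local_role,
                "peer_role": peer_role,
                "ld": ld,
            })
    return problems


if __name__ == "__main__":
    try:
        with open("/proc/drbd") as handle:
            text = handle.read()
    except OSError:
        print("cannot read /proc/drbd (is the drbd module loaded?)")
    else:
        for problem in find_problems(text):
            print("drbd%(minor)d: cs:%(cs)s st:%(local_role)s/%(peer_role)s "
                  "ld:%(ld)s" % problem)
```

Run from cron, this would have reported 42 sitting in 'WFReportParams' on Jul 12 instead of us finding out six days later. Note that it flags every state other than 'Connected', so it will also report a device while it is resyncing.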