Thanks for the reply. Answers to your questions below.
Lars Ellenberg wrote:
> On Wed, Jul 18, 2007 at 04:19:02PM -0500, alex at crackpot.org wrote:
>> My 2 drbd boxen are called 42 and 43.
>> drbd version: 0.7.16 (api:77/proto:74)
>>
>> * Today, 42 was primary.
>> * A co-worker noticed that it was not connected to 43. (42 =
>> 'st:Primary/Unknown ld:Consistent', 43 = 'st:Secondary/Unknown
>> ld:Consistent')
>> * I saw that 43 said 'cs:WFConnection'. Co-worker did 'drbdadm
>> connect' on 42, and it kernel panicked.
>
> what cs: was 42 in, before the "drbdadm connect" ?
Looks like it was in 'WFReportParams'. These are the last 'drbd' notices
in /var/log/messages before yesterday.
Jul 12 05:37:42 dellpe2850-42 kernel: drbd0: [kjournald/5686] sock_sendmsg time expired, ko = 4294967295
Jul 12 05:38:03 dellpe2850-42 kernel: drbd0: [kjournald/5686] sock_sendmsg time expired, ko = 4294967295
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: PingAck did not arrive in time.
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_asender [13219]: cstate Connected --> NetworkFailure
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: asender terminated
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]: cstate NetworkFailure --> BrokenPipe
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: short read expecting header on sock: r=-512
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: short sent UnplugRemote size=8 sent=-1001
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: worker terminated
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]: cstate BrokenPipe --> Unconnected
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: Connection lost.
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]: cstate Unconnected --> WFConnection
Jul 12 05:38:26 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]: cstate WFConnection --> WFReportParams
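
The cstate transitions in that log can be pulled out mechanically, which is one way to catch a node that has dropped out of 'Connected'. What follows is only a rough sketch, not anything shipped with drbd: it assumes each log entry sits on a single line (as it does in /var/log/messages itself, before mail wrapping) and matches the 'cstate A --> B' wording shown above. The function names are made up for illustration.

```python
import re

# Matches kernel log lines of the form seen above, e.g.
#   "... kernel: drbd0: drbd0_asender [13219]: cstate Connected --> NetworkFailure"
# The pattern is an assumption based on the 0.7.16 messages quoted here.
CSTATE_RE = re.compile(
    r"kernel: (?P<dev>drbd\d+): .*?cstate (?P<old>\w+) --> (?P<new>\w+)"
)


def cstate_transitions(lines):
    """Yield (device, old_state, new_state) for each cstate change found."""
    for line in lines:
        m = CSTATE_RE.search(line)
        if m:
            yield m.group("dev"), m.group("old"), m.group("new")


def last_cstate(lines):
    """Return the most recent logged connection state per device."""
    state = {}
    for dev, _old, new in cstate_transitions(lines):
        state[dev] = new
    return state


# Three of the entries from the log above, unwrapped to one line each.
sample = [
    "Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_asender [13219]: "
    "cstate Connected --> NetworkFailure",
    "Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: Connection lost.",
    "Jul 12 05:38:26 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]: "
    "cstate WFConnection --> WFReportParams",
]

if __name__ == "__main__":
    for dev, state in last_cstate(sample).items():
        # Anything other than "Connected" is worth an alert.
        print(dev, state)
```

Fed the full log, this would have reported drbd0 as stuck in 'WFReportParams' from Jul 12 onward.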
> what is in the kernel logs,
Jul 18 12:51:19 dellpe2850-42 kernel: drbd0: interrupted during initial handshake
Jul 18 12:51:19 dellpe2850-42 kernel: drbd0: worker terminated
Jul 18 12:51:19 dellpe2850-42 kernel: Unable to handle kernel NULL pointer dereference at 000000000000080c RIP:
This is the last entry in /var/log/messages before reboot.
> what led to them being disconnected in the first place?
Most likely temporary network failure.
>
> what does the panic/oops look like?
I have only a screen-shot, so I can't paste in the full panic message.
I will transcribe it in full if you'd like. There's a call trace
containing (among other things) : 'force_sig_info+35',
':drbd:drbd_disconnect+221', ':drbd:drbd_connect+801',
':drbd:drbd_thread_setup'.
Final lines are :
Code: 81 7f 04 ad 4e ad de 74 1f 48 8b 74 24 18 48 c7 c7 d2 d5 31
RIP <ffffffff80303d8c>{_spin_lock_irqsave+12} RSP <0000010049f6fe28>
CR2: 000000000000080c
<0>Kernel panic - not syncing: Oops
>
> did it panic in drbd or somewhere else?
drbd
> was it an "intentional" panic?
I'm not sure how to answer that.
>
>> * 43 took over as primary as it should.
> (with out-of-date data)
>
>> * When 42 was rebooted, it entered Secondary status and performed a
>> sync of data from 43. Since the 2 boxes had been disconnected for
>> several days, the data on 43 was old, and the newer data from 42 was
>> overwritten.
>
> yes.
>
>> We're getting backup restores from tape. We've added better
>> monitoring to catch when drbd disconnects in the future.
>>
>> I am writing because up to this point I thought that a 'drbdadm
>> connect' was a fairly safe command to issue. Are there circumstances
>> under which it should not be done, or which may cause a panic as we
>> saw today?
>
> those would be bugs.
> some of them might be fixed already,
> you are 0.7.16, we are 0.7.24?
>
>> Would doing 'drbdadm disconnect' before 'drbdadm connect'
>> have made a difference?
>
> hard to say. maybe. probably not.
>
>> If the 2 boxes disconnect in the future (for network failure or
>> whatever other reason), what is the safe way to get them talking again?
>
>
>