[DRBD-user] 'drbdadm connect' panic?

Thu Jul 19 10:58:38 CEST 2007

Thanks for the reply.  Answers to your questions below.

Lars Ellenberg wrote:

> On Wed, Jul 18, 2007 at 04:19:02PM -0500, alex at crackpot.org wrote:
>> My 2 drbd boxen are called 42 and 43.
>> drbd version: 0.7.16 (api:77/proto:74)
>>
>> * Today, 42 was primary.
>> * A co-worker noticed that it was not connected to 43.  (42 =  
>> 'st:Primary/Unknown ld:Consistent', 43 = 'st:Secondary/Unknown  
>> ld:Consistent')
>> * I saw that 43 said 'cs:WFConnection'.  Co-worker did 'drbdadm  
>> connect' on 42, and the kernel panicked.
> 
> what cs: was 42 in, before the "drbdadm connect" ?

Looks like it was in 'WFReportParams'.  These are the last 'drbd' 
messages in /var/log/messages before yesterday:

Jul 12 05:37:42 dellpe2850-42 kernel: drbd0: [kjournald/5686] sock_sendmsg time expired, ko = 4294967295
Jul 12 05:38:03 dellpe2850-42 kernel: drbd0: [kjournald/5686] sock_sendmsg time expired, ko = 4294967295
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: PingAck did not arrive in time.
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_asender [13219]: cstate Connected --> NetworkFailure
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: asender terminated
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]: cstate NetworkFailure --> BrokenPipe
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: short read expecting header on sock: r=-512
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: short sent UnplugRemote size=8 sent=-1001
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: worker terminated
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]: cstate BrokenPipe --> Unconnected
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: Connection lost.
Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]: cstate Unconnected --> WFConnection
Jul 12 05:38:26 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]: cstate WFConnection --> WFReportParams
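
To follow the state changes without reading every line, the
'cstate A --> B' messages can be pulled out of the log mechanically.
A minimal sketch in Python, assuming each log entry sits on a single
line; the name cstate_transitions is made up for illustration and is
not part of drbd:

```python
import re

# Matches the "cstate Old --> New" part of a drbd kernel log line.
CSTATE_RE = re.compile(r"cstate (\w+) --> (\w+)")


def cstate_transitions(lines):
    """Return (old, new) connection-state pairs in the order they appear."""
    transitions = []
    for line in lines:
        match = CSTATE_RE.search(line)
        if match:
            transitions.append((match.group(1), match.group(2)))
    return transitions


# Three entries taken from the log excerpt above, one per line.
sample = [
    "Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: drbd0_asender [13219]: "
    "cstate Connected --> NetworkFailure",
    "Jul 12 05:38:06 dellpe2850-42 kernel: drbd0: asender terminated",
    "Jul 12 05:38:26 dellpe2850-42 kernel: drbd0: drbd0_receiver [17511]: "
    "cstate WFConnection --> WFReportParams",
]

if __name__ == "__main__":
    for old, new in cstate_transitions(sample):
        print(old, "-->", new)
```

The second element of the last pair returned is the state the node was
left in, which is how I read 'WFReportParams' off the log.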

> what is in the kernel logs,

Jul 18 12:51:19 dellpe2850-42 kernel: drbd0: interrupted during initial handshake
Jul 18 12:51:19 dellpe2850-42 kernel: drbd0: worker terminated
Jul 18 12:51:19 dellpe2850-42 kernel: Unable to handle kernel NULL pointer dereference at 000000000000080c RIP:

This is the last entry in /var/log/messages before reboot.

> what led to them being disconnected in the first place?

Most likely a temporary network failure.
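
A check along these lines could flag a disconnect like this one much
sooner.  This is only a sketch: it assumes a status line carrying the
cs:, st: and ld: fields quoted earlier in this thread (the sample line
below is assembled from those fields, not copied from a real
/proc/drbd), and parse_status and needs_attention are hypothetical
names.

```python
import re

# key:value tokens such as "cs:WFConnection" or "st:Primary/Unknown".
FIELD_RE = re.compile(r"\b(cs|st|ld):(\S+)")


def parse_status(line):
    """Return the cs/st/ld fields found in one status line as a dict."""
    return dict(FIELD_RE.findall(line))


def needs_attention(fields):
    """True if the connection state is missing or not 'Connected'.

    This also flags any resync state; treating those as alerts is a
    choice made for this sketch, not something drbd prescribes.
    """
    return fields.get("cs") != "Connected"


# Assumed sample line, built from the fields quoted in this thread.
sample = " 0: cs:WFConnection st:Secondary/Unknown ld:Consistent"

if __name__ == "__main__":
    fields = parse_status(sample)
    if needs_attention(fields):
        print("drbd not connected:", fields)
```

Run from cron against the real status output, something like this
would send an alert as soon as cs: is anything other than Connected.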

> 
> what does the panic/oops look like?

I have only a screenshot, so I can't paste the full panic message. 
I will transcribe it in full if you'd like.  The call trace 
contains (among other things): 'force_sig_info+35', 
':drbd:drbd_disconnect+221', ':drbd:drbd_connect+801', 
':drbd:drbd_thread_setup'.

The final lines are:
Code: 81 7f 04 ad 4e ad de 74 1f 48 8b 74 24 18 48 c7 c7 d2 d5 31
RIP <ffffffff80303d8c>{_spin_lock_irqsave+12} RSP <0000010049f6fe28>
CR2: 000000000000080c
   <0>Kernel panic - not syncing: Oops

> 
> did it panic in drbd or somewhere else?

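In drbd, as far as I can tell: the faulting RIP is in 
_spin_lock_irqsave, but the call trace runs through drbd_connect and 
drbd_disconnect.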

> was it an "intentional" panic?

I'm not sure how to answer that.

> 
>> * 43 took over as primary as it should.
>  (with out-of-date data)
> 
>> * When 42 was rebooted, it entered Secondary status and performed a  
>> sync of data from 43.  Since the 2 boxes had been disconnected for  
>> several days, the data on 43 was old, and the newer data from 42 was  
>> overwritten.
> 
> yes.
> 
>> We're getting backup restores from tape.  We've added better  
>> monitoring to catch when drbd disconnects in the future.
>>
>> I am writing because up to this point I thought that a 'drbdadm  
>> connect' was a fairly safe command to issue.  Are there circumstances  
>> under which it should not be done, or which may cause a panic as we  
>> saw today?
> 
> those would be bugs.
> some of them might be fixed already,
> you are 0.7.16, we are 0.7.24?
> 
>> Would doing 'drbdadm disconnect' before 'drbdadm connect'  
>> have made a difference?
> 
> hard to say. maybe. probably not.
> 
>> If the 2 boxes disconnect in the future (for network failure or  
>> whatever other reason), what is the safe way to get them talking again?