[DRBD-user] Reproducible ASSERT( os.conn == C_WF_REPORT_PARAMS )

Fri Jul 12 19:10:26 CEST 2013

I have a setup where I can reliably reproduce the following within a few 
minutes:

Jul 11 10:59:46 wrn-vm2 kernel: [236603.130604] block drbd0: uuid_compare()=-1 by rule 35
Jul 11 10:59:46 wrn-vm2 kernel: [236603.135779] block drbd0: I shall become SyncTarget, but I am primary!
Jul 11 10:59:46 wrn-vm2 kernel: [236603.142336] block drbd0: ASSERT( os.conn == C_WF_REPORT_PARAMS ) in /build/linux-s5x2oE/linux-3.2.46/drivers/block/drbd/drbd_receiver.c:3245

This is on Debian Wheezy with the stock Debian kernel (3.2.0-4-amd64).

Jun 25 15:01:27 wrn-vm1 kernel: [  626.901545] drbd: initialized. Version: 8.3.11 (api:88/proto:86-96)
Jun 25 15:01:27 wrn-vm1 kernel: [  626.901547] drbd: srcversion: F937DCB2E5D83C6CCE4A6C9

There are more details in this thread:
https://groups.google.com/forum/#!topic/ganeti/icqLNFk1si0

I am reproducing it with ganeti, which uses drbd on top of LVM logical 
volumes to replicate virtual machine images. It migrates virtual 
machines by issuing drbdsetup commands that switch master->slave 
replication first to multi-master, and then to slave<-master 
(apparently by disconnecting and reconnecting). I believe there is some 
sort of race condition going on, because (a) few if any other people 
seem to observe what I see; and (b) although I can reproduce the problem 
within a few minutes, if I attach a full strace to the process which is 
issuing the drbdsetup calls, the problem goes away.
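
Since strace changes the timing, a lighter way to observe the switchover 
might be to sample the connection state and roles between drbdsetup 
calls. The following is only a minimal sketch, not something ganeti 
does: it assumes the usual 8.3-style /proc/drbd status line 
(" 0: cs:... ro:.../... ds:.../..."), and the names parse_proc_drbd and 
read_states are hypothetical helpers of my own.

```python
import re

# Assumption: a DRBD 8.3 status line in /proc/drbd looks like
#   " 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----"
# An unconfigured minor may show only "cs:Unconfigured", so the roles
# are treated as optional.
STATUS_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)(?:\s+ro:([^/\s]+)/(\S+))?")


def parse_proc_drbd(text):
    """Map each minor number to (conn_state, local_role, peer_role).

    Roles are None when the status line does not carry an ro: field.
    Lines that are not per-minor status lines are ignored.
    """
    states = {}
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            minor = int(match.group(1))
            states[minor] = (match.group(2), match.group(3), match.group(4))
    return states


def read_states(path="/proc/drbd"):
    """Read the live status file and parse it (hypothetical helper)."""
    with open(path) as handle:
        return parse_proc_drbd(handle.read())
```

Calling read_states() before and after each drbdsetup invocation and 
logging the result for minor 0 would show which connection state and 
roles the device was in when the next command was issued, with far less 
overhead than a full strace.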

The Google Groups thread includes an strace log of execve() calls, so 
you can see what sequence of drbdsetup calls is being issued. Is it 
possible that ganeti is taking an unsafe approach to switching over the 
drbd state?

Regards,

Brian Candler.