[DRBD-user] Reproducible ASSERT( os.conn == C_WF_REPORT_PARAMS )

Lars Ellenberg lars.ellenberg at linbit.com
Mon Jul 15 17:49:31 CEST 2013

On Fri, Jul 12, 2013 at 06:10:26PM +0100, Brian Candler wrote:
> I have a setup where I can reliably reproduce the following within a
> few minutes:
> Jul 11 10:59:46 wrn-vm2 kernel: [236603.130604] block drbd0:
> uuid_compare()=-1 by rule 35
> Jul 11 10:59:46 wrn-vm2 kernel: [236603.135779] block drbd0: I shall
> become SyncTarget, but I am primary!

The message above is MUCH more frightening than the line below.

Apparently a node was promoted right in the middle of a resync
handshake, and did not like that at all.

> Jul 11 10:59:46 wrn-vm2 kernel: [236603.142336] block drbd0: ASSERT(
> os.conn == C_WF_REPORT_PARAMS ) in /build/linux-s5x2oE/linux-3.2.46/drivers/block/drbd/drbd_receiver.c:3245

This one is actually of no concern;
it is only a consequence of the above.

> It's on Debian Wheezy with Debian stock kernel (3.2.0-4-amd64).
> Jun 25 15:01:27 wrn-vm1 kernel: [  626.901545] drbd: initialized.
> Version: 8.3.11 (api:88/proto:86-96)

Have you tried a more recent DRBD already?

> Jun 25 15:01:27 wrn-vm1 kernel: [  626.901547] drbd: srcversion:
> F937DCB2E5D83C6CCE4A6C9
> There are more details in this thread:
> https://groups.google.com/forum/#!topic/ganeti/icqLNFk1si0
> I am reproducing it using ganeti, which uses drbd on top of LVM
> logical volumes to replicate virtual machine images. It migrates
> virtual machines by sending drbdsetup commands to switch
> master->slave replication firstly to multi-master, and then to
> slave<-master (apparently by disconnecting and reconnecting). I
> believe there is some sort of race condition going on, because (a)
> it seems few if any other people observe what I see; and (b)
> although I can reproduce the problem within a few minutes, if I
> attach a full-blown strace to the process which is issuing the
> drbdsetup calls, the problem goes away.

In that backend script, add a loop before the promote
that checks that the connection state really is "Connected"
and the disk state really is "UpToDate".

It is probably supposed to already wait in this fashion,
but apparently it does not.
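
A minimal sketch of such a wait loop, in Python since that is what
ganeti is written in. This is not ganeti's actual code: the function
names, the timeout, and the polling interval are made up for
illustration. It assumes the DRBD 8.3 /proc/drbd status line layout
("N: cs:... ro:... ds:local/peer") and treats only the local disk
state as relevant.

```python
import re
import time

# Per-device status line of /proc/drbd as printed by DRBD 8.3, e.g.
#  0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
_STATUS_RE = re.compile(
    r"^\s*(\d+):\s+cs:(\S+)\s+ro:(\S+)\s+ds:(\S+)", re.MULTILINE
)


def parse_proc_drbd(text):
    """Return {minor: (connection_state, local_disk_state)}."""
    states = {}
    for minor, cs, _ro, ds in _STATUS_RE.findall(text):
        # ds is "local/peer"; only the local half is kept here.
        states[int(minor)] = (cs, ds.split("/")[0])
    return states


def safe_to_promote(text, minor):
    """True only if the minor is Connected and its local disk is UpToDate."""
    return parse_proc_drbd(text).get(minor) == ("Connected", "UpToDate")


def read_proc_drbd():
    with open("/proc/drbd") as f:
        return f.read()


def wait_until_safe(minor, timeout=30.0, interval=0.2,
                    read=read_proc_drbd, sleep=time.sleep):
    """Poll until the minor is safe to promote; return False on timeout.

    timeout and interval are illustrative defaults, not recommendations.
    """
    deadline = time.monotonic() + timeout
    while True:
        if safe_to_promote(read(), minor):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

The backend would call something like wait_until_safe(0) right before
issuing the promote for minor 0, and refuse to promote if it returns
False. A state such as WFReportParams or SyncTarget simply keeps the
loop waiting, which is the point: the promote must not land in the
middle of the handshake.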

> The google groups thread includes an strace log of execve() calls,
> so you can see what sequence of drbdsetup calls are being issued. Is
> it possible that ganeti is taking an unsafe approach to switching
> over the drbd state?
> Regards,
> Brian Candler.

: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
please don't Cc me, but send to list   --   I'm subscribed
