Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all,

I have an issue to report here. I assume this behavior is intended, but this time it caused problems even with a clean shutdown.

First, my setup: a single primary (a virtual machine) and a single secondary (a physical box). Due to a power outage I had to shut down both boxes. A clean shutdown, no hard power off! I shut down the secondary first, followed by the primary.

This is the log from the primary:

Sep 6 08:11:20 backuppc shutdown[20850]: shutting down for system halt
Sep 6 08:11:23 backuppc kernel: drbd0: State change failed: Device is held open by someone
Sep 6 08:11:23 backuppc kernel: drbd0: state = { cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate r--- }
Sep 6 08:11:23 backuppc kernel: drbd0: wanted = { cs:Connected st:Secondary/Secondary ds:UpToDate/UpToDate r--- }
Sep 6 08:11:26 backuppc kernel: VMware memory control driver unloaded
Sep 6 08:11:26 backuppc kernel: Removing vmci device
Sep 6 08:11:26 backuppc kernel: Resetting vmci device
Sep 6 08:11:26 backuppc kernel: Unregistered vmci device.
Sep 6 08:11:26 backuppc kernel: ACPI: PCI interrupt for device 0000:00:07.7 disabled
Sep 6 08:11:39 backuppc kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Sep 6 08:11:39 backuppc kernel: drbd0: Writing meta data super block now.
Sep 6 08:11:39 backuppc kernel: drbd0: Creating new current UUID
Sep 6 08:11:39 backuppc kernel: drbd0: Writing meta data super block now.
Sep 6 08:11:39 backuppc kernel: drbd0: meta connection shut down by peer.
Sep 6 08:11:39 backuppc kernel: drbd0: asender terminated
Sep 6 08:11:39 backuppc kernel: drbd0: Terminating asender thread
Sep 6 08:11:39 backuppc kernel: drbd0: tl_clear()
Sep 6 08:11:39 backuppc kernel: drbd0: Connection closed
Sep 6 08:11:39 backuppc kernel: drbd0: conn( TearDown -> Unconnected )
Sep 6 08:11:39 backuppc kernel: drbd0: receiver terminated
Sep 6 08:11:39 backuppc kernel: drbd0: receiver (re)started
Sep 6 08:11:39 backuppc kernel: drbd0: conn( Unconnected -> WFConnection )

Both nodes were in a consistent "UpToDate" state before. The DRBD devices didn't come up properly after the outage. The root cause seems to be the disk driver needed to access the disk inside the VM: the primary booted with an updated kernel, so the SCSI driver was not available. Because of this, DRBD could not access the physical disk (and the secondary was still powered off) and went into diskless mode. Additionally, the network was not available, so it couldn't reach any network resources either. I'm not sure about the disk access, but I am sure about the missing network. See this log excerpt:

Sep 6 08:19:30 backuppc kernel: drbd0: Starting receiver thread (from drbd0_worker [3082])
Sep 6 08:19:30 backuppc kernel: drbd0: receiver (re)started
Sep 6 08:19:30 backuppc kernel: drbd0: conn( Unconnected -> WFConnection )
Sep 6 08:19:30 backuppc kernel: drbd0: Unable to bind source sock (-99)
Sep 6 08:19:30 backuppc last message repeated 2 times
Sep 6 08:19:30 backuppc kernel: drbd0: Unable to bind sock2 (-99)
Sep 6 08:19:30 backuppc kernel: drbd0: conn( WFConnection -> Disconnecting )
Sep 6 08:19:30 backuppc kernel: drbd0: Discarding network configuration.
Sep 6 08:19:30 backuppc kernel: drbd0: tl_clear()
Sep 6 08:19:30 backuppc kernel: drbd0: Connection closed
Sep 6 08:19:30 backuppc kernel: drbd0: conn( Disconnecting -> StandAlone )
Sep 6 08:19:30 backuppc kernel: drbd0: ASSERT( mdev->receiver.t_state == None ) in /home/buildsvn/rpmbuild/BUILD/drbd-8.2.6/_kmod_build_/drbd/drbd_main.c:2412
Sep 6 08:19:30 backuppc kernel: drbd0: drbd_bm_resize called with capacity == 0
Sep 6 08:19:30 backuppc kernel: drbd0: worker terminated
Sep 6 08:19:30 backuppc kernel: drbd0: Terminating worker thread
Sep 6 08:19:30 backuppc kernel: drbd0: receiver terminated
Sep 6 08:19:30 backuppc kernel: drbd0: Terminating receiver thread
Sep 6 08:19:30 backuppc kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
Sep 6 08:19:30 backuppc kernel: drbd0: state = { cs:StandAlone st:Secondary/Unknown ds:Diskless/DUnknown r--- }
Sep 6 08:19:30 backuppc kernel: drbd0: wanted = { cs:StandAlone st:Primary/Unknown ds:Diskless/DUnknown r--- }

This is the result of "cat /proc/drbd" on the primary now, after fixing the network and driver issues:

version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn at c5-i386-build, 2008-10-03 11:42:32
 0: cs:Connected st:Secondary/Secondary ds:Diskless/UpToDate A r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:0

And on the secondary:

version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn at c5-i386-build, 2008-10-03 11:42:32
 0: cs:Connected st:Secondary/Secondary ds:UpToDate/Diskless A r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:0

These are the log entries written at shutdown time on the secondary:

Sep 6 08:10:47 drbd kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Sep 6 08:10:47 drbd kernel: drbd0: short read expecting header on sock: r=-512
Sep 6 08:10:47 drbd kernel: drbd0: asender terminated
Sep 6 08:10:47 drbd kernel: drbd0: Terminating asender thread
Sep 6 08:10:47 drbd kernel: drbd0: Writing meta data super block now.
Sep 6 08:10:48 drbd kernel: drbd0: tl_clear()
Sep 6 08:10:48 drbd kernel: drbd0: Connection closed
Sep 6 08:10:48 drbd kernel: drbd0: conn( Disconnecting -> StandAlone )
Sep 6 08:10:48 drbd kernel: drbd0: receiver terminated
Sep 6 08:10:48 drbd kernel: drbd0: Terminating receiver thread
Sep 6 08:10:48 drbd kernel: drbd0: disk( UpToDate -> Diskless )
Sep 6 08:10:48 drbd kernel: drbd0: drbd_bm_resize called with capacity == 0
Sep 6 08:10:48 drbd kernel: drbd0: worker terminated
Sep 6 08:10:48 drbd kernel: drbd0: Terminating worker thread
Sep 6 08:10:48 drbd kernel: drbd: module cleanup done.

To fix the issue I had to promote the secondary to primary and then demote it to secondary again. After that I was able to promote the original primary node, and everything was up and running again (a rough sketch of the command sequence is at the end of this mail). It still seems a little unreliable to me, even though I can explain and understand it: the primary was possibly in some "split-brain" scenario and discarded its disk, as the disk was (possibly) temporarily inaccessible.

Anyway, just wanted to share.

Christian
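
For reference, the recovery sequence looked roughly like this. This is only a sketch, not the exact commands from my shell history: "r0" stands in for the actual resource name, and the attach/connect steps are my assumption of what was needed once the SCSI driver and the network were back.

  # on the secondary (the node that still had an UpToDate disk):
  drbdadm primary r0      # promote it once ...
  drbdadm secondary r0    # ... and demote it again

  # on the original primary, after fixing the driver and network issues:
  drbdadm attach r0       # re-attach the backing disk (assumed step)
  drbdadm connect r0      # only needed if the resource is still StandAlone
  drbdadm primary r0      # promote the original primary again

  # verify on both nodes:
  cat /proc/drbd          # should end up at cs:Connected ds:UpToDate/UpToDate

Depending on the init scripts, the attach/connect part may already be covered by simply restarting the drbd service, so treat those two lines as an assumption rather than a required step.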