[DRBD-user] DRBD within VM

Pascal BERTON pascal.berton3 at free.fr
Wed Sep 7 10:56:08 CEST 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi Christian!

From what I understand in your mail, the key fact here is that your VM
(the Primary, correct?) booted with an updated kernel. You don't explain
why; was it a deliberate, human-initiated change? I can't really think of
any other explanation...
Well, so the kernel has changed... so the VMware Tools are dead... so the
Flexible or even VMXNET3 (I hope you used the latter, it is much more
efficient) network drivers are dead... so your network no longer works...
Then, since you are virtualizing a storage server, you are probably after
performance, so I assume you also configured your DRBD devices on a
"VMware Paravirtual SCSI" vdisk... which requires a driver that ships with
the VMware Tools... which are dead...
So, according to what you describe, it all sounds fairly logical: no
tools, no network, and no disks...
The first thing to do, then, is to reinstall/recompile the VMware Tools,
which may require the kernel headers for your new kernel (the classic
headache). A rough sketch of what that usually looks like is below.
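
Just as a rough sketch, assuming a CentOS-style guest and that the VMware
Tools installer has already been run once (package names and paths may
differ on your distribution):

  # install the headers/devel package matching the new running kernel
  yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)

  # rebuild the VMware Tools kernel modules (vmxnet3, pvscsi, vmci, ...)
  # against the new kernel, accepting the default answers
  /usr/bin/vmware-config-tools.pl --default

  # bring the network back and check that DRBD can see its backing device again
  service network restart
  cat /proc/drbd
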
As for the fact that you had to promote your secondary to primary and then
demote it again, couldn't that be related to the generation identifiers in
the metadata? Page 103 of the 8.4 manual says that "When a node loses
connection to its peer (either by network failure or manual intervention)
DRBD modifies its local GI...". Which is hopefully what happened on your
secondary, right? Not sure, just an idea... (you can dump the GIs as
sketched below).
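
If you want to compare them, something like this should dump the GIs on
each node (r0 is just a placeholder for your resource name, and I'm
assuming your drbdadm already supports get-gi):

  # print the data generation identifiers (current/bitmap/history UUIDs
  # and flags) for resource r0; run this on both nodes and compare
  drbdadm get-gi r0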

Best regards,

Pascal.

-----Original Message-----
From: drbd-user-bounces at lists.linbit.com
[mailto:drbd-user-bounces at lists.linbit.com] On behalf of Christian Völker
Sent: Wednesday, 7 September 2011 09:00
To: drbd-user at lists.linbit.com
Subject: [DRBD-user] DRBD within VM

Hi all,

I have an issue here to report. I assume this behaviour is intended, but
this time it caused problems even with a clean shutdown.

First, my setup is a single primary (a virtual machine/VM) and a single
secondary (a physical box). Due to a power outage I had to shut down both
boxes. A clean shutdown, no hard power off!
I shut down the secondary first, followed by the primary. This is the
log from the primary:
Sep  6 08:11:20 backuppc shutdown[20850]: shutting down for system halt
Sep  6 08:11:23 backuppc kernel: drbd0: State change failed: Device is
held open by someone
Sep  6 08:11:23 backuppc kernel: drbd0:   state = { cs:Connected
st:Primary/Secondary ds:UpToDate/UpToDate r--- }
Sep  6 08:11:23 backuppc kernel: drbd0:  wanted = { cs:Connected
st:Secondary/Secondary ds:UpToDate/UpToDate r--- }
Sep  6 08:11:26 backuppc kernel: VMware memory control driver unloaded
Sep  6 08:11:26 backuppc kernel: Removing vmci device
Sep  6 08:11:26 backuppc kernel: Resetting vmci device
Sep  6 08:11:26 backuppc kernel: Unregistered vmci device.
Sep  6 08:11:26 backuppc kernel: ACPI: PCI interrupt for device
0000:00:07.7 disabled
Sep  6 08:11:39 backuppc kernel: drbd0: peer( Secondary -> Unknown )
conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Sep  6 08:11:39 backuppc kernel: drbd0: Writing meta data super block now.
Sep  6 08:11:39 backuppc kernel: drbd0: Creating new current UUID
Sep  6 08:11:39 backuppc kernel: drbd0: Writing meta data super block now.
Sep  6 08:11:39 backuppc kernel: drbd0: meta connection shut down by peer.
Sep  6 08:11:39 backuppc kernel: drbd0: asender terminated
Sep  6 08:11:39 backuppc kernel: drbd0: Terminating asender thread
Sep  6 08:11:39 backuppc kernel: drbd0: tl_clear()
Sep  6 08:11:39 backuppc kernel: drbd0: Connection closed
Sep  6 08:11:39 backuppc kernel: drbd0: conn( TearDown -> Unconnected )
Sep  6 08:11:39 backuppc kernel: drbd0: receiver terminated
Sep  6 08:11:39 backuppc kernel: drbd0: receiver (re)started
Sep  6 08:11:39 backuppc kernel: drbd0: conn( Unconnected -> WFConnection )


Both were in a consistent "UpToDate" state before.
The DRBD devices didn't come up properly after the outage.

The root cause seems to be the disk driver needed to access the disk
inside the VM. The primary booted with an updated kernel, so the SCSI
driver was not available. Because of this, DRBD could not access the
physical disk (and the secondary was still powered off) and went into
diskless mode. Additionally, the network was not available, so it couldn't
reach any network resources. I'm not sure about the disk access, but I am
sure about the missing network. See the log below (and the quick state
checks sketched after it):

Sep  6 08:19:30 backuppc kernel: drbd0: Starting receiver thread (from
drbd0_worker [3082])
Sep  6 08:19:30 backuppc kernel: drbd0: receiver (re)started
Sep  6 08:19:30 backuppc kernel: drbd0: conn( Unconnected -> WFConnection )
Sep  6 08:19:30 backuppc kernel: drbd0: Unable to bind source sock (-99)
Sep  6 08:19:30 backuppc last message repeated 2 times
Sep  6 08:19:30 backuppc kernel: drbd0: Unable to bind sock2 (-99)
Sep  6 08:19:30 backuppc kernel: drbd0: conn( WFConnection ->
Disconnecting )
Sep  6 08:19:30 backuppc kernel: drbd0: Discarding network configuration.
Sep  6 08:19:30 backuppc kernel: drbd0: tl_clear()
Sep  6 08:19:30 backuppc kernel: drbd0: Connection closed
Sep  6 08:19:30 backuppc kernel: drbd0: conn( Disconnecting -> StandAlone )
Sep  6 08:19:30 backuppc kernel: drbd0: ASSERT( mdev->receiver.t_state
== None ) in
/home/buildsvn/rpmbuild/BUILD/drbd-8.2.6/_kmod_build_/drbd/drbd_main.c:2412
Sep  6 08:19:30 backuppc kernel: drbd0: drbd_bm_resize called with
capacity == 0
Sep  6 08:19:30 backuppc kernel: drbd0: worker terminated
Sep  6 08:19:30 backuppc kernel: drbd0: Terminating worker thread
Sep  6 08:19:30 backuppc kernel: drbd0: receiver terminated
Sep  6 08:19:30 backuppc kernel: drbd0: Terminating receiver thread
Sep  6 08:19:30 backuppc kernel: drbd0: State change failed: Refusing to
be Primary without at least one UpToDate disk
Sep  6 08:19:30 backuppc kernel: drbd0:   state = { cs:StandAlone
st:Secondary/Unknown ds:Diskless/DUnknown r--- }
Sep  6 08:19:30 backuppc kernel: drbd0:  wanted = { cs:StandAlone
st:Primary/Unknown ds:Diskless/DUnknown r--- }
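
A quick way to confirm both failure modes from the shell (resource name r0
assumed here, substitute your own):

  # connection state: should be Connected, but here it ends up WFConnection/StandAlone
  drbdadm cstate r0

  # disk state: should be UpToDate, but here it is Diskless because the
  # paravirtual SCSI driver (and with it the backing device) was missing
  drbdadm dstate r0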

This is the result of "cat /proc/drbd" on the primary after fixing the
network and driver issues:
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
buildsvn at c5-i386-build, 2008-10-03 11:42:32
 0: cs:Connected st:Secondary/Secondary ds:Diskless/UpToDate A r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:0

And on the secondary:
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
buildsvn at c5-i386-build, 2008-10-03 11:42:32
 0: cs:Connected st:Secondary/Secondary ds:UpToDate/Diskless A r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:0
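
For reference, once the driver is back and the backing device is reachable
again, the usual way out of Diskless is to reattach it (a generic sketch
with r0 as a placeholder; it is not what I actually did, see below):

  # on the diskless node (the VM), reattach the lower-level device
  drbdadm attach r0

  # or simply let drbdadm re-apply the full configuration (attach + connect)
  drbdadm adjust r0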


These are the log entries written at shutdown time on the secondary:
Sep  6 08:10:47 drbd kernel: drbd0: peer( Primary -> Unknown ) conn(
Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Sep  6 08:10:47 drbd kernel: drbd0: short read expecting header on sock:
r=-512
Sep  6 08:10:47 drbd kernel: drbd0: asender terminated
Sep  6 08:10:47 drbd kernel: drbd0: Terminating asender thread
Sep  6 08:10:47 drbd kernel: drbd0: Writing meta data super block now.
Sep  6 08:10:48 drbd kernel: drbd0: tl_clear()
Sep  6 08:10:48 drbd kernel: drbd0: Connection closed
Sep  6 08:10:48 drbd kernel: drbd0: conn( Disconnecting -> StandAlone )
Sep  6 08:10:48 drbd kernel: drbd0: receiver terminated
Sep  6 08:10:48 drbd kernel: drbd0: Terminating receiver thread
Sep  6 08:10:48 drbd kernel: drbd0: disk( UpToDate -> Diskless )
Sep  6 08:10:48 drbd kernel: drbd0: drbd_bm_resize called with capacity == 0
Sep  6 08:10:48 drbd kernel: drbd0: worker terminated
Sep  6 08:10:48 drbd kernel: drbd0: Terminating worker thread
Sep  6 08:10:48 drbd kernel: drbd: module cleanup done.


To fix the issue I had to promote the secondary to primary and then demote
it to secondary again. After that I was able to promote the original
primary, and everything was up and running again (roughly the command
sequence sketched below).
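
In drbdadm terms that was roughly the following (assuming the resource is
called r0; substitute your actual resource name):

  # on the physical box (the old secondary): promote, then demote again
  drbdadm primary r0
  drbdadm secondary r0

  # on the VM (the old primary): promote it back
  drbdadm primary r0

  # verify that both sides agree
  cat /proc/drbd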

It just seems a little unreliable to me, even though I can explain and
understand it: the primary possibly ended up in some "split-brain"-like
scenario and discarded its disk because it was (possibly) temporarily
inaccessible.

Anyways, just wanted to share.

Christian




_______________________________________________
drbd-user mailing list
drbd-user at lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user



