[DRBD-user] HVM with block-drbd, possible solution

Fri Apr 30 00:08:40 CEST 2010

Hi, first of all many thanks to everybody (especially Linbit) for the 
excellent work done on DRBD.

I'm currently involved in setting up a Pacemaker / DRBD cluster to serve 
a bunch of Xen VMs.

My current configuration is:
- 2 identical Dell R410 servers with 4-core Xeon and 8 GB RAM
- 4 x 1 TB SATA disks connected to Dell's PERC 6i battery-backed RAID 
controllers, configured as RAID10 on each server
- Dom0 Slackware 13.0 x86_64 with Xen 4.0 compiled from source 
(2.6.31.13 Xenified kernel)
- OpenAIS 1.1.2 + Pacemaker 1.0.8 compiled from source
- DRBD 8.3.7 compiled from source

The configuration files for drbd and the DomUs are local to each host 
(replicated by hand).
The only shared data on drbd are the guests' block devices: one drbd 
resource per guest, built on two identically sized LVM logical volumes. 
Xen uses /dev/drbdX as the guest's block device.

The whole thing is working flawlessly and seems fast and stable too.

I have a question regarding HVMs (Windows guests) and Primary/Primary 
DRBD for live migration.
The DRBD docs state that I can't use the block-drbd helper script with 
HVMs (and in a few other cases).
I'm relatively new to setting up a Pacemaker cluster, so the question is:

How can I make pacemaker ocf RA take care of:
1) promoting drbd on the target node
2) live migrating the HVM
3) demoting drbd on the source node
(Which is basically what the block-drbd helper is supposed to do)

Unless I have missed something, setting master-max=1 on the 
ocf:linbit:drbd resource doesn't allow the drbd resource to be in 
Primary/Primary even during the (short) time of a Xen live migration. 
On the other hand, setting master-max=2 causes drbd to run constantly 
in Primary/Primary mode, posing some data corruption risks (I don't 
want to use a clustered filesystem, because I want to store the DomUs' 
filesystems on a physical device for best performance).

In short, I would like to keep each guest's drbd resource Primary only 
on the active host (where the corresponding DomU has been started), set 
it to Primary/Primary only while migrating the DomU to the other host, 
and then quickly demote the resource on the source host once the 
migration is complete, for safety reasons.

This apparently can't be done without some form of "cooperation" between 
the ocf:linbit:drbd and ocf:heartbeat:Xen resource agents... which is 
exactly what I'm looking for but apparently cannot find in pacemaker's docs.

Please tell me if I am missing something crucial about OCF RAs and 
their usage in this situation.

Now the (apparently) good news...

The approach taken by block-drbd seemed more logical to me: a single 
script managing and coordinating the whole transition (DomU migration + 
drbd promotion/demotion).

Digging around I've found this patch:
http://post.gmane.org/post.php?group=gmane.comp.emulators.xen.devel&followup=80598

Many thanks and full credit to the author, James Harper.

So I investigated a little further and got to the point where I can 
make a qemu-dm (patched as described) recognize drbd resources 
(specified as drbd:resource in the Xen DomU cfg file) and map them to 
the correct /dev/drbd/by-res/xxx node for the HVM guests. Sadly, this 
solved only part of the problem.
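
To illustrate the mapping only, here is a minimal standalone sketch 
that turns a "drbd:resource" specification into the matching 
/dev/drbd/by-res/ path. The helper name drbd_spec_to_path and its 
return convention are my own assumptions for illustration; this is not 
how qemu-dm itself is structured, since there the driver name and the 
parameters arrive already split.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "/dev/drbd/by-res/<res>" into out for a
 * spec of the form "drbd:<res>".
 * Returns 0 on success, -1 if the spec does not start with "drbd:",
 * names no resource, or the result does not fit in out. */
int drbd_spec_to_path(const char *spec, char *out, size_t outlen)
{
    static const char prefix[] = "drbd:";

    assert(spec != NULL && out != NULL);

    /* Only handle specs that use the drbd: prefix. */
    if (strncmp(spec, prefix, sizeof prefix - 1) != 0)
        return -1;

    const char *res = spec + sizeof prefix - 1;
    if (*res == '\0')
        return -1;

    /* snprintf never writes past outlen; a return value >= outlen
     * means the path was truncated, which is treated as an error. */
    int n = snprintf(out, outlen, "/dev/drbd/by-res/%s", res);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Unlike a bare sprintf into a hand-sized buffer, the bounded call makes 
an oversized resource name an explicit error instead of an overflow.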

Starting from a Primary/Secondary state, launching the HVM (xm create) 
on the "Primary" drbd host works perfectly.
After this I can live migrate the HVM to the other host and get the 
expected sequence of promotions/demotions from the block-drbd script, 
leaving the system in the expected state.

Starting from a Secondary/Secondary drbd state (DomU stopped on each 
host), when I "xm create" an HVM DomU, qemu-dm is fired before the 
block-drbd helper is launched. The HVM correctly maps the device, but 
drbd has not yet been promoted to Primary and the DomU is immediately 
turned off.

BTW: Can someone explain this difference?
Why is the block-drbd script called BEFORE a live migration starts 
(making the destination host Primary before the migration is 
attempted), but only AFTER qemu-dm has already tried to map the vbd 
when a DomU is created?

Looking at the state transitions in /proc/drbd during an "xm create" of 
my HVM DomU, I saw that the Secondary->Primary transition does happen, 
followed by the inverse transition just after qemu-dm "finds" the 
resource in an unusable state and aborts the creation of the DomU. So 
it seems to be only a timing problem!
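
For anyone who wants to watch the same transition from a program 
rather than by eye, here is a small sketch that pulls the local role 
out of one /proc/drbd status line. It assumes the DRBD 8.3 layout in 
which each device line carries a field of the form 
"ro:<local>/<peer>"; the function name drbd_local_role is made up for 
this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the local role (the text between " ro:"
 * and the following '/') from one /proc/drbd status line into out.
 * Assumes a line such as
 *   " 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----"
 * Returns 0 on success, -1 if the line has no " ro:" field or the
 * role does not fit in out. */
int drbd_local_role(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL);

    /* Match with the leading space so that other fields ending in
     * "ro:" are not picked up by accident. */
    const char *p = strstr(line, " ro:");
    if (p == NULL)
        return -1;
    p += 4;

    /* The local role ends at the '/' that separates it from the peer. */
    size_t n = strcspn(p, "/ \n");
    if (n == 0 || n >= outlen)
        return -1;

    memcpy(out, p, n);
    out[n] = '\0';
    return 0;
}
```

Feeding it each line read from /proc/drbd and comparing the result 
with "Primary" would show the moment the promotion takes effect.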

qemu-dm is too fast (it is even started a bit earlier) and checks the 
vbd BEFORE the block-drbd script can promote the resource to Primary. 
The logical (but badly hackish) solution for me was to insert a delay 
in the qemu-dm process when the resource type is "drbd:".

So, the final state is:
- I can create HVM guests using the "drbd:res" syntax in configuration 
files, letting block-drbd take care of the drbd transitions
- I can migrate / live migrate HVM (Windows) guests, with block-drbd 
doing its job
- my solution is a bad hack (at least for the creation of an HVM) based 
on a delay inserted in qemu-dm to wait for block-drbd to run.

The complete patch to the Xen 4.0 source (AGAIN, THANKS TO James Harper) is:

--- xenstore.c.orig     2010-04-29 23:23:45.720258686 +0200
+++ xenstore.c  2010-04-29 22:52:43.897264812 +0200
@@ -513,6 +513,15 @@
              params = newparams;
             format = &bdrv_raw;
          }
+       /* handle drbd mapping */
+       if (!strcmp(drv, "drbd")) {
+           char *newparams = malloc(17 + strlen(params) + 1);
+           sprintf(newparams, "/dev/drbd/by-res/%s", params);
+           free(params);
+           sleep(5);
+           params = newparams;
+           format = &bdrv_raw;
+       }

  #if 0
         /* Phantom VBDs are disabled because the use of paths


I've only added the sleep(5); statement to make qemu-dm relax a bit and 
wait for block-drbd to be called.

Please send your comments and ideas to stabilize and improve the patch, 
making it less hackish (at least in my little addition) and possibly 
suitable for production use (probably by finding a reliable way to 
"wait" for the drbd state change in qemu-dm).
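
One possible direction, as a rough sketch only: instead of a fixed 
sleep(5), retry opening the device at short intervals until it 
succeeds or a timeout expires. This assumes that opening the drbd 
device read-write fails for as long as the resource is still 
Secondary. The function name wait_for_open and its parameters are 
hypothetical, and I have not tried this inside qemu-dm.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: try to open path read-write every interval_ms
 * milliseconds until the open succeeds or about timeout_ms has passed.
 * Returns an open file descriptor on success, -1 on timeout. */
int wait_for_open(const char *path, int timeout_ms, int interval_ms)
{
    assert(path != NULL && interval_ms > 0);

    struct timespec pause = {
        .tv_sec = interval_ms / 1000,
        .tv_nsec = (long)(interval_ms % 1000) * 1000000L
    };

    for (int waited = 0; ; waited += interval_ms) {
        int fd = open(path, O_RDWR);
        if (fd >= 0)
            return fd;      /* device is usable now */
        if (waited >= timeout_ms)
            return -1;      /* give up, caller reports the failure */
        nanosleep(&pause, NULL);
    }
}
```

A call such as wait_for_open("/dev/drbd/by-res/xxx", 5000, 100) would 
then wait at most about as long as the current sleep(5), but return as 
soon as the promotion has happened. The caller closes the descriptor 
once it no longer needs it.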


Sauro Saltini.