[DRBD-user] Timing issue in drbdadm wait-con-int

Nicolas Michaux nicolas-drbd-user at michaux.homelinux.org
Fri Dec 23 03:41:07 CET 2011

Hi all,

We are using drbd 8.3.11-0.3.1 on a dual Xen install (on SLES11 SP1) and 
we have a strange problem with the init script of drbd.

Each time we restart drbd with the init script, the script never 
finishes and asks us to type "yes" to abort waiting. However, when we 
check the resource states, they are all connected.

If we replace the line "$DRBDADM wait-con-int" with "$DRBDADM wait-connect 
all", everything works as expected and all resources are connected.

I think the problem is related to the number of resources (currently 37). 
Something goes wrong in drbdadm when it launches drbdsetup <id> 
wait-connect 37 times. In the process list, there are always some 
drbdsetup processes hanging and waiting for a connection (although all 
resources are already connected):
root     13960 12062  0 01:19 pts/0    00:00:00 /sbin/drbdadm wait-con-
root     13966 13960  0 01:19 pts/0    00:00:00 /sbin/drbdsetup 10 wait-
root     13972 13960  0 01:19 pts/0    00:00:00 /sbin/drbdsetup 8 wait-
root     13976 13960  0 01:19 pts/0    00:00:00 /sbin/drbdsetup 6 wait-
root     13977 13960  0 01:19 pts/0    00:00:00 /sbin/drbdsetup 30 wait-

I'm sure that resources 6, 8, 10 and 30 are properly connected. If I kill 
these processes, the init script ends normally and everything works 
fine. Each time we launch the init script, the hanging resources have 
different ids, and the number of hanging processes varies from 0 to 5 
(as observed so far).
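
To spot such leftover waiters quickly, a small shell filter can pull the minor numbers out of a "ps -ef" listing like the one above. This is only a sketch: the function name list_waiting_minors is made up for illustration, and it assumes the column layout shown above (command path in the 8th field, minor number in the 9th, sub-command in the 10th).

```sh
# Hypothetical helper, not part of DRBD: read `ps -ef`-style lines on stdin
# and print the minor number of every drbdsetup process whose sub-command
# starts with "wait-".
# Assumes the command path is field 8, the minor field 9, the sub-command
# field 10, as in the listing above.
list_waiting_minors() {
    awk '$8 ~ /drbdsetup$/ && $10 ~ /^wait-/ { print $9 }'
}
```

Running "ps -ef | list_waiting_minors" prints one minor per line, which can then be compared with the connection states reported in /proc/drbd.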

I added some debugging to drbdadm_main.c, and it seems that adding a 
sleep(1) between process launches improves the situation. Adding a sleep 
is never a solution, but it shows that there is probably a timing issue 
in the drbdadm calls.

Here is the patch I use:
--- drbdadm_main.c.orig 2011-12-23 03:06:20.000000000 +0100
+++ drbdadm_main.c      2011-12-23 03:06:24.000000000 +0100
@@ -2328,6 +2328,7 @@
                argv[NA(argc)] = 0;
                m__system(argv, RETURN_PID, res, &pids[i++], NULL, 
+                sleep(1);
        wtime = global_options.dialog_refresh ? : -1;

I tested the init script in a loop and it works better, but still hangs 
sometimes (not every time as before, but maybe 1 run in 5). With a 
sleep(10), I got it running 78 times in a loop without a problem, but it 
hung on the 79th call. With a sleep(100), I got it running 127 
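
A loop test of this kind could look roughly like the sketch below. It is not the exact loop used for the numbers above: the function name run_until_hang is hypothetical, and it assumes the "timeout" utility from GNU coreutils is installed, relying on its exit status 124 when the time limit is hit.

```sh
# Hypothetical test harness: run a command repeatedly under a time limit
# and stop at the first run that exceeds it.
# usage: run_until_hang ITERATIONS LIMIT_SECONDS COMMAND [ARGS...]
# Assumes GNU coreutils `timeout`, which exits with 124 on a timeout.
run_until_hang() {
    iterations=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$iterations" ]; do
        timeout "$limit" "$@"
        status=$?
        if [ "$status" -eq 124 ]; then
            # The command was still running when the limit expired.
            echo "run $i exceeded ${limit}s"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $iterations runs finished"
    return 0
}
```

For example, "run_until_hang 100 120 /etc/init.d/drbd restart" would report the first of 100 restarts that takes longer than two minutes; the init script path and the limit are assumptions and need adjusting to the local setup.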

Could somebody explain this behavior and provide a better patch to solve 
this timing issue?

Best regards,

