[DRBD-user] Timing issue in drbdadm wait-con-int

Nicolas Michaux nicolas-drbd-user at michaux.homelinux.org
Fri Dec 23 03:41:07 CET 2011


Hi all,

We are using drbd 8.3.11-0.3.1 on a dual Xen install (on SLES11 SP1) and 
we have a strange problem with the init script of drbd.

Each time we restart drbd with the init script, the script never 
finishes and asks us to type "yes" to abort waiting. However, if we check 
the resource states, they are all connected.

If we replace the line "$DRBDADM wait-con-int" with "$DRBDADM wait-connect 
all", everything works as expected and all resources are connected.

I think the problem is related to the number of resources (currently 37). 
Something goes wrong in drbdadm when it launches drbdsetup <id> 
wait-connect 37 times. In the process list, we always have some drbdsetup 
processes hanging and waiting for a connection (although all resources are 
already connected):
root     13960 12062  0 01:19 pts/0    00:00:00 /sbin/drbdadm wait-con-int
root     13966 13960  0 01:19 pts/0    00:00:00 /sbin/drbdsetup 10 wait-connect
root     13972 13960  0 01:19 pts/0    00:00:00 /sbin/drbdsetup 8 wait-connect
root     13976 13960  0 01:19 pts/0    00:00:00 /sbin/drbdsetup 6 wait-connect
root     13977 13960  0 01:19 pts/0    00:00:00 /sbin/drbdsetup 30 wait-connect

I'm sure that resources 6, 8, 10 and 30 are connected. If I kill these 
processes, the init script ends normally and everything works fine. Each 
time we launch the init script, the pending resources have different ids, 
and the number of hanging resources varies from 0 to 5 (as observed so 
far).
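
For anyone who wants to cross-check the same thing on their own system, here is a rough sketch of the comparison I did by hand: list the drbdsetup processes still blocked in wait-connect and look up the connection state of their minor. The function names are made up for illustration, and the parsing assumes the 8.3-style /proc/drbd layout (" 6: cs:Connected ro:...") and a procps ps; treat it as a starting point, not a finished tool.

```shell
#!/bin/sh
# Hypothetical helpers (not part of DRBD) to spot waiters on connected devices.

# Print the minor numbers of drbdsetup processes still in wait-connect.
# Expects one command line per input line, e.g. "/sbin/drbdsetup 10 wait-connect".
waiting_minors() {
    awk '$1 ~ /drbdsetup$/ && $3 == "wait-connect" { print $2 }'
}

# Print the connection state (the cs: field) of the given minor.
# Expects /proc/drbd style text on stdin (assumed 8.3 layout).
minor_state() {
    awk -v m="$1:" '$1 == m { sub(/^cs:/, "", $2); print $2 }'
}

# Report every waiter whose device already shows cs:Connected.
report_stale_waiters() {
    ps -eo args= | waiting_minors | while read -r minor; do
        state=$(minor_state "$minor" < /proc/drbd)
        if [ "$state" = "Connected" ]; then
            echo "minor $minor: still waiting, but cs:$state"
        fi
    done
    return 0
}
```

Calling report_stale_waiters while the init script is stuck should print one line per drbdsetup process that is waiting on an already connected device.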

I added some debug output to drbdadm_main.c, and it seems that adding a 
sleep(1) between process launches improves the situation. Adding a sleep 
is never a real solution, but it shows that there is probably a timing 
issue in the drbdadm calls.

Here is the patch I use:
--- drbdadm_main.c.orig 2011-12-23 03:06:20.000000000 +0100
+++ drbdadm_main.c      2011-12-23 03:06:24.000000000 +0100
@@ -2328,6 +2328,7 @@
                argv[NA(argc)] = 0;
 
                m__system(argv, RETURN_PID, res, &pids[i++], NULL, NULL);
+                sleep(1);
        }
 
        wtime = global_options.dialog_refresh ? : -1;

I tested the init script in a loop and it works better, but it still 
hangs sometimes (not every time as before, but in maybe 1 run out of 5). 
With sleep(10), it ran 78 times in a loop without a problem, but hung on 
the 79th call. With sleep(100), it ran 127 times.
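
The loop test can be reproduced with something like the sketch below. It is only an approximation of what I ran: run_until_hang is a hypothetical helper name, and it assumes the GNU coreutils timeout command is installed (it exits with status 124 when the time limit is reached), which may not be the case on every distribution.

```shell
#!/bin/sh
# Run a command repeatedly and stop at the first run that exceeds the limit.
# Usage: run_until_hang <max_runs> <timeout_seconds> <command> [args...]
run_until_hang() {
    max_runs=$1
    limit=$2
    shift 2
    run=1
    while [ "$run" -le "$max_runs" ]; do
        timeout "$limit" "$@" >/dev/null 2>&1
        # GNU timeout reports an expired time limit with exit status 124.
        if [ $? -eq 124 ]; then
            echo "hung on run $run"
            return 1
        fi
        run=$((run + 1))
    done
    echo "completed $max_runs runs without hanging"
}
```

For example, "run_until_hang 200 120 /etc/init.d/drbd restart" would try up to 200 restarts and report the first one that takes longer than 120 seconds. The init script path and both numbers are assumptions to adapt. Note that timeout only signals the init script itself, so leftover drbdsetup processes may still need to be killed by hand after a hang.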

Could somebody explain this behavior and provide a better patch to solve 
this timing issue?

Best regards,
Nicolas
