Hi all,

We are using drbd 8.3.11-0.3.1 on a dual Xen install (on SLES11 SP1) and we have a strange problem with the drbd init script. Each time we restart drbd with the init script, the script never finishes and asks us to type "yes" to abort waiting. However, if we check the resource states, they are all connected.

If we replace the line "$DRBDADM wait-con-int" with "$DRBDADM wait-connect all", everything works as expected and all resources are connected.

I think the problem is related to the number of resources (currently 37). Something goes wrong in drbdadm when it launches drbdsetup <id> wait-connect 37 times. In the process list, we always have some drbdsetup processes hanging and waiting for a connection (although all resources are already connected):

    root 13960 12062 0 01:19 pts/0 00:00:00 /sbin/drbdadm wait-con-int
    root 13966 13960 0 01:19 pts/0 00:00:00 /sbin/drbdsetup 10 wait-connect
    root 13972 13960 0 01:19 pts/0 00:00:00 /sbin/drbdsetup 8 wait-connect
    root 13976 13960 0 01:19 pts/0 00:00:00 /sbin/drbdsetup 6 wait-connect
    root 13977 13960 0 01:19 pts/0 00:00:00 /sbin/drbdsetup 30 wait-connect

I'm sure that resources 6, 8, 10 and 30 are properly connected. If I kill these processes, the init script ends normally and everything works fine.

Each time we launch the init script, the pending resources have different ids, and the number of hanging resources varies from 0 to 5 (as observed so far).

I added some debug output to drbdadm_main.c, and it seems that adding a sleep(1) between each process launch improves the situation. Adding a sleep is never a solution, but it shows that there is probably a timing issue in the drbdadm calls.

Here is the patch I use:

--- drbdadm_main.c.orig 2011-12-23 03:06:20.000000000 +0100
+++ drbdadm_main.c 2011-12-23 03:06:24.000000000 +0100
@@ -2328,6 +2328,7 @@
         argv[NA(argc)] = 0;
         m__system(argv, RETURN_PID, res, &pids[i++], NULL, NULL);
+        sleep(1);
     }
     wtime = global_options.dialog_refresh ? : -1;

I tested the init script in a loop and it works better, but it still hangs sometimes (not every time as before, but maybe 1 run in 5). With a sleep(10), I got it running 78 times in a loop without a problem, but it hung on the 79th call. With a sleep(100), I got it running 127 times.

Could somebody explain this behavior and provide a better patch to solve this timing issue?

Best regards,
Nicolas
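
For anyone who wants to reproduce the pattern outside drbdadm, the launch-then-wait behaviour described above (start one waiter per resource without pausing, then wait for all of them) can be sketched as a small harness. This is only an illustration under stated assumptions: it is written in Python rather than drbdadm's C, the child is a trivial stand-in for "drbdsetup <id> wait-connect", and every function name is hypothetical. It shows how to detect children that outlive a deadline; it does not explain or fix the hang.

```python
import subprocess
import sys
import time


def launch_waiters(commands):
    """Start one child per command without pausing between launches.

    Mirrors the loop the patch touches (hypothetical helper, not drbdadm code).
    """
    return [subprocess.Popen(cmd) for cmd in commands]


def reap_with_deadline(children, deadline_s, poll_interval_s=0.05):
    """Poll the children until all have exited or the deadline passes.

    Returns the list of children that are still running at the end.
    """
    end = time.monotonic() + deadline_s
    pending = list(children)
    while pending and time.monotonic() < end:
        # poll() returns None while the child is still running
        pending = [c for c in pending if c.poll() is None]
        if pending:
            time.sleep(poll_interval_s)
    # Final check so children that exited during the last sleep are not reported
    return [c for c in pending if c.poll() is None]


def run_once(n_children=37, deadline_s=10.0):
    """Launch n_children stand-in waiters; return the PIDs that got stuck."""
    # Assumption: a child that exits at once stands in for
    # "drbdsetup <id> wait-connect" on an already connected resource.
    cmd = [sys.executable, "-c", "pass"]
    children = launch_waiters([cmd] * n_children)
    stuck = reap_with_deadline(children, deadline_s)
    stuck_pids = [c.pid for c in stuck]
    for c in stuck:
        # Same manual workaround as in the report: kill the leftover waiter
        c.kill()
        c.wait()
    return stuck_pids


if __name__ == "__main__":
    print("stuck PIDs:", run_once())
```

Replacing the stand-in command with the real waiter and calling run_once() in a loop would give a count of how often, and for which children, the wait outlives the deadline, without needing sleep() calls between launches.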