[DRBD-user] Hitting bugz 258?

Felix Frank ff at mpexnet.de
Thu Dec 16 18:15:46 CET 2010

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi,

in the following setup
 * node A Kernel 2.6.24-21-xen, DRBD 8.3.1
 * node B Kernel 2.6.27.48-xen, DRBD 8.3.1
this is my scenario:
The peers are connected over a WAN and share 7 DRBD resources. Usually the
ones on node B run StandAlone, always Secondary, for disaster recovery
purposes. Node A is always in WFConnection.
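
For reference, the states can be checked on either node like this (a minimal
sketch; the resource names are placeholders, not my actual config):

  # Per-resource connection state:
  for res in r11 r12 r13 r14 r15 r16 r17; do
      echo -n "$res: "; drbdadm cstate $res
  done

  # Or everything at once:
  cat /proc/drbd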

For a sync operation today, drbdadm connect was successfully called for
each resource on B.
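
Concretely, that step was along these lines (a sketch with placeholder
resource names):

  # On node B, re-establish each replication link:
  for res in r11 r12 r13 r14 r15 r16 r17; do
      drbdadm connect $res
  done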

The syncing started well enough, but then I wanted to interrupt it and
issued a "drbdadm disconnect all" on B. This did not succeed; it threw
an error, I believe about drbdsetup not finishing in time.
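
For the record, roughly what I ran, plus a per-resource variant (a sketch;
drbdadm hands the actual work to drbdsetup, which is why the error
mentioned the latter):

  # What I issued on B:
  drbdadm disconnect all

  # Per-resource variant, in case 'all' stalls on one hung device
  # (placeholder resource names):
  for res in r11 r12 r13 r14 r15 r16 r17; do
      drbdadm disconnect $res
  done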

After that, no more interaction with DRBD was possible, and one of the
resources logged a variation on

drbd11: [drbd11_worker/26923] sock_sendmsg time expired, ko = 4294966778

once every 6 seconds. That resource was still reported as SyncTarget,
but the transfer was stalled. Most of the other resources finished the
resync successfully.
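
Side note on that ko value: 4294966778 is 2^32 - 518, so it looks like an
unsigned counter decremented past zero. If I understand it right, this
counter relates to the ko-count setting in the net section (illustrative
excerpt, not my actual config):

  resource r11 {
    net {
      ko-count 4;  # give up on the peer after 4 timed-out send attempts
    }
  }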

The following tasks were being reported as hung:
INFO: task drbd15_receiver:27897 blocked for more than 120 seconds.
INFO: task drbd14_worker:2670 blocked for more than 120 seconds.
INFO: task cqueue:2586 blocked for more than 120 seconds.
INFO: task drbd16_worker:2629 blocked for more than 120 seconds.

On A, drbd11 reported no errors, but drbd15 did (errors like the one
noted above). Nevertheless, drbd15 finished the resync and stopped
complaining. (The errors in the log stop 40 seconds short of the resync
finishing.)

Still, some kernel threads (pdflush?) remained in D state, and Xen on A
became unresponsive (the DRBDs are phy disks for Xen guests).
Consequently, I had no means to bring the DRBD devices down and finally
just rebooted A.

It was at this point that an rmmod -f drbd on B, which had been blocked
in D state until then, finally finished. (Afterwards I realized that
blocking DRBD traffic at the firewall might have been a less destructive
solution; see the sketch below.)
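
Something like this is what I have in mind for next time (a sketch only;
the port range assumes the resources listen on the conventional 7788-7794,
and PEER_IP stands for node A's replication address):

  # Temporarily cut DRBD replication traffic instead of rmmod -f:
  PEER_IP=192.0.2.1   # placeholder address
  iptables -I INPUT  -p tcp -s $PEER_IP --dport 7788:7794 -j DROP
  iptables -I OUTPUT -p tcp -d $PEER_IP --dport 7788:7794 -j DROP

  # Undo once things settle:
  iptables -D INPUT  -p tcp -s $PEER_IP --dport 7788:7794 -j DROP
  iptables -D OUTPUT -p tcp -d $PEER_IP --dport 7788:7794 -j DROP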

I noticed a deadlock fix in 8.3.5 and was wondering whether this could
be it?

Thanks in advance,
Felix


