[DRBD-user] drbd 8.3.9 hung tasks while syncing

Thu Jun 23 20:40:26 CEST 2011

Hi,

I'm experimenting with DRBD on two OpenStreetMap tile servers (Dell
R210, 16 GB RAM, Ubuntu Natty). There is no cluster manager, just DRBD
on top of md RAID0, to get something that resembles a RAID 0+1 setup.

One server, the DRBD primary, is working fine and is at the moment
happily importing the 16G planet data into PostgreSQL. The secondary
server, however, is not happy. I have lost count of the crashes and
hung tasks it has experienced, but I cannot pin the blame on any
*hardware*: memory is OK, the disks are OK, I have swapped the network
cards, and the contents of the root partition (kernel and userland
software) are the same on both servers.

The server is booted with 'delayacct hpet=disable nohz=off' as kernel
parameters.

This is what happened this afternoon, four minutes after I started syncing:

Jun 23 16:24:27 nadir kernel: [167457.306951] block drbd0: Began resync as SyncTarget (will sync 5328613760 KB [1332153440 bits set]).
Jun 23 16:28:41 nadir kernel: [167710.031697] INFO: task kworker/u:1:14 blocked for more than 120 seconds.
Jun 23 16:28:41 nadir kernel: [167710.038500] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 16:28:41 nadir kernel: [167710.046445] kworker/u:1     D 0000000000000000     0    14      2 0x00000000
Jun 23 16:28:41 nadir kernel: [167710.046449]  ffff88041f995c00 0000000000000046 ffff88041f995fd8 ffff88041f994000
Jun 23 16:28:41 nadir kernel: [167710.046452]  0000000000013d00 ffff88041f933178 ffff88041f995fd8 0000000000013d00
Jun 23 16:28:41 nadir kernel: [167710.046455]  ffffffff81a0b020 ffff88041f932dc0 0000000000000010 ffff88041aafa000
Jun 23 16:28:41 nadir kernel: [167710.046457] Call Trace:
Jun 23 16:28:41 nadir kernel: [167710.046469]  [<ffffffffa0241145>] drbd_req_state+0x165/0x400 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046475]  [<ffffffff81087940>] ? autoremove_wake_function+0x0/0x40
Jun 23 16:28:41 nadir kernel: [167710.046480]  [<ffffffffa0244bf0>] ? drbd_nl_disconnect+0x0/0x190 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046485]  [<ffffffffa0241412>] _drbd_request_state+0x32/0xe0 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046491]  [<ffffffff8105e71a>] ? load_balance+0xca/0x5a0
Jun 23 16:28:41 nadir kernel: [167710.046495]  [<ffffffff8108e40d>] ? sched_clock_cpu+0xbd/0x110
Jun 23 16:28:41 nadir kernel: [167710.046500]  [<ffffffffa0244bf0>] ? drbd_nl_disconnect+0x0/0x190 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046506]  [<ffffffffa0244c1e>] drbd_nl_disconnect+0x2e/0x190 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046511]  [<ffffffffa024ab16>] drbd_connector_callback+0x116/0x600 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046516]  [<ffffffff813b57e0>] ? cn_queue_wrapper+0x0/0x50
Jun 23 16:28:41 nadir kernel: [167710.046518]  [<ffffffff813b5808>] cn_queue_wrapper+0x28/0x50
Jun 23 16:28:41 nadir kernel: [167710.046522]  [<ffffffff8108224d>] process_one_work+0x11d/0x420
Jun 23 16:28:41 nadir kernel: [167710.046526]  [<ffffffff81082ce9>] worker_thread+0x169/0x360
Jun 23 16:28:41 nadir kernel: [167710.046529]  [<ffffffff81082b80>] ? worker_thread+0x0/0x360
Jun 23 16:28:41 nadir kernel: [167710.046531]  [<ffffffff810871f6>] kthread+0x96/0xa0
Jun 23 16:28:41 nadir kernel: [167710.046535]  [<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
Jun 23 16:28:41 nadir kernel: [167710.046538]  [<ffffffff81087160>] ? kthread+0x0/0xa0
Jun 23 16:28:41 nadir kernel: [167710.046540]  [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10
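
To put the numbers in the first log lines in context, here is a small
Python sketch that only does arithmetic on values copied from the log.
The variable names are mine; nothing here comes from DRBD itself.

```python
# Figures copied verbatim from the kernel log above.
resync_kb = 5328613760      # "will sync 5328613760 KB"
bits_set = 1332153440       # "[1332153440 bits set]"
t_resync = 167457.306951    # kernel timestamp of "Began resync"
t_hung = 167710.031697      # kernel timestamp of the hung-task report
hung_threshold = 120        # "blocked for more than 120 seconds"

# How much data each set bit in the resync bitmap stands for.
kb_per_bit = resync_kb // bits_set
assert resync_kb % bits_set == 0  # the division is exact

# Time between the start of the resync and the hung-task report.
elapsed = t_hung - t_resync

# The task had already been blocked for at least 120 s when the report
# was printed, so it must have blocked no later than this many seconds
# after the resync started.
latest_block_start = elapsed - hung_threshold

print(f"{kb_per_bit} KB per bitmap bit")
print(f"{resync_kb / 1024**3:.2f} TiB to sync")
print(f"report came {elapsed:.1f} s after resync start")
print(f"task blocked within {latest_block_start:.1f} s of resync start")
```

So the worker must have become stuck a little over two minutes into
the resync at the latest, well before the report showed up in the log.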

Does anyone recognise this stack trace? What could be going on?
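
To make the trace easier to read, here is a rough Python sketch that
keeps only the frames the kernel did not mark with '?' (those marked
frames are addresses found on the stack that the unwinder could not
confirm). The regular expression and the function name are my own and
only assume the line format shown above.

```python
import re

# Call-trace lines from the log above, syslog prefix removed and
# wrapped lines joined.
TRACE = """\
[<ffffffffa0241145>] drbd_req_state+0x165/0x400 [drbd]
[<ffffffff81087940>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa0244bf0>] ? drbd_nl_disconnect+0x0/0x190 [drbd]
[<ffffffffa0241412>] _drbd_request_state+0x32/0xe0 [drbd]
[<ffffffff8105e71a>] ? load_balance+0xca/0x5a0
[<ffffffff8108e40d>] ? sched_clock_cpu+0xbd/0x110
[<ffffffffa0244bf0>] ? drbd_nl_disconnect+0x0/0x190 [drbd]
[<ffffffffa0244c1e>] drbd_nl_disconnect+0x2e/0x190 [drbd]
[<ffffffffa024ab16>] drbd_connector_callback+0x116/0x600 [drbd]
[<ffffffff813b57e0>] ? cn_queue_wrapper+0x0/0x50
[<ffffffff813b5808>] cn_queue_wrapper+0x28/0x50
[<ffffffff8108224d>] process_one_work+0x11d/0x420
[<ffffffff81082ce9>] worker_thread+0x169/0x360
[<ffffffff81082b80>] ? worker_thread+0x0/0x360
[<ffffffff810871f6>] kthread+0x96/0xa0
[<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
[<ffffffff81087160>] ? kthread+0x0/0xa0
[<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10
"""

# address, optional "?", symbol+offset/size, optional [module]
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)

def reliable_frames(trace):
    """Return (symbol, module) for every frame not marked with '?'."""
    frames = []
    for line in trace.splitlines():
        m = FRAME.search(line)
        if m and not m.group(1):
            frames.append((m.group(2), m.group(3) or "kernel"))
    return frames

# Innermost frame first, as in the kernel's own output.
for symbol, module in reliable_frames(TRACE):
    print(f"{module:8} {symbol}")
```

Read from the innermost frame outwards, the unmarked frames go
drbd_req_state, _drbd_request_state, drbd_nl_disconnect,
drbd_connector_callback, which at least suggests the worker was
processing a disconnect request when it blocked.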

Second question: is it normal for 'drbdadm' to time out, or could that
be related? It happens on both the primary and the secondary, depending
on the change I made in the config.

root@nadir:/etc/drbd.d# drbdadm disconnect r0
Command 'drbdsetup 0 disconnect' did not terminate within 5 seconds

Thanks for any input or pointers,
Maarten.