Hi,

I'm experimenting with drbd for two openstreetmap tile servers (Dell R210, 16G mem, Ubuntu natty). No cluster manager, just drbd over md raid0 to have something that resembles a raid0+1 setup.

One server, the drbd primary, is working ok and is at the moment happily importing the 16G planet data into postgresql.

However, the secondary server is not happy. I cannot remember how many crashes and hung tasks the second server has experienced, but I cannot seem to blame any *hardware* as the culprit (mem is ok, disks are ok, swapped network cards, contents of the root partition (kernel & userland software) are the same on both servers).

The server is started with 'delayacct hpet=disable nohz=off' as parameters.

This is what happened this afternoon, four minutes after I started syncing:

Jun 23 16:24:27 nadir kernel: [167457.306951] block drbd0: Began resync as SyncTarget (will sync 5328613760 KB [1332153440 bits set]).
Jun 23 16:28:41 nadir kernel: [167710.031697] INFO: task kworker/u:1:14 blocked for more than 120 seconds.
Jun 23 16:28:41 nadir kernel: [167710.038500] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 16:28:41 nadir kernel: [167710.046445] kworker/u:1 D 0000000000000000 0 14 2 0x00000000
Jun 23 16:28:41 nadir kernel: [167710.046449] ffff88041f995c00 0000000000000046 ffff88041f995fd8 ffff88041f994000
Jun 23 16:28:41 nadir kernel: [167710.046452] 0000000000013d00 ffff88041f933178 ffff88041f995fd8 0000000000013d00
Jun 23 16:28:41 nadir kernel: [167710.046455] ffffffff81a0b020 ffff88041f932dc0 0000000000000010 ffff88041aafa000
Jun 23 16:28:41 nadir kernel: [167710.046457] Call Trace:
Jun 23 16:28:41 nadir kernel: [167710.046469] [<ffffffffa0241145>] drbd_req_state+0x165/0x400 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046475] [<ffffffff81087940>] ? autoremove_wake_function+0x0/0x40
Jun 23 16:28:41 nadir kernel: [167710.046480] [<ffffffffa0244bf0>] ? drbd_nl_disconnect+0x0/0x190 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046485] [<ffffffffa0241412>] _drbd_request_state+0x32/0xe0 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046491] [<ffffffff8105e71a>] ? load_balance+0xca/0x5a0
Jun 23 16:28:41 nadir kernel: [167710.046495] [<ffffffff8108e40d>] ? sched_clock_cpu+0xbd/0x110
Jun 23 16:28:41 nadir kernel: [167710.046500] [<ffffffffa0244bf0>] ? drbd_nl_disconnect+0x0/0x190 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046506] [<ffffffffa0244c1e>] drbd_nl_disconnect+0x2e/0x190 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046511] [<ffffffffa024ab16>] drbd_connector_callback+0x116/0x600 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046516] [<ffffffff813b57e0>] ? cn_queue_wrapper+0x0/0x50
Jun 23 16:28:41 nadir kernel: [167710.046518] [<ffffffff813b5808>] cn_queue_wrapper+0x28/0x50
Jun 23 16:28:41 nadir kernel: [167710.046522] [<ffffffff8108224d>] process_one_work+0x11d/0x420
Jun 23 16:28:41 nadir kernel: [167710.046526] [<ffffffff81082ce9>] worker_thread+0x169/0x360
Jun 23 16:28:41 nadir kernel: [167710.046529] [<ffffffff81082b80>] ? worker_thread+0x0/0x360
Jun 23 16:28:41 nadir kernel: [167710.046531] [<ffffffff810871f6>] kthread+0x96/0xa0
Jun 23 16:28:41 nadir kernel: [167710.046535] [<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
Jun 23 16:28:41 nadir kernel: [167710.046538] [<ffffffff81087160>] ? kthread+0x0/0xa0
Jun 23 16:28:41 nadir kernel: [167710.046540] [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10

Does anyone recognise this stack trace? What could be going on?

Second question: is it normal for 'drbdadm' to time out, or could it be related? This happens on both the primary and the secondary, depending on the change I made in the config.

root@nadir:/etc/drbd.d# drbdadm disconnect r0
Command 'drbdsetup 0 disconnect' did not terminate within 5 seconds

Thanks for any input or pointers,

Maarten.