Hi all,

I have two VirtualBox VMs running on two different physical hosts. The VMs are interconnected with two gigabit Ethernet links for DRBD sync and heartbeat.

Suddenly I got this on the master machine:

Feb 9 10:53:24 mail1 kernel: [136200.650336] INFO: task jbd2/drbd0-8:13739 blocked for more than 120 seconds.
Feb 9 10:53:24 mail1 kernel: [136200.650967] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 9 10:53:24 mail1 kernel: [136200.651651] jbd2/drbd0-8 D 0000000000000002 0 13739 2 0x00000000
Feb 9 10:53:24 mail1 kernel: [136200.651660] ffff880030365b30 0000000000000046 0000000000015bc0 0000000000015bc0
Feb 9 10:53:24 mail1 kernel: [136200.651668] ffff88003cddb198 ffff880030365fd8 0000000000015bc0 ffff88003cddade0
Feb 9 10:53:24 mail1 kernel: [136200.651676] 0000000000015bc0 ffff880030365fd8 0000000000015bc0 ffff88003cddb198
Feb 9 10:53:24 mail1 kernel: [136200.651684] Call Trace:
Feb 9 10:53:24 mail1 kernel: [136200.651725] [<ffffffff810f3cd0>] ? sync_page+0x0/0x50
Feb 9 10:53:24 mail1 kernel: [136200.651743] [<ffffffff81559633>] io_schedule+0x73/0xc0
Feb 9 10:53:24 mail1 kernel: [136200.651751] [<ffffffff810f3d0d>] sync_page+0x3d/0x50
Feb 9 10:53:24 mail1 kernel: [136200.651759] [<ffffffff81559c7f>] __wait_on_bit+0x5f/0x90
Feb 9 10:53:24 mail1 kernel: [136200.651766] [<ffffffff810f3ec3>] wait_on_page_bit+0x73/0x80
Feb 9 10:53:24 mail1 kernel: [136200.651775] [<ffffffff81084440>] ? wake_bit_function+0x0/0x40
Feb 9 10:53:24 mail1 kernel: [136200.651790] [<ffffffff810fe305>] ? pagevec_lookup_tag+0x25/0x40
Feb 9 10:53:24 mail1 kernel: [136200.651798] [<ffffffff810f4355>] wait_on_page_writeback_range+0xf5/0x190
Feb 9 10:53:24 mail1 kernel: [136200.651805] [<ffffffff810f441f>] filemap_fdatawait+0x2f/0x40
Feb 9 10:53:24 mail1 kernel: [136200.651814] [<ffffffff8121c6d4>] jbd2_journal_commit_transaction+0x744/0x1280
Feb 9 10:53:24 mail1 kernel: [136200.651822] [<ffffffff81076a59>] ? try_to_del_timer_sync+0x79/0xd0
Feb 9 10:53:24 mail1 kernel: [136200.651831] [<ffffffff8122378d>] kjournald2+0xbd/0x220
Feb 9 10:53:24 mail1 kernel: [136200.651838] [<ffffffff81084400>] ? autoremove_wake_function+0x0/0x40
Feb 9 10:53:24 mail1 kernel: [136200.651846] [<ffffffff812236d0>] ? kjournald2+0x0/0x220
Feb 9 10:53:24 mail1 kernel: [136200.651853] [<ffffffff81084086>] kthread+0x96/0xa0
Feb 9 10:53:24 mail1 kernel: [136200.651861] [<ffffffff810131ea>] child_rip+0xa/0x20
Feb 9 10:53:24 mail1 kernel: [136200.651869] [<ffffffff81083ff0>] ? kthread+0x0/0xa0
Feb 9 10:53:24 mail1 kernel: [136200.651876] [<ffffffff810131e0>] ? child_rip+0x0/0x20

From that moment on, many other blocked-task errors appeared (postfix, pickup and so on). The machine load was more than 25! Obviously I could not use the machine anymore, and I had to kill it in order to force the takeover on the slave. Halt didn't work either.

My question is: why did I get this error? What can I do to avoid it?

Thanks

--
Dario Fiumicello - Antek S.r.l. +3902890380 73