[DRBD-user] jbd2/drbd0 blocked for more than 120 seconds

Wed Feb 9 13:30:43 CET 2011

Hi all, I have two VirtualBox VMs running on two different physical 
hosts. The VMs are interconnected by two gigabit Ethernet links for DRBD 
sync and heartbeat.

Suddenly I got this on the master machine:

Feb  9 10:53:24 mail1 kernel: [136200.650336] INFO: task jbd2/drbd0-8:13739 blocked for more than 120 seconds.
Feb  9 10:53:24 mail1 kernel: [136200.650967] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  9 10:53:24 mail1 kernel: [136200.651651] jbd2/drbd0-8  D 0000000000000002     0 13739      2 0x00000000
Feb  9 10:53:24 mail1 kernel: [136200.651660]  ffff880030365b30 0000000000000046 0000000000015bc0 0000000000015bc0
Feb  9 10:53:24 mail1 kernel: [136200.651668]  ffff88003cddb198 ffff880030365fd8 0000000000015bc0 ffff88003cddade0
Feb  9 10:53:24 mail1 kernel: [136200.651676]  0000000000015bc0 ffff880030365fd8 0000000000015bc0 ffff88003cddb198
Feb  9 10:53:24 mail1 kernel: [136200.651684] Call Trace:
Feb  9 10:53:24 mail1 kernel: [136200.651725]  [<ffffffff810f3cd0>] ? sync_page+0x0/0x50
Feb  9 10:53:24 mail1 kernel: [136200.651743]  [<ffffffff81559633>] io_schedule+0x73/0xc0
Feb  9 10:53:24 mail1 kernel: [136200.651751]  [<ffffffff810f3d0d>] sync_page+0x3d/0x50
Feb  9 10:53:24 mail1 kernel: [136200.651759]  [<ffffffff81559c7f>] __wait_on_bit+0x5f/0x90
Feb  9 10:53:24 mail1 kernel: [136200.651766]  [<ffffffff810f3ec3>] wait_on_page_bit+0x73/0x80
Feb  9 10:53:24 mail1 kernel: [136200.651775]  [<ffffffff81084440>] ? wake_bit_function+0x0/0x40
Feb  9 10:53:24 mail1 kernel: [136200.651790]  [<ffffffff810fe305>] ? pagevec_lookup_tag+0x25/0x40
Feb  9 10:53:24 mail1 kernel: [136200.651798]  [<ffffffff810f4355>] wait_on_page_writeback_range+0xf5/0x190
Feb  9 10:53:24 mail1 kernel: [136200.651805]  [<ffffffff810f441f>] filemap_fdatawait+0x2f/0x40
Feb  9 10:53:24 mail1 kernel: [136200.651814]  [<ffffffff8121c6d4>] jbd2_journal_commit_transaction+0x744/0x1280
Feb  9 10:53:24 mail1 kernel: [136200.651822]  [<ffffffff81076a59>] ? try_to_del_timer_sync+0x79/0xd0
Feb  9 10:53:24 mail1 kernel: [136200.651831]  [<ffffffff8122378d>] kjournald2+0xbd/0x220
Feb  9 10:53:24 mail1 kernel: [136200.651838]  [<ffffffff81084400>] ? autoremove_wake_function+0x0/0x40
Feb  9 10:53:24 mail1 kernel: [136200.651846]  [<ffffffff812236d0>] ? kjournald2+0x0/0x220
Feb  9 10:53:24 mail1 kernel: [136200.651853]  [<ffffffff81084086>] kthread+0x96/0xa0
Feb  9 10:53:24 mail1 kernel: [136200.651861]  [<ffffffff810131ea>] child_rip+0xa/0x20
Feb  9 10:53:24 mail1 kernel: [136200.651869]  [<ffffffff81083ff0>] ? kthread+0x0/0xa0
Feb  9 10:53:24 mail1 kernel: [136200.651876]  [<ffffffff810131e0>] ? child_rip+0x0/0x20

From that moment on, many other blocked-task errors appeared 
(postfix, pickup and so on). The machine load was above 25!
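
For anyone triaging a similar log, here is a minimal Python sketch that 
counts the "blocked for more than N seconds" reports per task name, so it 
is easier to see which task hung first and which ones followed. It assumes 
each syslog entry is on a single line (not wrapped by a mail client). The 
function name summarize_hung_tasks and the regular expression are 
illustrative, not part of any DRBD or kernel tool.

```python
import re
from collections import Counter

# Matches the kernel hung-task report, e.g.
# "INFO: task jbd2/drbd0-8:13739 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def summarize_hung_tasks(lines):
    """Return a Counter mapping task name -> number of hung-task reports."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            counts[match.group("comm")] += 1
    return counts


# Two lines taken from the log above (unwrapped); only the first is a report.
sample = [
    "Feb  9 10:53:24 mail1 kernel: [136200.650336] INFO: task "
    "jbd2/drbd0-8:13739 blocked for more than 120 seconds.",
    "Feb  9 10:53:24 mail1 kernel: [136200.651684] Call Trace:",
]

for name, count in summarize_hung_tasks(sample).most_common():
    print(name, count)
```

To run it against a real log, pass an open file object instead of the 
sample list; the location of the kernel log depends on the distribution.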

Obviously I could no longer use the machine, and I had to kill it in 
order to force the takeover by the slave. Halt didn't work either.

My question is: why did I get this error? What can I do to avoid it?

Thanks

-- 
Dario Fiumicello - Antek S.r.l.
+3902890380 73