On Wed, Feb 9, 2011 at 6:30 AM, Dario Fiumicello - Antek <span dir="ltr"><<a href="mailto:fiumicello@antek.it">fiumicello@antek.it</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Hi all, I have two Virtualbox VM running on two different physical hosts. The vm are interconnected with two gigabit ethernet for drbd sync and heartbeat.<br>
<br>
Suddenly I get this on master machine:<br>
<br>
Feb 9 10:53:24 mail1 kernel: [136200.650336] INFO: task jbd2/drbd0-8:13739 blocked for more than 120 seconds.<br>
Feb 9 10:53:24 mail1 kernel: [136200.650967] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.<br></blockquote><div><br></div><div>This is a warning, not an error. It simply states that a task has been blocked, typically waiting on I/O, for more than 120 seconds. Some tasks legitimately take longer than that to complete; the message itself is purely informative.</div>
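<div><br></div><div>To see which tasks are blocked at a given moment, one option is to scan /proc for tasks in uninterruptible sleep (state "D"), which is the state the hung-task warning reports on. The following is only a minimal sketch assuming a standard Linux /proc layout; the function names parse_stat and blocked_tasks are made up for illustration.</div>

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/PID/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split around the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left].strip())
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Collect (pid, comm) for tasks currently in uninterruptible sleep."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited while scanning, or unexpected format
        if state == "D":
            found.append((pid, comm))
    return found
```

<div>Calling blocked_tasks() a few times in a row shows whether the same tasks stay stuck or whether they are just slow and eventually make progress.</div>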
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><br>
And from this moment many other errors of blocked tasks appears (postfix, pickup and so on). The machine load was more than 25!<br></blockquote><div><br></div><div>It sounds like the DRBD block device is hung, probably because of slow I/O response from one of the backing devices on your VMs.</div>
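<div><br></div><div>On DRBD 8.x the connection, role and disk states of each resource are exposed in /proc/drbd, so that is a reasonable first place to look when the device seems stuck. Below is a rough sketch of parsing that status line; the helper name parse_drbd_status is hypothetical and the sample text is an illustrative example of the format, not output from the machines discussed here.</div>

```python
import re

# Matches the per-resource status line of /proc/drbd (DRBD 8.x layout),
# e.g. " 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----"
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+ro:(?P<ro>\S+)\s+ds:(?P<ds>\S+)"
)


def parse_drbd_status(text):
    """Return one dict per DRBD resource found in /proc/drbd-style text."""
    resources = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if not match:
            continue  # version banner or counters line
        local_role, _, peer_role = match.group("ro").partition("/")
        local_disk, _, peer_disk = match.group("ds").partition("/")
        resources.append({
            "minor": int(match.group("minor")),
            "connection": match.group("cs"),
            "local_role": local_role,
            "peer_role": peer_role,
            "local_disk": local_disk,
            "peer_disk": peer_disk,
        })
    return resources


# Illustrative sample only (not taken from the original report).
SAMPLE = """version: 8.3.7 (api:88/proto:86-91)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
"""
```

<div>A connection state other than Connected, or a disk state other than UpToDate on either side, would point at the replication link or the peer rather than at local load.</div>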
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<br>
Obviously I cannot use the machine anymore and I needed to kill it in order to force the takeover on the slave. Halt didn't work either.<br></blockquote><div><br></div><div>That's not obvious at all. Your system shouldn't be entirely on DRBD. Even if your DRBD block device is unresponsive, you should still be able to log in and look around. What was the actual CPU usage? On Linux, the load average also counts tasks blocked on I/O, so a load of 25 does not by itself mean the CPU was busy.</div>
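<div><br></div><div>On Linux the load average includes tasks in uninterruptible sleep, so a high load can mean many tasks waiting on I/O rather than a busy CPU. One way to tell the two apart is to compare /proc/loadavg with the iowait share computed from two samples of /proc/stat. This is a sketch under that assumption; the function names are hypothetical and the numbers in the usage note are invented.</div>

```python
def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def cpu_times(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_fraction(before, after):
    """Share of elapsed CPU time spent in iowait between two samples."""
    # Field order: user nice system idle iowait irq softirq steal.
    # Guest fields, if present, are already included in user/nice,
    # so only the first eight counters are summed.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return deltas[4] / total if total else 0.0
```

<div>For example, take cpu_times() of /proc/stat twice a few seconds apart and pass both results to iowait_fraction(). A large iowait share together with a high load average suggests the tasks are stuck waiting on storage, not competing for CPU.</div>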
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<br>
My question is: why did I get this error? What can I do to avoid it?<br></blockquote><div><br></div><div>You most likely got this warning because one of your VMs couldn't keep up with the I/O, probably due to load on one of the host servers. You can avoid it by running on bare metal.</div>
<div><br></div><div>The VMs are on different host servers, right?</div><div><br></div><div>-JR </div></div>