On 09/02/2011 20:00, J. Ryan Earl wrote:
> On Wed, Feb 9, 2011 at 6:30 AM, Dario Fiumicello - Antek
> <fiumicello at antek.it> wrote:
>
>> Hi all, I have two Virtualbox VM running on two different physical hosts.
>> The vm are interconnected with two gigabit ethernet for drbd sync and
>> heartbeat.
>>
>> Suddenly I get this on master machine:
>>
>> Feb 9 10:53:24 mail1 kernel: [136200.650336] INFO: task jbd2/drbd0-8:13739
>> blocked for more than 120 seconds.
>> Feb 9 10:53:24 mail1 kernel: [136200.650967] "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>
> This is a warning, not an error. It simply states that some task has
> been working for more than 2 minutes. Some tasks legitimately take more
> than 120 seconds to complete, the above is simply informative.
>
>> And from this moment many other errors of blocked tasks appears (postfix,
>> pickup and so on). The machine load was more than 25!
>>
> It sounds like the DRBD block device is hung due to slow I/O response from
> one of the backing-devices on your VMs.
>
>> Obviously I cannot use the machine anymore and I needed to kill it in order
>> to force the takeover on the slave. Halt didn't work either.
>>
> That's not obvious at all. Your system shouldn't be entirely on DRBD. Even
> if your DRBD block device is unresponsive you should still be able to login
> and look around. What was your CPU load?

Sorry, I wasn't precise. The machine still lets me log in via ssh and check
its status. Uptime shows a load average of 25, while top shows almost 0% CPU
usage. I suspect this is due to a hang in I/O. The services I wasn't able to
use were the ones using drbd (postfix, dovecot and so on). When I tried to
force a takeover on the other machine I couldn't, because the hung master
didn't release its resources.

>> My question is: why did I get this error? What can I do to avoid it?
>>
> You got this error because one of your VMs likely couldn't keep up, likely
> caused by load on one of the host servers. You can avoid it by going
> bare-metal.
>
> The VMs are on different host servers right?

Exactly, the VMs are on two sibling hosts, each with three SATA HDs in RAID1.

Thank you for your answer,
cheers

-- 
Dario Fiumicello - Antek S.r.l.
+3902890380 73
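
A note on the symptom described above (load average of 25 with almost idle CPU): on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, and the hung-task warning is printed for tasks that stay in state D longer than hung_task_timeout_secs. The Python sketch below is only an illustration of how one might list such tasks by reading /proc/<pid>/stat. The function names parse_stat and blocked_tasks are made up for this example, and it looks at processes only, not at individual threads.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the text of a /proc/<pid>/stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren].strip())
    comm = stat_line[open_paren + 1:close_paren]
    # The first field after the command name is the one-letter task state.
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    tasks = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the file was unreadable
        if state == "D":
            tasks.append((pid, comm))
    return tasks


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

If the DRBD device is stuck, one would expect the jbd2/drbd0-8 kernel thread and the services writing to that filesystem (postfix, dovecot) to show up in this list; that is an expectation drawn from the log above, not something verified on the machine in question.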