[DRBD-user] jbd2/drbd0 blocked for more than 120 seconds

Wed Feb 9 20:25:55 CET 2011

On 09/02/2011 20:00, J. Ryan Earl wrote:
> On Wed, Feb 9, 2011 at 6:30 AM, Dario Fiumicello - Antek
> <fiumicello at antek.it> wrote:
>
>> Hi all, I have two VirtualBox VMs running on two different physical hosts.
>> The VMs are interconnected with two gigabit Ethernet links for DRBD sync
>> and heartbeat.
>>
>> Suddenly I get this on the master machine:
>>
>> Feb  9 10:53:24 mail1 kernel: [136200.650336] INFO: task jbd2/drbd0-8:13739 blocked for more than 120 seconds.
>> Feb  9 10:53:24 mail1 kernel: [136200.650967] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>
> This is a warning, not an error.  It simply states that a task has
> been working for more than 2 minutes.  Some tasks legitimately take more
> than 120 seconds to complete; the above is simply informative.
>
>
>> From that moment on, many other blocked-task errors appeared (postfix,
>> pickup and so on). The machine load was more than 25!
>>
> It sounds like the DRBD block device is hung due to slow I/O response from
> one of the backing-devices on your VMs.
>
>
>> Obviously I could not use the machine anymore and had to kill it in order
>> to force the takeover on the slave. Halt didn't work either.
>>
> That's not obvious at all.  Your system shouldn't be entirely on DRBD. Even
> if your DRBD block device is unresponsive you should still be able to log in
> and look around.  What was your CPU load?

Sorry, I wasn't precise. The machine still allowed me to log in via ssh 
and check its status. uptime showed a load average of 25 while top showed 
almost 0% CPU usage. I suspect this was due to an I/O hang. The 
services I wasn't able to use were the ones using DRBD (postfix, 
dovecot and so on). When I tried to force a takeover on the other 
machine I couldn't, because the hung master didn't 
release its resources.
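
On Linux the load average counts tasks in uninterruptible sleep (state D) 
as well as runnable ones, which is how it can reach 25 with an idle CPU. 
A minimal sketch of how one might list the tasks stuck in state D by 
reading /proc/<pid>/stat follows. The function names parse_stat_line and 
blocked_tasks are illustrative, not part of any tool, and the sketch only 
looks at top-level process entries, not at individual threads.

```python
import os


def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or entry unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

If the suspicion about an I/O hang is right, running this on the hung 
master would be expected to list jbd2/drbd0-8 together with the mail 
processes waiting on the DRBD-backed filesystem.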

>> My question is: why did I get this error? What can I do to avoid it?
>>
> You got this error because one of your VMs likely couldn't keep up, likely
> caused by load on one of the host servers.  You can avoid it by going
> bare-metal.
>
> The VMs are on different host servers right?
Exactly, the VMs are on two sibling hosts, each with three SATA HDs in RAID1.

Thank you for your answer, cheers

-- 
Dario Fiumicello - Antek S.r.l.
+3902890380 73