[DRBD-user] drbd-user Digest, Vol 65, Issue 4

Fri Dec 4 10:22:53 CET 2009

Hi Jeff,

thanks for the response.

I'm using "rsize=32768,wsize=32768,nointr,timeo=300,noatime"

But I don't see what that has to do with a DRBD Primary not detecting that its Secondary is broken
and disconnecting it, or with the Secondary itself not realising it's broken when it should have
disconnected by itself.

I can reproduce the issue without NFS, just using local filesystem interaction on the Primary.

Am I missing something about how write timeouts work on DRBD?
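
For reference, here is a minimal sketch of the arithmetic behind the net settings quoted further down (ko-count 2, timeout 20). It assumes the semantics described in the drbd.conf man page for 8.3: "timeout" is given in tenths of a second, and a Secondary that fails to complete a single write request within ko-count times the timeout should be expelled, with ko-count 0 disabling the feature. The function name is made up for illustration and is not part of DRBD.

```python
from typing import Optional


def expel_window_seconds(ko_count: int, timeout_tenths: int) -> Optional[float]:
    """Seconds a single write may stall before the peer should be expelled.

    Assumption: drbd.conf(5) semantics for DRBD 8.3, where "timeout" is in
    tenths of a second and ko-count multiplies it. Returns None when
    ko-count is 0, since that disables the feature.
    """
    if ko_count == 0:
        return None
    return ko_count * timeout_tenths / 10.0


# Values from the net section quoted in this thread.
window = expel_window_seconds(ko_count=2, timeout_tenths=20)
print(window)
```

If that reading is right, a stalled write should get the Secondary dropped after a few seconds rather than hanging indefinitely, which is not what happens here.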

James

jeff wrote:
> Are you mounting the NFS volumes with a timeout? I seem to recall that
> an NFS timeout can really screw with a system, whether it's primary or
> secondary. I usually mount my NFS with the soft,timeo=30 options.
> 
> Hope that helps.
>> Message: 2
>> Date: Wed, 02 Dec 2009 09:21:45 +0000
>> From: James Masson <james.masson at tradefair.com>
>> Subject: Re: [DRBD-user] Primary not disconnecting Secondary with IO
>> 	problems
>> To: drbd-user at lists.linbit.com
>> Message-ID: <4B1631A9.4070605 at tradefair.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>>
>> Has anybody seen this before, or got any insight?
>>
>> James
>>
>> James Masson wrote:
>>   
>>> Hi list,
>>>
>>> I'm using DRBD and NFS to provide HA to Virtual Machine images between pairs of storage servers.
>>>
>>> Systems are RHEL 5.4 (kernel 2.6.18-164.el5) with drbd8.3 from CentOS Extras.
>>>
>>> We've been having issues where disk I/O problems on the DRBD Secondary stop all I/O to the Primary
>>> too. DRBD doesn't seem to recognise these disk I/O problems, and the Secondary isn't disconnected
>>> automatically. Everything just hangs.
>>>
>>> During this state:
>>> If I try a "drbdadm disconnect all" on the Primary, the command hangs.
>>> If I try this on the Secondary, the command eventually completes, and NFS I/O returns to normal
>>> operation on the Primary.
>>>
>>> I've tried the following things to fix this:
>>>
>>> 1) Putting in a custom local-io-error handler to hard reset the problem node.
>>>
>>> This never triggers, and neither does the default "detach".
>>>
>>> 2) Changing the net connection parameters to:
>>>
>>> 	net {
>>> 		ko-count 2;
>>> 		timeout 20;
>>> 	}
>>>
>>> Again, this never triggers.
>>>
>>>
>>> 3) Changing the protocol used from C to B
>>>
>>> Doesn't have any effect on the issue - I'd prefer to use C anyway.
>>>
>>>
>>> Any further ideas on how to track this issue down and fix it?
>>>
>>> thanks
>>>
>>> James Masson
>>> _______________________________________________
>>> drbd-user mailing list
>>> drbd-user at lists.linbit.com
>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>>     