[DRBD-user] Primary not disconnecting Secondary with IO problems (Was: Re: drbd-user Digest, Vol 65, Issue 4)

Fri Dec 4 10:36:15 CET 2009

Whoops, guess I fixated on one aspect of the problem, specifically the
"I/O errors on the secondary stop I/O on the primary" part. I was thinking
that NFS problems were affecting one or both hosts. I don't think the
primary will ever deliberately disconnect from the secondary, unless a
split-brain occurs.

Can you go into more detail about what you mean by "stops I/O on the
primary"? Do I/O requests to the DRBD volume start failing? Does the
primary freeze/act odd in any other way?
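
On the write-timeout question quoted below, here is a rough sketch of the arithmetic behind the "ko-count 2; timeout 20;" settings. It assumes the semantics described in the drbd.conf man page: "timeout" is given in tenths of a second, and the secondary is expelled once it fails to complete a single write request for ko-count times the timeout. The helper name ko_window_seconds is made up for illustration and is not part of DRBD.

```python
# Sketch only: estimate how long the primary waits before ko-count kicks in.
# Assumptions (from the drbd.conf man page, not verified against this setup):
#   - `timeout` is expressed in tenths of a second
#   - the peer is expelled after failing to complete a write for
#     ko-count * timeout
#   - ko-count 0 (the default) disables the check entirely

def ko_window_seconds(timeout_tenths, ko_count):
    """Return the expel window in seconds, or None if ko-count is disabled."""
    if ko_count == 0:
        return None
    return ko_count * timeout_tenths / 10.0


# Values from the net section quoted below: ko-count 2, timeout 20.
print(ko_window_seconds(timeout_tenths=20, ko_count=2))
```

If those assumptions hold, the quoted settings should expel a stalled secondary after roughly four seconds, so a hang lasting much longer than that suggests the condition is not being detected at all.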

James Masson wrote:
> Hi Jeff,
>
> thanks for the response.
>
> I'm using "rsize=32768,wsize=32768,nointr,timeo=300,noatime"
>
> But I don't see what that has to do with a DRBD Primary not detecting that its Secondary is broken
> and disconnecting it, or with the Secondary itself not realising it's broken when it should have
> disconnected by itself.
>
> I can reproduce the issue without NFS, just using local filesystem interaction on the Primary.
>
> Am I missing something about how write timeouts work on DRBD?
>
> James
>
> jeff wrote:
>   
>> Are you mounting the NFS volumes with a timeout? I seem to recall that
>> an NFS timeout can really screw with a system, whether it's primary or
>> secondary. I usually mount my NFS with the soft,timeo=30 options.
>>
>> Hope that helps.
>>     
>>> Message: 2
>>> Date: Wed, 02 Dec 2009 09:21:45 +0000
>>> From: James Masson <james.masson at tradefair.com>
>>> Subject: Re: [DRBD-user] Primary not disconnecting Secondary with IO
>>> 	problems
>>> To: drbd-user at lists.linbit.com
>>> Message-ID: <4B1631A9.4070605 at tradefair.com>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>>
>>>
>>> has anybody seen this before, got any insight?
>>>
>>> James
>>>
>>> James Masson wrote:
>>>   
>>>       
>>>> Hi list,
>>>>
>>>> I'm using DRBD and NFS to provide HA to Virtual Machine images between pairs of storage servers.
>>>>
>>>> Systems are RHEL 5.4 (2.6.18-164.el5) + drbd8.3 from CentOS Extras.
>>>>
>>>> We've been having issues where disk I/O problems on the DRBD Secondary stop all I/O to the Primary
>>>> too. DRBD doesn't seem to recognise these disk I/O problems, and the Secondary isn't disconnected
>>>> automatically. Everything just hangs.
>>>>
>>>> During this state:
>>>> If I try a "drbdadm disconnect all" on the Primary, the command hangs.
>>>> If I try this on the Secondary, the command eventually completes, and NFS I/O returns to normal
>>>> operation on the Primary.
>>>>
>>>> I've tried the following things to fix this:
>>>>
>>>> 1) Putting in a custom local-io-error handler to hard reset the problem node.
>>>>
>>>> This never triggers, and neither does the default "detach".
>>>>
>>>> 2) Changing the net connection parameters to:
>>>>
>>>> 	net {
>>>> 		ko-count 2;
>>>> 		timeout 20;
>>>> 	}
>>>>
>>>> Again, this never triggers.
>>>>
>>>>
>>>> 3) Changing the protocol used from C to B
>>>>
>>>> Doesn't have any effect on the issue - I'd prefer to use C anyway.
>>>>
>>>>
>>>> Any further ideas on how to track this issue down and fix it?
>>>>
>>>> thanks
>>>>
>>>> James Masson
>>>> _______________________________________________
>>>> drbd-user mailing list
>>>> drbd-user at lists.linbit.com
>>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>>>     
>>>>         