[DRBD-user] Primary not disconnecting Secondary with IO, problems (Was: Re: drbd-user Digest, Vol 65, Issue 4)

James Masson james.masson at tradefair.com
Mon Dec 7 10:40:31 CET 2009


Hi Jeff,

The DRBD primary just stops accepting I/O to the DRBD volume. There's no I/O failure or timeout.
DRBD as a whole just hangs, primary and secondary, with no error messages (apart from the hardware
cause: the 3ware controller on the secondary losing its enclosure).

It's like the primary is waiting for a write acknowledgment from the secondary, and this blocks DRBD
I/O operations. This is expected, but a disconnect timeout should fire after a (configurable) time.
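
To make the expectation concrete, here is a rough sketch of the arithmetic for the net settings quoted further down (ko-count 2, timeout 20). It assumes the semantics described in the drbd.conf man page for 8.3: "timeout" is given in tenths of a second, and "ko-count" is the number of timeouts a single write request may miss before the peer is expelled, with 0 disabling the feature. The function name is made up for illustration and is not part of DRBD.

```python
from typing import Optional


def ko_deadline_seconds(timeout_tenths: int, ko_count: int) -> Optional[float]:
    """Seconds a write may stay unacknowledged before ko-count should fire.

    Assumes drbd.conf semantics: timeout is in units of 0.1 s, and
    ko-count 0 means the feature is disabled (no deadline at all).
    """
    if ko_count == 0:
        return None  # disabled: the primary would wait indefinitely
    return ko_count * timeout_tenths / 10.0


# Settings from the net section quoted in this thread.
print(ko_deadline_seconds(timeout_tenths=20, ko_count=2))  # 4.0
```

Under those assumptions the Secondary should have been dropped after roughly 4 seconds of an unacknowledged write, which is not what I'm seeing.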

Both Primary and Secondary have the ability to detect the issue, but neither does. That's what's
puzzling.

Disk I/O on the Primary to non-drbd filesystems continues just fine.

Disk I/O on the Secondary fails as a whole, as expected, as the system just lost all its disks.

Can anyone tell me which side of DRBD (primary/secondary) is supposed to detect I/O failure and
disconnect the DRBD group in situations like this? Is it one side, or both?

James

Jeff Orr wrote:
> Whoops, guess I fixated on one aspect of the problem. Specifically, the
> "I/O errors on the secondary stop I/O on the primary". I was thinking
> that NFS problems were affecting one or both hosts. I don't think the
> master will ever deliberately disconnect from the secondary, unless a
> split-brain occurs.
> 
> Can you go into more detail about what you mean by "stops I/O on the
> primary"? Do I/O requests to the DRBD volume start failing? Does the
> primary freeze/act odd in any other way?
> 
> James Masson wrote:
>> Hi Jeff,
>>
>> thanks for the response.
>>
>> I'm using "rsize=32768,wsize=32768,nointr,timeo=300,noatime"
>>
>> But I don't see what that has to do with a DRBD Primary not detecting that its Secondary is broken,
>> and disconnecting it. Or the Secondary itself not realising it's broken, when it should have
>> disconnected by itself.
>>
>> I can reproduce the issue without NFS, just using local filesystem interaction on the Primary.
>>
>> Am I missing something about how write timeouts work on DRBD?
>>
>> James
>>
>> jeff wrote:
>>   
>>> Are you mounting the NFS volumes with a timeout? I seem to recall that
>>> an NFS timeout can really screw with a system, whether it's primary or
>>> secondary. I usually mount my NFS with the soft,timeo=30 options.
>>>
>>> Hope that helps.
>>>     
>>>> Message: 2
>>>> Date: Wed, 02 Dec 2009 09:21:45 +0000
>>>> From: James Masson <james.masson at tradefair.com>
>>>> Subject: Re: [DRBD-user] Primary not disconnecting Secondary with IO
>>>> 	problems
>>>> To: drbd-user at lists.linbit.com
>>>> Message-ID: <4B1631A9.4070605 at tradefair.com>
>>>> Content-Type: text/plain; charset=ISO-8859-1
>>>>
>>>>
>>>> has anybody seen this before, got any insight?
>>>>
>>>> James
>>>>
>>>> James Masson wrote:
>>>>   
>>>>       
>>>>> Hi list,
>>>>>
>>>>> I'm using DRBD and NFS to provide HA to Virtual Machine images between pairs of storage servers.
>>>>>
>>>>> Systems are RHEL 5.4 (kernel 2.6.18-164.el5) + drbd 8.3 from CentOS Extras
>>>>>
>>>>> We've been having issues where disk I/O problems on the DRBD Secondary stop all I/O to the Primary
>>>>> too. DRBD doesn't seem to recognise these disk I/O problems, and the Secondary isn't disconnected
>>>>> automatically. Everything just hangs.
>>>>>
>>>>> During this state:
>>>>> If I try a "drbdadm disconnect all" on the Primary, the command hangs.
>>>>> If I try this on the Secondary, the command eventually completes, and NFS I/O returns to normal
>>>>> operation on the Primary.
>>>>>
>>>>> I've tried the following things to fix this:
>>>>>
>>>>> 1) Putting in a custom local-io-error handler to hard reset the problem node.
>>>>>
>>>>> This never triggers, just as the default "detach" never triggers.
>>>>>
>>>>> 2) Changing the net connection parameters to:
>>>>>
>>>>> 	net {
>>>>> 		ko-count 2;
>>>>> 		timeout 20;
>>>>> 	}
>>>>>
>>>>> Again, this never triggers.
>>>>>
>>>>>
>>>>> 3) Changing the protocol used from C to B
>>>>>
>>>>> Doesn't have any effect on the issue - I'd prefer to use C anyway.
>>>>>
>>>>>
>>>>> Any further ideas on how to track this issue down and fix it?
>>>>>
>>>>> thanks
>>>>>
>>>>> James Masson
>>>>> _______________________________________________
>>>>> drbd-user mailing list
>>>>> drbd-user at lists.linbit.com
>>>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>>>>     
>>>>>         


