Hi Jeff,

The DRBD primary just stops accepting I/O to the DRBD volume. There's no I/O
failure or timeout. DRBD as a whole just hangs, primary and secondary, with no
error messages (apart from the hardware cause - the 3ware controller on the
secondary losing its enclosure). It's as if the primary is waiting for a write
acknowledgment from the secondary, and this blocks DRBD I/O operations. That
part is expected, but a disconnect timeout should fire after a (configurable)
time. Both Primary and Secondary have the ability to detect the issue, yet
neither does - that's what's puzzling.

Disk I/O on the Primary to non-DRBD filesystems continues just fine. Disk I/O
on the Secondary fails as a whole, as expected, since that system just lost
all its disks.

Can anyone tell me which side of DRBD (primary/secondary) is supposed to
detect I/O failure and disconnect the DRBD group in situations like this? Is
it one side, or both?

James

Jeff Orr wrote:
> Whoops, guess I fixated on one aspect of the problem. Specifically, the
> "I/O errors on the secondary stop I/O on the primary". I was thinking
> that NFS problems were affecting one or both hosts. I don't think the
> master will ever deliberately disconnect from the secondary, unless a
> split-brain occurs.
>
> Can you go into more detail about what you mean by "stops I/O on the
> primary"? Do I/O requests to the DRBD volume start failing? Does the
> primary freeze/act odd in any other way?
>
> James Masson wrote:
>> Hi Jeff,
>>
>> thanks for the response.
>>
>> I'm using "rsize=32768,wsize=32768,nointr,timeo=300,noatime"
>>
>> But I don't see what that has to do with a DRBD Primary not detecting
>> that its Secondary is broken, and disconnecting it. Or the Secondary
>> itself not realising it's broken, when it should have disconnected by
>> itself.
>>
>> I can reproduce the issue without NFS, just using local filesystem
>> interaction on the Primary.
>>
>> Am I missing something about how write timeouts work on DRBD?
>>
>> James
>>
>> jeff wrote:
>>
>>> Are you mounting the NFS volumes with a timeout? I seem to recall that
>>> an NFS timeout can really screw with a system, whether it's primary or
>>> secondary. I usually mount my NFS with the soft,timeo=30 options.
>>>
>>> Hope that helps.
>>>
>>>> Message: 2
>>>> Date: Wed, 02 Dec 2009 09:21:45 +0000
>>>> From: James Masson <james.masson at tradefair.com>
>>>> Subject: Re: [DRBD-user] Primary not disconnecting Secondary with IO
>>>>     problems
>>>> To: drbd-user at lists.linbit.com
>>>> Message-ID: <4B1631A9.4070605 at tradefair.com>
>>>> Content-Type: text/plain; charset=ISO-8859-1
>>>>
>>>> has anybody seen this before, got any insight?
>>>>
>>>> James
>>>>
>>>> James Masson wrote:
>>>>
>>>>> Hi list,
>>>>>
>>>>> I'm using DRBD and NFS to provide HA to Virtual Machine images
>>>>> between pairs of storage servers.
>>>>>
>>>>> Systems are RHEL5.4 2.6.18-164.el5 + drbd8.3 from CentOS Extras
>>>>>
>>>>> We've been having issues where disk I/O problems on the DRBD
>>>>> Secondary stop all I/O to the Primary too. DRBD doesn't seem to
>>>>> recognise these disk I/O problems; the Secondary isn't disconnected
>>>>> automatically. Everything just hangs.
>>>>>
>>>>> During this state:
>>>>> If I try a "drbdadm disconnect all" on the Primary, the command hangs.
>>>>> If I try this on the Secondary, the command eventually completes,
>>>>> and NFS I/O returns to normal operation on the Primary.
>>>>>
>>>>> I've tried the following things to fix this:
>>>>>
>>>>> 1) Putting in a custom local-io-error handler to hard reset the
>>>>> problem node.
>>>>>
>>>>> This never triggers. Just like the default "detach", it never
>>>>> triggers.
>>>>>
>>>>> 2) Changing the net connection parameters to:
>>>>>
>>>>> net {
>>>>>     ko-count 2;
>>>>>     timeout 20;
>>>>> }
>>>>>
>>>>> Again, this never triggers.
>>>>>
>>>>> 3) Changing the protocol used from C to B
>>>>>
>>>>> Doesn't have any effect on the issue - I'd prefer to use C anyway.
>>>>>
>>>>> Any further ideas on how to track this issue down and fix it?
>>>>>
>>>>> thanks
>>>>>
>>>>> James Masson
>>>>> _______________________________________________
>>>>> drbd-user mailing list
>>>>> drbd-user at lists.linbit.com
>>>>> http://lists.linbit.com/mailman/listinfo/drbd-user
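For reference, the two failure paths discussed in this thread are governed by
separate parts of the DRBD 8.3 configuration: a failing local backing device
is handled by the disk section (on-io-error), while an unresponsive peer is
handled by the net-section timeouts. A sketch of both, with illustrative
values (not recommendations for any particular setup):

```
resource r0 {
  disk {
    on-io-error detach;   # detach from the backing device when it reports an I/O error
  }
  net {
    timeout      20;      # unit is 0.1s, so 2s: declare the peer dead if no ack arrives
    ko-count     2;       # drop the connection after 2 consecutively timed-out requests
    ping-int     10;      # seconds between keep-alive pings on an otherwise idle link
    ping-timeout  5;      # unit is 0.1s, so 0.5s to wait for a ping acknowledgment
  }
}
```

One caveat that may explain the behaviour reported above: on-io-error only
fires when the backing device actually completes a request with an error, and
the net timeouts only fire when the replication link itself misbehaves. A
controller that hangs requests indefinitely, while the node's network stack
keeps answering pings, may fall between the two mechanisms - this is a
possible reading of the symptoms, not a confirmed diagnosis.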