Whoops, guess I fixated on one aspect of the problem. Specifically, the
"I/O errors on the secondary stop I/O on the primary". I was thinking
that NFS problems were affecting one or both hosts.

I don't think the master will ever deliberately disconnect from the
secondary, unless a split-brain occurs.

Can you go into more detail about what you mean by "stops I/O on the
primary"? Do I/O requests to the DRBD volume start failing? Does the
primary freeze/act odd in any other way?

James Masson wrote:
> Hi Jeff,
>
> thanks for the response.
>
> I'm using "rsize=32768,wsize=32768,nointr,timeo=300,noatime"
>
> But I don't see what that has to do with a DRBD Primary not detecting that it's Secondary is broken,
> and disconnecting it. Or the Secondary itself not realising it's broken, when it should have
> disconnected by itself.
>
> I can reproduce the issue without NFS, just using local filesystem interaction on the Primary.
>
> Am I missing something about how write timeouts work on DRBD?
>
> James
>
> jeff wrote:
>
>> Are you mounting the NFS volumes with a timeout? I seem to recall that
>> an NFS timeout can really screw with a system, whether it's primary or
>> secondary. I usually mount my NFS with the soft,timeo=30 options.
>>
>> Hope that helps.
>>
>>> Message: 2
>>> Date: Wed, 02 Dec 2009 09:21:45 +0000
>>> From: James Masson <james.masson at tradefair.com>
>>> Subject: Re: [DRBD-user] Primary not disconnecting Secondary with IO
>>> problems
>>> To: drbd-user at lists.linbit.com
>>> Message-ID: <4B1631A9.4070605 at tradefair.com>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>>
>>>
>>> has anybody seen this before, got any insight?
>>>
>>> James
>>>
>>> James Masson wrote:
>>>
>>>
>>>> Hi list,
>>>>
>>>> I'm using DRBD and NFS to provide HA to Virtual Machine images between pairs of storage servers.
>>>>
>>>> Systems are RHEL5.4 2.6.18-164.el5 + drbd8.3 from Centos Extras
>>>>
>>>> We've been having issues where disk I/O problems on the DRBD Secondary stops all IO to the Primary
>>>> too. DRBD doesn't seem to recognise these disk I/O problems, the Secondary isn't disconnected
>>>> automatically. Everything just hangs.
>>>>
>>>> During this state:
>>>> If I try a "drbdadm disconnect all" on the Primary, the command hangs.
>>>> If I try this on the Secondary, the command eventually completes, and NFS I/O returns to normal
>>>> operation on the Primary.
>>>>
>>>> I've tried the following things to fix this:
>>>>
>>>> 1) Putting in a custom local-io-error handler to hard reset the problem node.
>>>>
>>>> This never triggers. Just like the default "detach", never triggers.
>>>>
>>>> 2) Changing the net connection parameters to:
>>>>
>>>> net {
>>>>     ko-count 2;
>>>>     timeout 20;
>>>> }
>>>>
>>>> Again, this never triggers.
>>>>
>>>>
>>>> 3) Changing the protocol used from C to B
>>>>
>>>> Doesn't have any effect on the issue - I'd prefer to use C anyway.
>>>>
>>>>
>>>> Any further ideas on how to track this issue down and fix it?
>>>>
>>>> thanks
>>>>
>>>> James Masson
>>>> _______________________________________________
>>>> drbd-user mailing list
>>>> drbd-user at lists.linbit.com
>>>> http://lists.linbit.com/mailman/listinfo/drbd-user
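
For reference, a rough sketch of the arithmetic behind the net settings quoted above. It assumes the semantics described in the drbd.conf man page for 8.3: "timeout" is given in tenths of a second, the Primary expels a Secondary that fails to complete a single write within ko-count times the timeout, and a ko-count of 0 disables the check. The function name is made up for illustration; it is not part of DRBD.

```python
from typing import Optional


def ko_window_seconds(timeout_tenths: int, ko_count: int) -> Optional[float]:
    """Return how long (in seconds) a single write may stay unacknowledged
    before the Primary should expel the Secondary, or None if the ko-count
    check is disabled.

    Assumption: timeout is in units of 0.1 s and ko-count 0 means "off",
    as documented for DRBD 8.3.
    """
    if ko_count == 0:
        return None  # feature disabled, the Primary waits indefinitely
    return ko_count * timeout_tenths / 10.0


if __name__ == "__main__":
    # Settings from the quoted message: ko-count 2, timeout 20
    print(ko_window_seconds(20, 2))
```

With the quoted values this comes to 4 seconds, so a hang that lasts much longer than that without the Secondary being dropped suggests the check is not being reached at all, rather than the window being too long.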