Hi Jeff,

thanks for the response.

I'm using "rsize=32768,wsize=32768,nointr,timeo=300,noatime"

But I don't see what that has to do with a DRBD Primary not detecting that its Secondary is broken, and disconnecting it. Or the Secondary itself not realising it's broken, when it should have disconnected by itself.

I can reproduce the issue without NFS, just using local filesystem interaction on the Primary.

Am I missing something about how write timeouts work on DRBD?

James

jeff wrote:
> Are you mounting the NFS volumes with a timeout? I seem to recall that
> an NFS timeout can really screw with a system, whether it's primary or
> secondary. I usually mount my NFS with the soft,timeo=30 options.
>
> Hope that helps.
>> Message: 2
>> Date: Wed, 02 Dec 2009 09:21:45 +0000
>> From: James Masson <james.masson at tradefair.com>
>> Subject: Re: [DRBD-user] Primary not disconnecting Secondary with IO
>> problems
>> To: drbd-user at lists.linbit.com
>> Message-ID: <4B1631A9.4070605 at tradefair.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>>
>> has anybody seen this before, got any insight?
>>
>> James
>>
>> James Masson wrote:
>>
>>> Hi list,
>>>
>>> I'm using DRBD and NFS to provide HA to Virtual Machine images between pairs of storage servers.
>>>
>>> Systems are RHEL5.4 2.6.18-164.el5 + drbd8.3 from Centos Extras
>>>
>>> We've been having issues where disk I/O problems on the DRBD Secondary stop all IO to the Primary
>>> too. DRBD doesn't seem to recognise these disk I/O problems, and the Secondary isn't disconnected
>>> automatically. Everything just hangs.
>>>
>>> During this state:
>>> If I try a "drbdadm disconnect all" on the Primary, the command hangs.
>>> If I try this on the Secondary, the command eventually completes, and NFS I/O returns to normal
>>> operation on the Primary.
>>>
>>> I've tried the following things to fix this:
>>>
>>> 1) Putting in a custom local-io-error handler to hard reset the problem node.
>>>
>>> This never triggers. Just like the default "detach", it never triggers.
>>>
>>> 2) Changing the net connection parameters to:
>>>
>>> net {
>>>     ko-count 2;
>>>     timeout 20;
>>> }
>>>
>>> Again, this never triggers.
>>>
>>>
>>> 3) Changing the protocol used from C to B
>>>
>>> Doesn't have any effect on the issue - I'd prefer to use C anyway.
>>>
>>>
>>> Any further ideas on how to track this issue down and fix it?
>>>
>>> thanks
>>>
>>> James Masson
>>> _______________________________________________
>>> drbd-user mailing list
>>> drbd-user at lists.linbit.com
>>> http://lists.linbit.com/mailman/listinfo/drbd-user
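
For readers following the thread, here is a rough sketch of what the quoted net settings (ko-count 2, timeout 20) would amount to. It assumes the drbd.conf semantics as I understand them for 8.3: "timeout" is given in tenths of a second, and the Primary expels the Secondary once a single write request has gone uncompleted for ko-count times that timeout, with a ko-count of 0 disabling the check. The function name is hypothetical and not part of DRBD.

```python
def ko_window_seconds(ko_count: int, timeout_tenths: int) -> float:
    """Seconds one write may stall before ko-count would expel the peer.

    Assumptions (not taken from the thread): timeout is in units of 0.1 s,
    the peer is dropped after ko_count timeout periods on a single request,
    and ko_count == 0 means the mechanism is disabled.
    """
    if ko_count == 0:
        return float("inf")  # disabled: the Primary waits indefinitely
    return ko_count * timeout_tenths / 10.0


# Settings quoted in the message: ko-count 2, timeout 20
print(ko_window_seconds(2, 20))
```

Under those assumptions the settings in the message work out to about 4 seconds, so a hang that lasts much longer than that would suggest the ko-count path is not being reached at all, rather than being configured too leniently.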