Note: "permalinks" may not be as permanent as we would like;
direct links to old sources may well be a few messages off.
On Thu, Aug 23, 2012 at 09:45:21AM +0800, simon wrote:
> Hi Lars Ellenberg,
>
> The Master Host has two network cards, eth0 and eth1. DRBD uses eth0.
> "Not real dead" means eth0 is dead (we can see that in the HA log).
> Eth1 answers pings fine, but we cannot log in via ssh.
> So I think maybe Linux has panicked.
>
> Eth0 is dead, but DRBD can't detect it and return immediately. Why?

As I said, most likely because eth0 was still not as dead as you think
it was. And read again what I said about fencing and stonith.

> Thanks. Cheers.
>
> Date: Tue, 21 Aug 2012 12:50:12 +0200
> From: Lars Ellenberg <lars.ellenberg at linbit.com>
> Subject: Re: [DRBD-user] Drbd : PingAsk timeout, about 10 mins.
> To: drbd-user at lists.linbit.com
> Message-ID: <20120821105012.GG20059 at soda.linbit>
> Content-Type: text/plain; charset=utf-8
>
> On Tue, Aug 21, 2012 at 03:40:34PM +0800, simon wrote:
> > Hi Pascal,
> >
> > I can't reproduce the error, because the condition under which it
> > occurs is very special. The Master host is in the "not real dead"
> > state (I suspect a Linux panic). The TCP stack may be broken on the
> > Master host. For now I don't want to work around it, because I can't
> > reproduce it. I only want the switch from Master to Slave to succeed,
> > so that my service can be provided normally. But the switchover does
> > not work correctly because of the 10-minute delay in DRBD.
>
> Well. If it was "not real dead", then I'd suspect that the DRBD
> connection was still "sort of up", and thus DRBD saw the other node as
> Primary still, and correctly refused to be promoted locally.
>
> To have your cluster recover from an "almost but not quite dead node"
> scenario, you need to add stonith aka node level fencing to your
> cluster stack.
>
> > I ran "drbdsetup 0 show" on my host; it shows the following:
> >
> >     disk {
> >         size             0s _is_default; # bytes
> >         on-io-error      detach;
> >         fencing          dont-care _is_default;
> >         max-bio-bvecs    0 _is_default;
> >     }
> >
> >     net {
> >         timeout          60 _is_default; # 1/10 seconds
> >         max-epoch-size   2048 _is_default;
> >         max-buffers      2048 _is_default;
> >         unplug-watermark 128 _is_default;
> >         connect-int      10 _is_default; # seconds
> >         ping-int         10 _is_default; # seconds
> >         sndbuf-size      0 _is_default; # bytes
> >         rcvbuf-size      0 _is_default; # bytes
> >         ko-count         0 _is_default;
> >         allow-two-primaries;
>
> Uh. You are sure about that?
>
> Two primaries, and dont-care for fencing?
>
> You are aware that you just subscribed to data corruption, right?
>
> If you want two primaries, you MUST have proper fencing,
> on both the cluster level (stonith) and the drbd level (fencing
> resource-and-stonith; fence-peer handler: e.g. crm-fence-peer.sh).
>
> >         after-sb-0pri    discard-least-changes;
> >         after-sb-1pri    discard-secondary;
>
> And here you configure automatic data loss.
> Which is ok, as long as you are aware of that and actually mean it...
>
> >         after-sb-2pri    disconnect _is_default;
> >         rr-conflict      disconnect _is_default;
> >         ping-timeout     5 _is_default; # 1/10 seconds
> >     }
> >
> >     syncer {
> >         rate             102400k; # bytes/second
> >         after            -1 _is_default;
> >         al-extents       257;
> >     }
> >
> >     protocol C;
> >     _this_host {
> >         device    minor 0;
> >         disk      "/dev/cciss/c0d0p7";
> >         meta-disk internal;
> >         address   ipv4 172.17.5.152:7900;
> >     }
> >
> >     _remote_host {
> >         address   ipv4 172.17.5.151:7900;
> >     }
> >
> > In the list there is "timeout 60 _is_default; # 1/10 seconds".
>
> Then guess what, maybe the timeout did not trigger,
> because the peer was still "sort of" responsive?

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
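
For reference, a minimal drbd.conf sketch of the fencing setup Lars
describes for a dual-primary resource. This is an illustration only,
not from the thread: the resource name "r0" and the device/disk paths
are placeholders, and the handler paths are where DRBD's Pacemaker
integration scripts are commonly installed (they may differ per distro).

```
# Sketch: dual-primary DRBD with proper fencing (names/paths are placeholders).
resource r0 {
  protocol C;                        # required for two primaries

  net {
    allow-two-primaries;
  }

  disk {
    # Instead of the default "dont-care": freeze I/O on connection loss
    # and invoke the fence-peer handler before resuming.
    fencing resource-and-stonith;
  }

  handlers {
    # Scripts shipped with DRBD's Pacemaker integration.
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

As Lars notes, the DRBD-level fencing above is only half of it: stonith
must also be configured and enabled at the cluster level, so that an
"almost but not quite dead" node can actually be shot rather than left
half-responsive.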