[DRBD-user] Drbd : PingAsk timeout, about 10 mins.

Thu Aug 23 03:45:21 CEST 2012

Hi Lars Ellenberg,

The Master host has two network cards, eth0 and eth1. DRBD uses eth0. "Not
real dead" means that eth0 is dead (this can be seen in the HA log). eth1
still answers ping, but I can't log in over ssh.
So I think the Linux kernel may have panicked.

eth0 is dead, but DRBD does not detect this and return immediately. Why?

Thanks.

-----Original Message-----
From: drbd-user-bounces at lists.linbit.com
[mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of
drbd-user-request at lists.linbit.com
Sent: Wednesday, August 22, 2012 18:00
To: drbd-user at lists.linbit.com
Subject: drbd-user Digest, Vol 97, Issue 23

Today's Topics:

   1. Re: Drbd : PingAsk timeout, about 10 mins. (Lars Ellenberg)

----------------------------------------------------------------------

Message: 1
Date: Tue, 21 Aug 2012 12:50:12 +0200
From: Lars Ellenberg <lars.ellenberg at linbit.com>
Subject: Re: [DRBD-user] Drbd : PingAsk timeout, about 10 mins.
To: drbd-user at lists.linbit.com
Message-ID: <20120821105012.GG20059 at soda.linbit>
Content-Type: text/plain; charset=utf-8

On Tue, Aug 21, 2012 at 03:40:34PM +0800, simon wrote:
> Hi Pascal,
> 
>  
> 
> I can't reproduce the error because the conditions that trigger it are
> very specific.  The Master host is in the "not real dead" state
> (I suspect a Linux panic). The TCP stack on the Master host may be
> broken. For now I am not trying to prevent this, because I can't
> reproduce it. I only want the switch from Master to Slave to succeed so
> that my service keeps running. But I can't switch over properly because
> of DRBD's 10 minute delay.

Well. If it was "not real dead", then I'd suspect that the DRBD
connection was still "sort of up", and thus DRBD saw the other node as
Primary still, and correctly refused to be promoted locally.

To have your cluster recover from an "almost but not quite dead node"
scenario, you need to add stonith aka node level fencing to your
cluster stack.

> I ran "drbdsetup 0 show" on my host; it shows the following:
> 
> disk {
>         size                    0s _is_default; # bytes
>         on-io-error             detach;
>         fencing                 dont-care _is_default;
>         max-bio-bvecs           0 _is_default;
> }
> 
> net {
>         timeout                 60 _is_default; # 1/10 seconds
>         max-epoch-size          2048 _is_default;
>         max-buffers             2048 _is_default;
>         unplug-watermark        128 _is_default;
>         connect-int             10 _is_default; # seconds
>         ping-int                10 _is_default; # seconds
>         sndbuf-size             0 _is_default; # bytes
>         rcvbuf-size             0 _is_default; # bytes
>         ko-count                0 _is_default;
>         allow-two-primaries;

Uh. You are sure about that?

Two primaries, and dont-care for fencing?

You are aware that you just subscribed to data corruption, right?

If you want two primaries, you MUST have proper fencing,
on both the cluster level (stonith) and the drbd level (fencing
resource-and-stonith; fence-peer handler: e.g. crm-fence-peer.sh).
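
As a rough illustration of the DRBD-level part only, a resource definition
in the 8.3-style configuration shown in this thread might look like the
fragment below. The resource name r0 and the /usr/lib/drbd/ script paths
are assumptions (paths vary by distribution), and cluster-level stonith
still has to be configured separately in the cluster manager.

```
resource r0 {                       # "r0" is a hypothetical resource name
  disk {
    # on loss of the peer: freeze I/O and call the fence-peer handler
    fencing resource-and-stonith;
  }
  handlers {
    # script paths are assumptions; check where your packages install them
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  # net, syncer and host sections stay as they are
}
```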

>         after-sb-0pri           discard-least-changes;
>         after-sb-1pri           discard-secondary;

And here you configure automatic data loss.
Which is ok, as long as you are aware of that and actually mean it...
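
For comparison, a sketch of the conservative alternative, assuming you
would rather resolve a split brain by hand: with all three policies set to
disconnect, the nodes drop the connection and nothing is discarded
automatically.

```
net {
  after-sb-0pri disconnect;   # no automatic resolution, just disconnect
  after-sb-1pri disconnect;
  after-sb-2pri disconnect;
}
```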

> 
>         after-sb-2pri           disconnect _is_default;
>         rr-conflict             disconnect _is_default;
>         ping-timeout            5 _is_default; # 1/10 seconds
> }
> 
> syncer {
>         rate                    102400k; # bytes/second
>         after                   -1 _is_default;
>         al-extents              257;
> }
> 
> protocol C;
> _this_host {
>         device                  minor 0;
>         disk                    "/dev/cciss/c0d0p7";
>         meta-disk               internal;
>         address                 ipv4 172.17.5.152:7900;
> }
> 
> _remote_host {
>         address                 ipv4 172.17.5.151:7900;
> }
> 
>  
> 
>  
> 
> In the list, there is "timeout 60 _is_default; # 1/10 seconds".

Then guess what, maybe the timeout did not trigger,
because the peer was still "sort of" responsive?
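
For reference, the units in the net section quoted above work out as
follows. This is only an annotated restatement of the posted defaults,
with comments paraphrasing the drbd.conf man page, not a tuning
recommendation.

```
net {
  timeout       60;   # 1/10 s, so 6 s to wait for an expected reply
  ping-int      10;   # s, idle time before a keep-alive ping is sent
  ping-timeout  5;    # 1/10 s, so 0.5 s for the peer to answer that ping
  ko-count      0;    # 0 disables expelling a peer that stalls writes
}
```

All of these fire only when the peer stops answering. A half-dead peer
whose kernel still responds on the TCP connection may never trip them.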

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed
