[DRBD-user] Drbd : PingAsk timeout, about 10 mins.

Tue Aug 21 12:50:12 CEST 2012

On Tue, Aug 21, 2012 at 03:40:34PM +0800, simon wrote:
> Hi Pascal,
>
> I can’t reproduce the error because the condition that triggers it is
> very unusual. The Master host was in a “not real dead” state
> (I doubt it is a Linux panic). The TCP stack on the Master host may
> have been broken. I am not trying to prevent that condition right now,
> because I can’t reproduce it. I only want the switch from Master to
> Slave to succeed so that my service stays available. But I can’t
> switch properly because of DRBD’s 10-minute delay.

Well. If it was "not real dead", then I'd suspect that the DRBD
connection was still "sort of up", and thus DRBD saw the other node as
Primary still, and correctly refused to be promoted locally.

To have your cluster recover from an "almost but not quite dead node"
scenario, you need to add stonith, aka node level fencing, to your
cluster stack.

> I ran “drbdsetup 0 show” on my host; it shows the following:
> 
> disk {
>         size                    0s _is_default; # bytes
>         on-io-error             detach;
>         fencing                 dont-care _is_default;
>         max-bio-bvecs           0 _is_default;
> }
> 
> net {
>         timeout                 60 _is_default; # 1/10 seconds
>         max-epoch-size          2048 _is_default;
>         max-buffers             2048 _is_default;
>         unplug-watermark        128 _is_default;
>         connect-int             10 _is_default; # seconds
>         ping-int                10 _is_default; # seconds
>         sndbuf-size             0 _is_default; # bytes
>         rcvbuf-size             0 _is_default; # bytes
>         ko-count                0 _is_default;
>         allow-two-primaries;

Uh. You are sure about that?

Two primaries, and dont-care for fencing?

You are aware that you just subscribed to data corruption, right?

If you want two primaries, you MUST have proper fencing,
on both the cluster level (stonith) and the drbd level (fencing
resource-and-stonith; fence-peer handler: e.g. crm-fence-peer.sh).
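
As a minimal sketch of the DRBD side of that, in DRBD 8.3 drbd.conf
syntax: the resource name r0 is hypothetical, and the /usr/lib/drbd/
path is an assumption that may differ on your distribution. The
unfence handler is included because crm-fence-peer.sh places a
constraint in the cluster that has to be removed again after resync.
The stonith device itself is configured in the cluster stack, not here.

```
resource r0 {                          # hypothetical resource name
        disk {
                # freeze I/O and call the fence-peer handler
                # when the replication link is lost
                fencing resource-and-stonith;
        }
        handlers {
                # assumed install path; adjust to your system
                fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
}
```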

>         after-sb-0pri           discard-least-changes;
>         after-sb-1pri           discard-secondary;

And here you configure automatic data loss.
Which is ok, as long as you are aware of that and actually mean it...
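
For comparison, a sketch of the conservative alternative, which never
discards data automatically and leaves split-brain resolution to the
admin. As far as I know, "disconnect" is the default for all three
settings in 8.3, so simply dropping the two lines above from the
config should have the same effect.

```
net {
        # no automatic split-brain recovery: just drop the connection
        # and let a human decide which side's data survives
        after-sb-0pri   disconnect;
        after-sb-1pri   disconnect;
        after-sb-2pri   disconnect;
}
```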

> 
>         after-sb-2pri           disconnect _is_default;
>         rr-conflict             disconnect _is_default;
>         ping-timeout            5 _is_default; # 1/10 seconds
> }
> 
> syncer {
>         rate                    102400k; # bytes/second
>         after                   -1 _is_default;
>         al-extents              257;
> }
> 
> protocol C;
> _this_host {
>         device                  minor 0;
>         disk                    "/dev/cciss/c0d0p7";
>         meta-disk               internal;
>         address                 ipv4 172.17.5.152:7900;
> }
> 
> _remote_host {
>         address                 ipv4 172.17.5.151:7900;
> }
> 
> In the list, there is “timeout                 60 _is_default; # 1/10 seconds”.

Then guess what, maybe the timeout did not trigger,
because the peer was still "sort of" responsive?
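
For reference, the timeout-related values from the output above with
the units spelled out. The comments are my reading of the drbd.conf
man page for 8.3, so treat them as a sketch rather than a definition.
Note that all of them only fire when the peer stops answering; a peer
that still answers, however slowly or partially, may never trip them.

```
net {
        timeout         60;     # 1/10 s: 6 s to answer an expected packet
        ping-int        10;     # s: idle time before a keep-alive ping is sent
        ping-timeout    5;      # 1/10 s: 0.5 s to answer that keep-alive ping
        ko-count        0;      # 0 disables expelling a peer that fails to
                                # complete a write within ko-count * timeout
}
```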

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed