On Thu, Aug 23, 2012 at 09:45:21AM +0800, simon wrote:
> Hi Lars Ellenberg,
>
> The master host has two network cards, eth0 and eth1. DRBD uses eth0. "Not
> really dead" means eth0 is dead (we can see that in the HA log). Eth1 still
> answers ping, but we can't log in via ssh.
> So I think Linux may have panicked.
>
> Eth0 is dead, but DRBD doesn't detect it and return immediately. Why?
As I said, most likely because eth0 was not as dead as you think it was.
And read again what I said about fencing and stonith.
> Thanks.
Cheers.
> Date: Tue, 21 Aug 2012 12:50:12 +0200
> From: Lars Ellenberg <lars.ellenberg at linbit.com>
> Subject: Re: [DRBD-user] Drbd : PingAsk timeout, about 10 mins.
> To: drbd-user at lists.linbit.com
> Message-ID: <20120821105012.GG20059 at soda.linbit>
> Content-Type: text/plain; charset=utf-8
>
> On Tue, Aug 21, 2012 at 03:40:34PM +0800, simon wrote:
> > Hi Pascal,
> >
> >
> >
> > I can't reproduce the error because the conditions under which it
> > occurs are very special. The master host is in the "not really dead"
> > state (I suspect a Linux kernel panic). The TCP stack may be broken
> > on the master host. I don't want to work around it now, because I
> > can't reproduce it. I only want the switch from master to slave to
> > succeed, so that my service keeps running normally. But the
> > switchover fails because of DRBD's 10-minute delay.
>
> Well. If it was "not real dead", then I'd suspect that the DRBD
> connection was still "sort of up", and thus DRBD saw the other node as
> Primary still, and correctly refused to be promoted locally.
>
>
> To have your cluster recover from an "almost but not quite dead node"
> scenario, you need to add stonith, a.k.a. node-level fencing, to your
> cluster stack.
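As a concrete illustration of that advice, a Pacemaker stonith setup in crm shell syntax might look roughly like the sketch below. All hostnames, IP addresses, and credentials are placeholders, and the IPMI plugin is just one common choice; adapt to whatever power-fencing hardware you actually have.

```
# crm configure sketch -- names, addresses, and credentials are made up
primitive st-node1 stonith:external/ipmi \
    params hostname=node1 ipaddr=192.168.1.101 userid=admin passwd=secret interface=lan
primitive st-node2 stonith:external/ipmi \
    params hostname=node2 ipaddr=192.168.1.102 userid=admin passwd=secret interface=lan
# A node must never be responsible for shooting itself:
location l-st-node1 st-node1 -inf: node1
location l-st-node2 st-node2 -inf: node2
property stonith-enabled=true
```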
>
>
> > I ran "drbdsetup 0 show" on my host; its output is as follows:
> >
> > disk {
> > size 0s _is_default; # bytes
> > on-io-error detach;
> > fencing dont-care _is_default;
> > max-bio-bvecs 0 _is_default;
> > }
> >
> > net {
> > timeout 60 _is_default; # 1/10 seconds
> > max-epoch-size 2048 _is_default;
> > max-buffers 2048 _is_default;
> > unplug-watermark 128 _is_default;
> > connect-int 10 _is_default; # seconds
> > ping-int 10 _is_default; # seconds
> > sndbuf-size 0 _is_default; # bytes
> > rcvbuf-size 0 _is_default; # bytes
> > ko-count 0 _is_default;
> > allow-two-primaries;
>
>
> Uh. You are sure about that?
>
> Two primaries, and dont-care for fencing?
>
> You are aware that you just subscribed to data corruption, right?
>
> If you want two primaries, you MUST have proper fencing,
> on both the cluster level (stonith) and the drbd level (fencing
> resource-and-stonith; fence-peer handler: e.g. crm-fence-peer.sh).
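In drbd.conf terms, the combination Lars describes would look roughly like the sketch below (resource name and device details omitted; the handler paths are the ones conventionally shipped with the DRBD/Pacemaker integration scripts, so verify them on your installation):

```
resource r0 {
  net {
    allow-two-primaries;
  }
  disk {
    # Freeze I/O and fence the peer before promoting anyone:
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```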
>
> > after-sb-0pri discard-least-changes;
> > after-sb-1pri discard-secondary;
>
> And here you configure automatic data loss.
> Which is ok, as long as you are aware of that and actually mean it...
>
>
> >
> > after-sb-2pri disconnect _is_default;
> > rr-conflict disconnect _is_default;
> > ping-timeout 5 _is_default; # 1/10 seconds
> > }
> >
> > syncer {
> > rate 102400k; # bytes/second
> > after -1 _is_default;
> > al-extents 257;
> > }
> >
> > protocol C;
> > _this_host {
> > device minor 0;
> > disk "/dev/cciss/c0d0p7";
> > meta-disk internal;
> > address ipv4 172.17.5.152:7900;
> > }
> >
> > _remote_host {
> > address ipv4 172.17.5.151:7900;
> > }
> >
> >
> >
> >
> >
> > In the listing, there is "timeout 60 _is_default; # 1/10 seconds".
>
> Then guess what, maybe the timeout did not trigger,
> because the peer was still "sort of" responsive?
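For what it's worth, the deci-second units in that output trip people up; here is the plain arithmetic for the values quoted above (nothing DRBD-specific):

```python
# DRBD reports several of its timeouts in deci-seconds (1/10 s),
# which is easy to misread. Converting the values from the
# "drbdsetup 0 show" output quoted above:

timeout = 60 / 10        # net timeout: 6.0 seconds
ping_int = 10            # seconds between DRBD keep-alive pings
ping_timeout = 5 / 10    # wait for a ping ack: 0.5 seconds

# With these defaults, a peer that is *cleanly* dead should be
# detected within roughly one ping interval plus the ping timeout:
worst_case = ping_int + ping_timeout
print(timeout, worst_case)  # 6.0 10.5
```

None of these values comes anywhere near ten minutes, which supports the point above: the DRBD timeouts most likely never fired because the peer still appeared "sort of" responsive at the TCP level.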
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
--
please don't Cc me, but send to list -- I'm subscribed