Note: "permalinks" may not be as permanent as we would like;
direct links to old sources may well be a few messages off.
On Thu, Aug 23, 2012 at 09:45:21AM +0800, simon wrote:
> Hi Lars Ellenberg,
>
> The Master Host has two network cards, eth0 and eth1. DRBD uses eth0.
> "Not real dead" means eth0 is dead (we can see that in the HA log).
> Eth1 answers pings fine, but we cannot log in via ssh.
> So I think maybe Linux has panicked.
>
> Eth0 is dead, but DRBD can't detect it and return immediately. Why?

As I said, most likely because eth0 was still not as dead as you think
it was. And read again what I said about fencing and stonith.

> Thanks. Cheers.
>
> Date: Tue, 21 Aug 2012 12:50:12 +0200
> From: Lars Ellenberg <lars.ellenberg at linbit.com>
> Subject: Re: [DRBD-user] Drbd : PingAsk timeout, about 10 mins.
> To: drbd-user at lists.linbit.com
> Message-ID: <20120821105012.GG20059 at soda.linbit>
> Content-Type: text/plain; charset=utf-8
>
> On Tue, Aug 21, 2012 at 03:40:34PM +0800, simon wrote:
> > Hi Pascal,
> >
> > I can't reproduce the error, because the condition under which it
> > occurs is very special. The Master host is in the "not real dead"
> > state (I suspect a Linux panic). The TCP stack may be broken on the
> > Master host. For now I don't want to work around it, because I can't
> > reproduce it. I only want the switch from Master to Slave to succeed,
> > so that my service can be provided normally. But the switchover does
> > not work correctly because of the 10-minute delay in DRBD.
>
> Well. If it was "not real dead", then I'd suspect that the DRBD
> connection was still "sort of up", and thus DRBD saw the other node as
> Primary still, and correctly refused to be promoted locally.
>
> To have your cluster recover from an "almost but not quite dead node"
> scenario, you need to add stonith aka node level fencing to your
> cluster stack.
>
> > I ran "drbdsetup 0 show" on my host; it shows the following:
> >
> >     disk {
> >         size             0s _is_default; # bytes
> >         on-io-error      detach;
> >         fencing          dont-care _is_default;
> >         max-bio-bvecs    0 _is_default;
> >     }
> >
> >     net {
> >         timeout          60 _is_default; # 1/10 seconds
> >         max-epoch-size   2048 _is_default;
> >         max-buffers      2048 _is_default;
> >         unplug-watermark 128 _is_default;
> >         connect-int      10 _is_default; # seconds
> >         ping-int         10 _is_default; # seconds
> >         sndbuf-size      0 _is_default; # bytes
> >         rcvbuf-size      0 _is_default; # bytes
> >         ko-count         0 _is_default;
> >         allow-two-primaries;
>
> Uh. You are sure about that?
>
> Two primaries, and dont-care for fencing?
>
> You are aware that you just subscribed to data corruption, right?
>
> If you want two primaries, you MUST have proper fencing,
> on both the cluster level (stonith) and the drbd level (fencing
> resource-and-stonith; fence-peer handler: e.g. crm-fence-peer.sh).
>
> >         after-sb-0pri    discard-least-changes;
> >         after-sb-1pri    discard-secondary;
>
> And here you configure automatic data loss.
> Which is ok, as long as you are aware of that and actually mean it...
>
> >         after-sb-2pri    disconnect _is_default;
> >         rr-conflict      disconnect _is_default;
> >         ping-timeout     5 _is_default; # 1/10 seconds
> >     }
> >
> >     syncer {
> >         rate             102400k; # bytes/second
> >         after            -1 _is_default;
> >         al-extents       257;
> >     }
> >
> >     protocol C;
> >     _this_host {
> >         device    minor 0;
> >         disk      "/dev/cciss/c0d0p7";
> >         meta-disk internal;
> >         address   ipv4 172.17.5.152:7900;
> >     }
> >
> >     _remote_host {
> >         address   ipv4 172.17.5.151:7900;
> >     }
> >
> > In the list there is "timeout 60 _is_default; # 1/10 seconds".
>
> Then guess what, maybe the timeout did not trigger,
> because the peer was still "sort of" responsive?

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
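
For reference, a minimal drbd.conf sketch of the fencing setup Lars
describes for a dual-primary resource. This is an illustration only,
not from the thread: the resource name "r0" and the device/disk paths
are placeholders, and the handler paths are where DRBD's Pacemaker
integration scripts are commonly installed (they may differ per distro).

```
# Sketch: dual-primary DRBD with proper fencing (names/paths are placeholders).
resource r0 {
  protocol C;                        # required for two primaries

  net {
    allow-two-primaries;
  }

  disk {
    # Instead of the default "dont-care": freeze I/O on connection loss
    # and invoke the fence-peer handler before resuming.
    fencing resource-and-stonith;
  }

  handlers {
    # Scripts shipped with DRBD's Pacemaker integration.
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

As Lars notes, the DRBD-level fencing above is only half of it: stonith
must also be configured and enabled at the cluster level, so that an
"almost but not quite dead" node can actually be shot rather than left
half-responsive.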