On Tue, Aug 21, 2012 at 03:40:34PM +0800, simon wrote:
> Hi Pascal,
>
> I can’t reproduce the error because the condition that it issues is
> very especially. The Master host is in the “not real dead” status.
> ( I doubt it is Linux’s panic). The TCP stack maybe is bad in Master
> host. Now I don’t want to avoid it because I can’t reproduce it. I
> only want to succeed to switch form Master to Slave so that my
> service can be supplied normally. But I can’t right to switch because
> of the 10 minutes delay of Drbd.

Well. If it was "not real dead", then I'd suspect that the DRBD
connection was still "sort of up", and thus DRBD saw the other node
as Primary still, and correctly refused to be promoted locally.

To have your cluster recover from an "almost but not quite dead node"
scenario, you need to add stonith aka node level fencing to your
cluster stack.

> I run “drbdsetup 0 show” on my host, it shows as following,
>
> disk {
>   size 0s _is_default; # bytes
>   on-io-error detach;
>   fencing dont-care _is_default;
>   max-bio-bvecs 0 _is_default;
> }
>
> net {
>   timeout 60 _is_default; # 1/10 seconds
>   max-epoch-size 2048 _is_default;
>   max-buffers 2048 _is_default;
>   unplug-watermark 128 _is_default;
>   connect-int 10 _is_default; # seconds
>   ping-int 10 _is_default; # seconds
>   sndbuf-size 0 _is_default; # bytes
>   rcvbuf-size 0 _is_default; # bytes
>   ko-count 0 _is_default;
>   allow-two-primaries;

Uh. You are sure about that?
Two primaries, and dont-care for fencing?

You are aware that you just subscribed to data corruption, right?

If you want two primaries, you MUST have proper fencing,
on both the cluster level (stonith) and the drbd level
(fencing resource-and-stonith; fence-peer handler: e.g. crm-fence-peer.sh).

>   after-sb-0pri discard-least-changes;
>   after-sb-1pri discard-secondary;

And here you configure automatic data loss.
Which is ok, as long as you are aware of that and actually mean it...

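For illustration only, the DRBD-level fencing described above might be
sketched as the following drbd.conf fragment. This is a sketch under
assumptions, not a tested configuration: the resource name "r0" is a
placeholder, the handler paths depend on how DRBD was packaged, and the
crm-unfence-peer.sh companion handler is an addition not mentioned in this
thread. The fencing option sits in the disk section here, matching where the
drbdsetup output above reports it.

```
resource r0 {                        # "r0" is a placeholder name
  disk {
    on-io-error detach;              # as in the output above
    fencing resource-and-stonith;    # replaces the default dont-care
  }
  handlers {
    # Paths are assumptions; check where your distribution installs them.
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  net {
    allow-two-primaries;             # only sensible together with fencing
  }
}
```

This covers the DRBD level only. Node-level fencing (stonith) still has to
be configured and enabled in the cluster manager itself.
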
>   after-sb-2pri disconnect _is_default;
>   rr-conflict disconnect _is_default;
>   ping-timeout 5 _is_default; # 1/10 seconds
> }
>
> syncer {
>   rate 102400k; # bytes/second
>   after -1 _is_default;
>   al-extents 257;
> }
>
> protocol C;
> _this_host {
>   device minor 0;
>   disk "/dev/cciss/c0d0p7";
>   meta-disk internal;
>   address ipv4 172.17.5.152:7900;
> }
>
> _remote_host {
>   address ipv4 172.17.5.151:7900;
> }
>
> In the list , there is “timeout 60 _is_default; # 1/10 seconds”.

Then guess what, maybe the timeout did not trigger,
because the peer was still "sort of" responsive?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed