On 02/11/2011 06:13 PM, Dan Gavenda wrote:
> Hi,
>
> I am a newbie to drbd/heartbeat. We have two servers/nodes and have it working
> w/ one exception. When the master loses networking, it stays primary. When it
> regains networking, it causes a 2pri split brain since the slave took primary.
> What is the best method to have the master change states when it loses
> networking? We have tried using dopd and ipfail which don't seem to do that
> probably due to lack of proper configuration. This is where help would be
> greatly appreciated.

Dan,

first of all: change your node names. I mean do it NOW. Name them joe and
jane, alice and bob, bert and ernie, statler and waldorf, whatever, but
DO NOT name them primary and secondary. And don't do that ever again.

Why? Good luck troubleshooting a DRBD issue at 3am where your node named
"secondary" is Primary, Diskless, its UpToDate node is Secondary, but,
well, it's named primary. Did I get you confused? Well it's not 3am and
you're not sleep deprived.

Secondly, as you're a newbie, please throw away your haresources
configuration straight away and install Pacemaker. You have nothing to
gain from learning how to do haresources configs, they're outdated and
obsolete. Do it right. You can continue to use heartbeat for cluster
communications if you prefer (and dopd for resource fencing), but do
install Pacemaker.

> Failover works properly when the master is halted/rebooted. The problem
> happens only when it loses networking.
>
>
>
> Here are the configs.
> ======================
> /etc/ha.d/haresources
> primary 172.20.20.234 drbddisk::replicate-volume
> Filesystem::/dev/drbd0::/replicate-volume::ext3
>
> ======================
> /etc/ha.d/ha.cf
> debugfile /var/log/ha-debug
>
> logfile /var/log/ha-log
> logfacility local0
> keepalive 1
> deadtime 20
> warntime 5
> initdead 60
> udpport 694
> ucast eth0 172.20.20.35
> ucast eth0 172.20.20.235
> bcast eth1

man cl_status, look for "listhblinks" and "hblinkstatus".
Figure out if both your links are actually up and the nodes can see each
other.

> auto_failback on
> node primary
> node secondary
> # ping_group always_up_nodes 172.20.20.1
> #respawn hacluster /usr/lib/heartbeat/ipfail
> #ping 172.20.20.1
> auto_failback off
> respawn hacluster /usr/lib/heartbeat/dopd

Is this a 32-bit system? Sure that path is correct for your platform?

> apiauth dopd gid=haclient uid=hacluster
> ======================
> /etc/drbd.conf
> global { usage-count yes; }
>
> common {
>   protocol C;
> }
>
>
> resource replicate-volume {
>   disk {
>     fencing resource-only;
>   }
>
>   handlers {
>     # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>
>     pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh";
>     pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh";
>     local-io-error "/usr/lib/drbd/notify-io-error.sh";
>     split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>     out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>
>     outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";

This should be "fence-peer" now; "outdate-peer" is a compat alias. What
DRBD version is this?

>   }
>
>   net {
>     after-sb-0pri discard-younger-primary;
>     after-sb-1pri discard-secondary;
>     after-sb-2pri call-pri-lost-after-sb;

Please, no. You're signing up for losing data after split brain. Leave
these at the defaults. You're emulating DRBD 0.7 behavior, which is the
wrong thing to do (DRBD has gotten much smarter since).

>   }
>
>   startup {
>     wfc-timeout 60;
>   }
>
>   syncer {
>     rate 12M;
>   }
>
>   on primary {
>     device /dev/drbd0;
>     disk /dev/sdb1;
>     address 172.20.20.35:7788;
>     meta-disk internal;
>   }
>
>   on secondary {
>     device /dev/drbd0;
>     disk /dev/sdb1;
>     address 172.20.20.235:7788;
>     meta-disk internal;
>   }
>
> }

Now, if all your links are actually up, then DRBD should do as you expect
(replication link dies, DRBD's Secondary node gets outdated, promotion
fails), and my current hunch is that your links are fishy.
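
To make the inline comments on the quoted drbd.conf concrete, the resource section could look roughly like the sketch below. It is an illustration only, under a few assumptions: the DRBD version is one where "fence-peer" is the handler name (the version in use was not stated), "alice" and "bob" are placeholder host names that would have to match what `uname -n` prints on each node, and the notification handlers and the startup section from the quoted config are left out purely for brevity.

```
resource replicate-volume {
  disk {
    # resource-level fencing, as in the quoted config
    fencing resource-only;
  }

  handlers {
    # "fence-peer" instead of the compat alias "outdate-peer";
    # the path is the one from the quoted config and may differ per platform
    fence-peer "/usr/lib/heartbeat/drbd-peer-outdater";
  }

  # No net section with after-sb-* overrides:
  # the split-brain recovery policies stay at their defaults.

  syncer {
    rate 12M;
  }

  # placeholder host name, assumed to match `uname -n` on this node
  on alice {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   172.20.20.35:7788;
    meta-disk internal;
  }

  # placeholder host name, assumed to match `uname -n` on this node
  on bob {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   172.20.20.235:7788;
    meta-disk internal;
  }
}
```

With the policies left at their defaults, DRBD does not discard either node's data automatically after a split brain, which is the point of the "Please, no" comment on the net section.
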
But, really, please go back to square one and get this set up with
Pacemaker, and once that is set up, test link failure.

Cheers,
Florian