Ok, I have both nodes correct and up. The problem still exists though. When
server-1 loses networking, it still thinks it's online and it also thinks
server-2 is offline. I am using ipfail in ha.cf. Is this the correct method
when using pacemaker/heartbeat and drbd?

Feb 27 17:13:17 server-1 heartbeat: [3056]: info: Link server-2:eth0 dead.
Feb 27 17:13:17 server-1 ipfail: [3142]: info: Link Status update: Link server-2/eth0 now has status dead

I thought it was supposed to ping the ping_group always_up_nodes 172.20.20.1
to determine if it's up. It doesn't seem to be doing that, or it is
disregarding it.

----- Original Message ----
From: Dan Gavenda <djgavenda2 at yahoo.com>
To: drbd-user at lists.linbit.com
Sent: Fri, February 25, 2011 3:20:17 PM
Subject: Re: [DRBD-user] Node w/ lost networking remains primary

Florian,

You were right.

cl_status hblinkstatus replicate-volume eth0
cl_status[14856]: 2011/02/25_15:17:03 ERROR: Cannot signon with heartbeat
cl_status[14856]: 2011/02/25_15:17:03 ERROR: REASON: hb_api_signon: Can't initiate connection to heartbeat

How do I correct this?

The /var/log/ha-log is filled with:

Feb 25 15:16:11 server-1 heartbeat: [11556]: ERROR: should_drop_message: attempted replay attack [server-1]? [gen = 1294354405, curgen = 1298668153]
Feb 25 15:16:11 server-1 heartbeat: [11556]: ERROR: should_drop_message: attempted replay attack [server-1]? [gen = 1294354405, curgen = 1298668153]

I'm using debian with drbd: 8.3.7 and pacemaker: 1.0.9.1+hg15626-1~bpo50+1

cat /etc/drbd.conf
global { usage-count yes; }
common { protocol C; }
resource replicate-volume {
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
  }
  startup {
    wfc-timeout 60;
  }
  syncer {
    rate 12M;
  }
  on server-1 {
    device /dev/drbd0;
    disk /dev/sdb1;
    address 172.20.20.220:7788;
    meta-disk internal;
  }
  on server-2 {
    device /dev/drbd0;
    disk /dev/sdb1;
    address 172.20.20.221:7788;
    meta-disk internal;
  }
}

node server-1 \
        attributes standby="off"
node server-2
primitive resDRBD ocf:heartbeat:drbd \
        params drbd_resource="replicate-volume" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
primitive resFilesystem ocf:heartbeat:Filesystem \
        params fstype="ext3" device="/dev/drbd0" directory="/replicate-volume" \
        meta target-role="Started" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
ms msDRBD resDRBD \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" globally_unique="false"
location locMaster msDRBD 100: server-1
colocation colFSDRBD inf: resFilesystem msDRBD:Master
order orderDRBDFS inf: msDRBD:promote resFilesystem:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="4" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"

----- Original Message ----
From: Florian Haas <florian.haas at linbit.com>
To: drbd-user at lists.linbit.com
Sent: Fri, February 11, 2011 1:38:32 PM
Subject: Re: [DRBD-user] Node w/ lost networking remains primary

On 02/11/2011 06:13 PM, Dan Gavenda wrote:
> Hi,
>
> I am a newbie to drbd/heartbeat. We have two servers/nodes and have it
> working w/ one exception. When the master loses networking, it stays
> primary. When it regains networking, it causes a 2pri split brain since
> the slave took primary.
>
> What is the best method to have the master change states when it loses
> networking? We have tried using dopd and ipfail which don't seem to do that
> probably due to lack of proper configuration. This is where help would be
> greatly appreciated.

Dan,

first of all: change your node names. I mean do it NOW. Name them joe and
jane, alice and bob, bert and ernie, statler and waldorf, whatever, but DO
NOT name them primary and secondary. And don't do that ever again.

Why? Good luck troubleshooting a DRBD issue at 3am where your node named
"secondary" is Primary, Diskless, its UpToDate node is Secondary, but, well,
it's named primary. Did I get you confused? Well it's not 3am and you're not
sleep deprived.

Secondly, as you're a newbie, please throw away your haresources
configuration straight away and install Pacemaker. You have nothing to gain
from learning how to do haresources configs, they're outdated and obsolete.
Do it right. You can continue to use heartbeat for cluster communications if
you prefer (and dopd for resource fencing), but do install Pacemaker.

> Failover works properly when the master is halted/rebooted. The problem
> happens only when it loses networking.
>
> Here are the configs.
> ======================
> /etc/ha.d/haresources
> primary 172.20.20.234 drbddisk::replicate-volume
> Filesystem::/dev/drbd0::/replicate-volume::ext3
>
> ======================
> /etc/ha.d/ha.cf
> debugfile /var/log/ha-debug
>
> logfile /var/log/ha-log
> logfacility local0
> keepalive 1
> deadtime 20
> warntime 5
> initdead 60
> udpport 694
> ucast eth0 172.20.20.35
> ucast eth0 172.20.20.235
> bcast eth1

man cl_status, look for "listhblinks" and "hblinkstatus". Figure out if
both your links are actually up and the nodes can see each other.

> auto_failback on
> node primary
> node secondary
> # ping_group always_up_nodes 172.20.20.1
> #respawn hacluster /usr/lib/heartbeat/ipfail
> #ping 172.20.20.1
> auto_failback off
> respawn hacluster /usr/lib/heartbeat/dopd

Is this a 32-bit system? Sure that path is correct for your platform?

> apiauth dopd gid=haclient uid=hacluster
> ======================
> /etc/drbd.conf
> global { usage-count yes; }
>
> common {
>   protocol C;
> }
>
>
> resource replicate-volume {
>   disk {
>     fencing resource-only;
>   }
>
>   handlers {
>     # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>
>     pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh";
>     pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh";
>     local-io-error "/usr/lib/drbd/notify-io-error.sh";
>     split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>     out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>
>     outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";

This should be "fence-peer" now; "outdate-peer" is a compat alias. What
DRBD version is this?

>   }
>
>   net {
>     after-sb-0pri discard-younger-primary;
>     after-sb-1pri discard-secondary;
>     after-sb-2pri call-pri-lost-after-sb;

Please, no. You're signing up for losing data after split brain. Leave
these at the defaults. You're emulating DRBD 0.7 behavior, which is the
wrong thing to do (DRBD has gotten much smarter since).

>   }
>
>   startup {
>     wfc-timeout 60;
>   }
>
>   syncer {
>     rate 12M;
>   }
>
>   on primary {
>     device /dev/drbd0;
>     disk /dev/sdb1;
>     address 172.20.20.35:7788;
>     meta-disk internal;
>   }
>
>   on secondary {
>     device /dev/drbd0;
>     disk /dev/sdb1;
>     address 172.20.20.235:7788;
>     meta-disk internal;
>   }
>
> }

Now, if all your links are actually up, then DRBD should do as you expect
(replication link dies, DRBD's Secondary node gets outdated, promotion
fails), and my current hunch is that your links are fishy. But, really,
please go back to square one and get this set up with Pacemaker, and once
that is set up test link failure.

Cheers,
Florian

_______________________________________________
drbd-user mailing list
drbd-user at lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
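
For illustration only, here is a rough sketch of how the resource section
might look with Florian's two drbd.conf comments applied to the later
server-1/server-2 configuration: "fence-peer" in place of the
"outdate-peer" alias, and no after-sb-* overrides so the defaults stay in
effect. It assumes heartbeat remains the cluster communication layer with
dopd running, and that the handler path quoted in the thread is valid on
the platform in question (Florian's 32-bit question was about the dopd
path and suggests checking). Nothing in the thread confirms that this was
the configuration finally used.

```
resource replicate-volume {
  disk {
    # Taken from the original post: outdate the peer when the
    # replication link is lost, without node-level fencing.
    fencing resource-only;
  }

  handlers {
    # "fence-peer" is the current name; "outdate-peer" is a compat alias.
    # Path copied from the thread; verify it on your platform.
    fence-peer "/usr/lib/heartbeat/drbd-peer-outdater";
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
  }

  # No net { after-sb-* } overrides: the defaults are left in place.

  startup {
    wfc-timeout 60;
  }

  syncer {
    rate 12M;
  }

  on server-1 {
    device /dev/drbd0;
    disk /dev/sdb1;
    address 172.20.20.220:7788;
    meta-disk internal;
  }

  on server-2 {
    device /dev/drbd0;
    disk /dev/sdb1;
    address 172.20.20.221:7788;
    meta-disk internal;
  }
}
```

As Florian notes, outdating the peer only works if the heartbeat links
themselves are healthy, so the link status checks with cl_status come
first.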