[DRBD-user] Node w/ lost networking remains primary

Florian Haas florian.haas at linbit.com
Fri Feb 11 20:38:32 CET 2011



On 02/11/2011 06:13 PM, Dan Gavenda wrote:
> Hi,
> 
> I am a newbie to drbd/heartbeat. We have two servers/nodes and have it working
> w/ one exception. When the master loses networking, it stays primary. When it
> regains networking, it causes a 2pri split brain, since the slave took primary.
> What is the best method to have the master change states when it loses
> networking? We have tried using dopd and ipfail, which don't seem to do that,
> probably due to lack of proper configuration. This is where help would be
> greatly appreciated.

Dan,

first of all: change your node names. I mean do it NOW. Name them joe
and jane, alice and bob, bert and ernie, statler and waldorf, whatever,
but DO NOT name them primary and secondary. And don't do that ever again.

Why? Good luck troubleshooting a DRBD issue at 3am when your node named
"secondary" is Primary and Diskless while its UpToDate peer is
Secondary, but, well, named "primary". Did I get you confused? And it's
not even 3am, and you're not sleep deprived.

Secondly, as you're a newbie, please throw away your haresources
configuration straight away and install Pacemaker. You have nothing to
gain from learning how to do haresources configs, they're outdated and
obsolete. Do it right. You can continue to use heartbeat for cluster
communications if you prefer (and dopd for resource fencing), but do
install Pacemaker.
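
To give you an idea of what you'd be moving to, here's a rough sketch
of an equivalent Pacemaker configuration in crm shell syntax, built
from the configs you posted. The resource names are made up and the
netmask on the IP is a guess; adjust to your environment:

  # DRBD master/slave resource (names are placeholders)
  primitive p_drbd_replicate ocf:linbit:drbd \
    params drbd_resource="replicate-volume" \
    op monitor interval="15s" role="Master" \
    op monitor interval="30s" role="Slave"
  ms ms_drbd_replicate p_drbd_replicate \
    meta master-max="1" master-node-max="1" \
      clone-max="2" clone-node-max="1" notify="true"
  # filesystem and service IP, grouped so they move together
  primitive p_fs ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/replicate-volume" fstype="ext3"
  primitive p_ip ocf:heartbeat:IPaddr2 \
    params ip="172.20.20.234" cidr_netmask="24"
  group g_services p_fs p_ip
  # run the services where DRBD is Master, and only after promotion
  colocation c_services_on_master inf: g_services ms_drbd_replicate:Master
  order o_drbd_before_services inf: ms_drbd_replicate:promote g_services:start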

> Failover works properly when the master is halted/rebooted. The problem
> happens only when it loses networking.
> 
> Here are the configs.
> ======================
> /etc/ha.d/haresources
> primary 172.20.20.234 drbddisk::replicate-volume Filesystem::/dev/drbd0::/replicate-volume::ext3
> 
> ====================== 
> /etc/ha.d/ha.cf
> debugfile /var/log/ha-debug
> logfile /var/log/ha-log
> logfacility local0
> keepalive 1
> deadtime 20
> warntime 5
> initdead 60
> udpport 694
> ucast eth0 172.20.20.35
> ucast eth0 172.20.20.235
> bcast eth1

man cl_status, look for "listhblinks" and "hblinkstatus". Figure out if
both your links are actually up and the nodes can see each other.
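
Something along these lines, run on one node with the peer's node name
(the output is sketched from memory; you want both links listed and
reported "up"):

  # cl_status listhblinks <peer-node>
  eth0
  eth1
  # cl_status hblinkstatus <peer-node> eth0
  up
  # cl_status hblinkstatus <peer-node> eth1
  up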


> auto_failback on
> node primary
> node secondary
> # ping_group always_up_nodes 172.20.20.1
> # respawn hacluster /usr/lib/heartbeat/ipfail
> # ping 172.20.20.1
> auto_failback off
> respawn hacluster /usr/lib/heartbeat/dopd

Is this a 32-bit system? Are you sure that path is correct for your
platform?
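
Quick sanity check (the paths are guesses, they depend on your
distro's packaging; on 64-bit systems these often live under
/usr/lib64/heartbeat instead):

  # see which of the two candidate locations actually exists
  ls -l /usr/lib/heartbeat/dopd /usr/lib64/heartbeat/dopd 2>/dev/null
  # same question for the outdater you reference further down
  ls -l /usr/lib/heartbeat/drbd-peer-outdater \
        /usr/lib64/heartbeat/drbd-peer-outdater 2>/dev/null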

> apiauth dopd gid=haclient uid=hacluster 
> ====================== 
> /etc/drbd.conf 
> global { usage-count yes; } 
> 
> common { 
>   protocol C; 
> } 
> 
> 
> resource replicate-volume { 
>   disk { 
>     fencing   resource-only; 
>   } 
> 
>   handlers { 
> #    split-brain "/usr/lib/drbd/notify-split-brain.sh root"; 
> 
>      pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh"; 
>      pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh"; 
>      local-io-error "/usr/lib/drbd/notify-io-error.sh"; 
>      split-brain "/usr/lib/drbd/notify-split-brain.sh root"; 
>      out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root"; 
> 
>      outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater"; 

This should be "fence-peer" now; "outdate-peer" is a compat alias. What
DRBD version is this?
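
Assuming a reasonably recent 8.x, the dopd-related bits would look
roughly like this (the -t 5 timeout is just an example value):

  resource replicate-volume {
    disk {
      fencing resource-only;
    }
    handlers {
      # "fence-peer" is the current keyword; "outdate-peer" still
      # works as a compatibility alias
      fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    }
    # ... rest of the resource section unchanged
  }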

>   } 
> 
>   net { 
>     after-sb-0pri discard-younger-primary; 
>     after-sb-1pri discard-secondary; 
>     after-sb-2pri call-pri-lost-after-sb; 

Please, no. You're signing up to lose data after a split brain. Leave
these at the defaults. You're emulating DRBD 0.7 behavior, which is the
wrong thing to do (DRBD has gotten much smarter since).
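
For reference, the shipped defaults amount to this, so simply deleting
your three after-sb lines gets you there:

  net {
    # on split brain: disconnect and wait for the admin, rather than
    # automatically discarding one node's data
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
  }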

>   } 
> 
>   startup { 
>     wfc-timeout 60; 
>   } 
> 
>   syncer {
>     rate 12M;
>   }
> 
>   on primary { 
>     device     /dev/drbd0; 
>     disk       /dev/sdb1; 
>     address    172.20.20.35:7788; 
>     meta-disk  internal; 
>   } 
> 
>   on secondary { 
>     device    /dev/drbd0; 
>     disk      /dev/sdb1; 
>     address   172.20.20.235:7788; 
>     meta-disk internal; 
>   } 
> 
> } 

Now, if all your links are actually up, then DRBD should do as you
expect (the replication link dies, DRBD's Secondary node gets outdated,
promotion fails), and my current hunch is that your links are fishy.
But really, please go back to square one and get this set up with
Pacemaker, and once that is done, test link failure.
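
When you do test, you don't have to pull cables; something like this on
one node simulates a dead replication link (assumes the port 7788 from
your config):

  # drop DRBD traffic in both directions
  iptables -A INPUT -p tcp --dport 7788 -j DROP
  iptables -A OUTPUT -p tcp --dport 7788 -j DROP
  # watch the connection state on both nodes; with dopd working, the
  # Secondary should end up Outdated and refuse promotion
  watch cat /proc/drbd
  # remove the rules when done
  iptables -D INPUT -p tcp --dport 7788 -j DROP
  iptables -D OUTPUT -p tcp --dport 7788 -j DROP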

Cheers,
Florian


