[DRBD-user] Node w/ lost networking remains primary

Mon Feb 28 00:28:19 CET 2011

Ok, I have both nodes correct and up.  The problem still exists, though.  When
server-1 loses networking, it still thinks it's online and thinks
server-2 is offline.  I am using ipfail in ha.cf.  Is this the correct method
when using pacemaker/heartbeat and drbd?

Feb 27 17:13:17 server-1 heartbeat: [3056]: info: Link server-2:eth0 dead.
Feb 27 17:13:17 server-1 ipfail: [3142]: info: Link Status update: Link server-2/eth0 now has status dead

I thought it was supposed to ping the ping_group always_up_nodes 172.20.20.1 to
determine whether it's up.  It either isn't doing that or is disregarding the
result.
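
For comparison, a Pacemaker-managed cluster usually tracks connectivity with a ping resource and a location rule instead of ipfail, which as far as I know only works with haresources-style (non-CRM) configurations.  A rough sketch in crm syntax, assuming the ocf:pacemaker:ping agent is installed (older packages may only ship ocf:pacemaker:pingd) and reusing msDRBD and 172.20.20.1 from this thread; resPing, cloPing and locMasterConnected are made-up names:

```
# Sketch only: resPing, cloPing and locMasterConnected are hypothetical names.
# "pingd" is assumed to be the attribute name the ping agent writes by default.
primitive resPing ocf:pacemaker:ping \
        params host_list="172.20.20.1" multiplier="100" \
        op monitor interval="15s" timeout="60s"
clone cloPing resPing
location locMasterConnected msDRBD \
        rule $role="Master" -inf: not_defined pingd or pingd lte 0
```

The rule keeps the Master role off any node where the pingd attribute is missing or zero, that is, a node that cannot reach 172.20.20.1.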

----- Original Message ----
From: Dan Gavenda <djgavenda2 at yahoo.com>
To: drbd-user at lists.linbit.com
Sent: Fri, February 25, 2011 3:20:17 PM
Subject: Re: [DRBD-user] Node w/ lost networking remains primary

Florian,

You were right.  

cl_status hblinkstatus replicate-volume eth0
cl_status[14856]: 2011/02/25_15:17:03 ERROR: Cannot signon with heartbeat
cl_status[14856]: 2011/02/25_15:17:03 ERROR: REASON: hb_api_signon: Can't initiate connection  to heartbeat

How do I correct this?  The /var/log/ha-log is filled with:
Feb 25 15:16:11 server-1 heartbeat: [11556]: ERROR: should_drop_message: attempted replay attack [server-1]? [gen = 1294354405, curgen = 1298668153]
Feb 25 15:16:11 server-1 heartbeat: [11556]: ERROR: should_drop_message: attempted replay attack [server-1]? [gen = 1294354405, curgen = 1298668153]

I'm using debian with drbd: 8.3.7 and pacemaker: 1.0.9.1+hg15626-1~bpo50+1

cat /etc/drbd.conf
global { usage-count yes; }

common {
  protocol C;
}

resource replicate-volume {

  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
  }

  startup {
    wfc-timeout 60;
  }

  syncer {
    rate 12M;
  }

  on server-1 {
    device     /dev/drbd0;
    disk       /dev/sdb1;
    address    172.20.20.220:7788;
    meta-disk  internal;
  }

  on server-2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   172.20.20.221:7788;
    meta-disk internal;
  }

}

node server-1 \
attributes standby="off"
node server-2
primitive resDRBD ocf:heartbeat:drbd \
params drbd_resource="replicate-volume" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="100s"
primitive resFilesystem ocf:heartbeat:Filesystem \
params fstype="ext3" device="/dev/drbd0" directory="/replicate-volume" \
meta target-role="Started" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="100s"
ms msDRBD resDRBD \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" globally_unique="false"
location locMaster msDRBD 100: server-1
colocation colFSDRBD inf: resFilesystem msDRBD:Master
order orderDRBDFS inf: msDRBD:promote resFilesystem:start
property $id="cib-bootstrap-options" \
dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
cluster-infrastructure="openais" \
expected-quorum-votes="4" \
stonith-enabled="false" \
no-quorum-policy="ignore"

----- Original Message ----
From: Florian Haas <florian.haas at linbit.com>
To: drbd-user at lists.linbit.com
Sent: Fri, February 11, 2011 1:38:32 PM
Subject: Re: [DRBD-user] Node w/ lost networking remains primary

On 02/11/2011 06:13 PM, Dan Gavenda wrote:
> Hi, 
> 
> I am a newbie to drbd/heartbeat.  We have two servers/nodes and have it
> working w/ one exception.  When the master loses networking, it stays
> primary.  When it regains networking, it causes a 2pri split brain since
> the slave took primary.  What is the best method to have the master
> change states when it loses networking?  We have tried using dopd and
> ipfail, which don't seem to do that, probably due to lack of proper
> configuration.  This is where help would be greatly appreciated.

Dan,

first of all: change your node names. I mean do it NOW. Name them joe
and jane, alice and bob, bert and ernie, statler and waldorf, whatever,
but DO NOT name them primary and secondary. And don't do that ever again.

Why? Good luck troubleshooting a DRBD issue at 3am where your node named
"secondary" is Primary, Diskless, its UpToDate node is Secondary, but,
well, it's named primary. Did I get you confused? Well it's not 3am and
you're not sleep deprived.

Secondly, as you're a newbie, please throw away your haresources
configuration straight away and install Pacemaker. You have nothing to
gain from learning how to do haresources configs, they're outdated and
obsolete. Do it right. You can continue to use heartbeat for cluster
communications if you prefer (and dopd for resource fencing), but do
install Pacemaker.

> Failover works properly when the master is halted/rebooted.  The problem  
> happens only when it loses networking.
> 
> 
> 
> Here are the configs. 
> ====================== 
> /etc/ha.d/haresources 
> primary 172.20.20.234 drbddisk::replicate-volume  
> Filesystem::/dev/drbd0::/replicate-volume::ext3 
> 
> ====================== 
> /etc/ha.d/ha.cf 
> debugfile /var/log/ha-debug 
> 
>     logfile /var/log/ha-log 
>     logfacility     local0 
>     keepalive 1 
>     deadtime 20 
>     warntime 5 
>     initdead 60 
>     udpport 694 
>     ucast eth0 172.20.20.35 
>     ucast eth0 172.20.20.235 
>         bcast eth1

man cl_status, look for "listhblinks" and "hblinkstatus". Figure out if
both your links are actually up and the nodes can see each other.
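
Something along these lines prints the state of each heartbeat link to a peer; a sketch, assuming cl_status can sign on to a running heartbeat, with check_hb_links being a made-up helper name:

```shell
# Hypothetical helper: print "<node> <link> <status>" for every heartbeat
# link that cl_status reports for the given peer node name(s).
check_hb_links() {
    for node in "$@"; do
        # listhblinks prints one link name per line for the given node
        for link in $(cl_status listhblinks "$node"); do
            status=$(cl_status hblinkstatus "$node" "$link")
            printf '%s %s %s\n' "$node" "$link" "$status"
        done
    done
}
```

Call it with the peer's node name as it appears in ha.cf (a node name, not a DRBD resource name); every link should normally report up.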

>     auto_failback on 
>     node primary 
>     node secondary 
> #        ping_group always_up_nodes 172.20.20.1 
> #respawn hacluster /usr/lib/heartbeat/ipfail 
> #ping 172.20.20.1 
> auto_failback off 
> respawn hacluster /usr/lib/heartbeat/dopd 

Is this a 32-bit system? Sure that path is correct for your platform?

> apiauth dopd gid=haclient uid=hacluster 
> ====================== 
> /etc/drbd.conf 
> global { usage-count yes; } 
> 
> common { 
>   protocol C; 
> } 
> 
> 
> resource replicate-volume { 
>   disk { 
>     fencing   resource-only; 
>   } 
> 
>   handlers { 
> #    split-brain "/usr/lib/drbd/notify-split-brain.sh root"; 
> 
>      pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh"; 
>      pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh"; 
>      local-io-error "/usr/lib/drbd/notify-io-error.sh"; 
>      split-brain "/usr/lib/drbd/notify-split-brain.sh root"; 
>      out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root"; 
> 
>      outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater"; 

This should be "fence-peer" now; "outdate-peer" is a compat alias. What
DRBD version is this?
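
With the current keyword the handler line would look roughly like this; the script path is copied unchanged from the quoted config and may differ on your platform:

```
handlers {
    # fence-peer is the current keyword; outdate-peer is kept as a compat alias.
    # Path taken as-is from the quoted config; verify it exists on your system.
    fence-peer "/usr/lib/heartbeat/drbd-peer-outdater";
}
```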

>   } 
> 
>   net { 
>     after-sb-0pri discard-younger-primary; 
>     after-sb-1pri discard-secondary; 
>     after-sb-2pri call-pri-lost-after-sb; 

Please, no. You're signing up for losing data after split brain. Leave
these at the defaults. You're emulating DRBD 0.7 behavior, which is the
wrong thing to do (DRBD has gotten much smarter since).

>   } 
> 
>   startup { 
>     wfc-timeout 60; 
>   } 
> 
> syncer { 
>     rate 12M; 
> } 
> 
>   on primary { 
>     device     /dev/drbd0; 
>     disk       /dev/sdb1; 
>     address    172.20.20.35:7788; 
>     meta-disk  internal; 
>   } 
> 
>   on secondary { 
>     device    /dev/drbd0; 
>     disk      /dev/sdb1; 
>     address   172.20.20.235:7788; 
>     meta-disk internal; 
>   } 
> 
> } 

Now, if all your links are actually up, then DRBD should do as you
expect (replication link dies, DRBD's Secondary node gets outdated,
promotion fails), and my current hunch is that your links are fishy.
But, really, please go back to square one and get this set up with
Pacemaker, and once that is set up test link failure.
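
For the link-failure test, it helps to capture the DRBD state on both nodes before and after pulling the replication link.  A small sketch, assuming the /proc/drbd layout of DRBD 8.3 (cs:, ro: and ds: fields on the per-device line); drbd_state_summary is a made-up helper name:

```shell
# Hypothetical helper: read /proc/drbd-style text on stdin and print
# "<minor>: <connection state> <roles> <disk states>" per device line.
drbd_state_summary() {
    awk '/ cs:/ {
        cs = ro = ds = "?"                      # reset for each device line
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^cs:/) cs = substr($i, 4) # connection state
            if ($i ~ /^ro:/) ro = substr($i, 4) # local/peer role
            if ($i ~ /^ds:/) ds = substr($i, 4) # local/peer disk state
        }
        print $1, cs, ro, ds
    }'
}
```

Run `drbd_state_summary < /proc/drbd` on each node.  If fencing works as described above, the surviving Primary should end up showing something like WFConnection with the peer's disk marked Outdated.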

Cheers,
Florian
