Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Thu, Mar 11, 2010 at 02:34:27PM +0100, Mikkel Raakilde Jakobsen wrote:
> Hi,

Hi Mikkel, how is life?

> We have the following setup:
>
> Two physical servers installed with DRBD 8.3.2 and Heartbeat 2.1.3 on
> CentOS 5.4. Everything installed via official RPM packages in CentOS'
> repositories.

So much for "official"...
Please use heartbeat 3.0 with pacemaker, and drbd 8.3.7.

> They have two bonded direct links between them for DRBD replication, and
> two other bonded links for all other traffic (management, iSCSI etc.)
>
> We can do hb_takeover from host to host without any issues.

Ah, you are still doing haresources mode.
Well, if it works for you, fine.
If you want to get up to speed with heartbeat 3.0 + pacemaker,
we are also doing workshops, you know? ;)

> When we power off the primary host,

Make sure heartbeat is stopped first, drbd is stopped some time later,
and the network is stopped last. Just double-check that, please.

> the other host tries to take over,
> but never succeeds.
>
> We see the following lines in the log several times, until heartbeat
> gives up, and goes standby again:
>
> block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code
> 0 (0x0)
> block drbd0: fence-peer helper broken, returned 0
> block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
>
> After the "failed" node gets powered on again, they are in a split-brain
> condition.

If the other one never took over, as you say,
how do they manage to diverge?

> We have tried compiling the latest DRBD and Heartbeat and using those,
> but the error is the same.
>
> Here is our drbd.conf:
>
> resource r0 {
>     protocol C;
>
>     startup { wfc-timeout 0; }
>
>     disk {
>         on-io-error detach;
>         no-disk-barrier;
>         no-disk-flushes;
>         no-md-flushes;
>         fencing resource-only;

Maybe actually _configure_ a fence-peer handler?
If you opt for dopd, you need to also start it...
(see the configuration sketch after this mail)

>     }
>
>     net {
>         max-buffers    20000;
>         max-epoch-size 20000;
>         sndbuf-size    1M;
>     }
>
>     syncer {
>         rate       2000M;
>         al-extents 1201;
>     }
>
>     on server1 {
>         device    /dev/drbd0;
>         disk      /dev/dm-1;
>         address   172.16.0.127:7788;
>         meta-disk internal;
>     }
>
>     on server2 {
>         device    /dev/drbd0;
>         disk      /dev/dm-1;
>         address   172.16.0.227:7788;
>         meta-disk internal;
>     }
> }
>
> Here is our ha.cf:
>
> use_logd yes
> keepalive 1
> deadtime 10
> warntime 10
> initdead 20
> udpport 694
> ucast bond0.20 10.0.0.127
> auto_failback off
> node server1 server2
> uuidfrom nodename
> respawn hacluster /usr/lib/heartbeat/ipfail
> ping 10.0.0.1
> deadping 20
>
> How can we solve this problem?
>
> Best Regards,
>
> Mikkel R. Jakobsen
> Systems Consultant
> DANSUPPORT A/S

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

__
please don't Cc me, but send to list -- I'm subscribed
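
[Configuration sketch referenced above: what a dopd-based fence-peer setup looks like,
roughly along the lines of the DRBD 8.3 documentation for heartbeat clusters. A sketch
only; the helper path may be /usr/lib64/heartbeat on 64-bit CentOS.]

    # drbd.conf: name the helper that DRBD calls when it needs to fence the peer
    resource r0 {
        disk {
            fencing resource-only;
        }
        handlers {
            # dopd's outdater helper, shipped with heartbeat
            fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
        }
        ...
    }

    # ha.cf: dopd must actually be running, so have heartbeat respawn it
    respawn hacluster /usr/lib/heartbeat/dopd
    apiauth dopd gid=haclient uid=hacluster

With "fencing resource-only;" but no working fence-peer handler, DRBD calls the helper,
gets no usable answer ("fence-peer helper broken, returned 0") and refuses to promote the
disconnected node, which would explain the repeated log lines and the failed takeover above.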
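
[If they follow the advice to move to heartbeat 3.0 + pacemaker, the DRBD documentation
for pacemaker clusters describes the equivalent setup with the crm-based handlers shipped
with drbd 8.3; again only a sketch:]

    resource r0 {
        disk {
            fencing resource-only;
        }
        handlers {
            # set/remove a constraint in the pacemaker CIB instead of using dopd
            fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        ...
    }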