[DRBD-user] fence-peer helper broken, returned 0

Thu Mar 11 16:16:30 CET 2010

Hello,

> please use heartbeat 3.0 with pacemaker, and drbd 8.3.7

Okay, I will read up on it and give it another try. I was running Heartbeat in V1 mode when I tried it.

> if you want to get up to speed with heartbeat 3.0 + pacemaker,
> we are also doing workshops, you know? ;)

Very interesting indeed! It would be very nice if you could send me a separate e-mail with information about those. :)

> make sure heartbeat is stopped first,
> drbd is stopped some time later,
> network is stopped last.

I'm sorry, I left out some information in my first post: we are running an HA crash test by powering off the primary host, so there is no orderly shutdown of heartbeat, drbd and the network.

> If the other one never took over, as you say,
> how do they manage to diverge?

Both nodes end up StandAlone and Secondary. The log says a split brain was detected, but I doubt the data has actually diverged, since the standby host never became Primary.

Best Regards,

Mikkel R. Jakobsen
Systems Consultant
DANSUPPORT A/S

-----Original Message-----
From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-bounces at lists.linbit.com] On behalf of Lars Ellenberg
Sent: 11 March 2010 15:31
To: drbd-user at lists.linbit.com
Subject: Re: [DRBD-user] fence-peer helper broken, returned 0

On Thu, Mar 11, 2010 at 02:34:27PM +0100, Mikkel Raakilde Jakobsen wrote:
> Hi,

Hi Mikkel,
how is life?

> We have the following setup:
> 
> Two physical servers installed with DRBD 8.3.2 and Heartbeat 2.1.3 on
> CentOS 5.4. Everything is installed from the official RPM packages in
> the CentOS repositories.

so much for "official"...
please use heartbeat 3.0 with pacemaker, and drbd 8.3.7.

> They have two bonded direct links between them for DRBD replication, and
> two other bonded links for all other traffic (management, iSCSI etc.)
> 
> We can do hb_takeover from host to host without any issues.

Ah, you are still doing haresources mode.
Well, if it works for you, fine.

if you want to get up to speed with heartbeat 3.0 + pacemaker,
we are also doing workshops, you know? ;)

> When we power off the primary host,

make sure heartbeat is stopped first,
drbd is stopped some time later,
network is stopped last.

just double check that, please.
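For an orderly shutdown (it does not apply when the power is simply cut), one way to double-check this ordering on a SysV-init system such as CentOS 5 is to look at the K* (kill) scripts in the halt runlevel directory, which rc runs in lexical order. The Python sketch below is hypothetical: the helper names stop_order and shutdown_order_ok, and the assumption that the scripts are named K<NN>heartbeat, K<NN>drbd and K<NN>network, are ours and not taken from the packages.

```python
import re

# Assumption: SysV rc runs the K* scripts of a runlevel directory (for example
# /etc/rc.d/rc0.d on CentOS 5) in lexical order, so K05foo stops before K90bar.
KILL_SCRIPT = re.compile(r"^K\d{2}(.+)$")


def stop_order(filenames):
    """Return service names in the order their K* scripts would be run."""
    kill_scripts = []
    for name in filenames:
        match = KILL_SCRIPT.match(name)
        if match:
            kill_scripts.append((name, match.group(1)))
    # Sorting by full filename mimics the glob order used by the rc script.
    kill_scripts.sort()
    return [service for _, service in kill_scripts]


def shutdown_order_ok(filenames, wanted=("heartbeat", "drbd", "network")):
    """True if every service in `wanted` has a K script and they stop in that order."""
    order = stop_order(filenames)
    try:
        positions = [order.index(service) for service in wanted]
    except ValueError:
        return False  # at least one service has no kill script in this runlevel
    return positions == sorted(positions)
```

On a real box you might call shutdown_order_ok(os.listdir("/etc/rc.d/rc0.d")), and likewise for rc6.d (reboot); the K numbers your packages actually install may differ.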

> the other host tries to take over,
> but never succeeds.
> 
> We see the following lines in the log several times, until heartbeat
> gives up and goes back to standby:
> 
> block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 0 (0x0)
> block drbd0: fence-peer helper broken, returned 0
> block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
> 
> After the "failed" node gets powered on again, they are in a split-brain
> condition.

If the other one never took over, as you say,
how do they manage to diverge?
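The "fence-peer helper broken, returned 0" line is what DRBD logs when the handler's exit code is not one it recognizes. To the best of our reading of the DRBD 8.3 kernel source (drbd_nl.c), only exit codes 3 to 7 carry a meaning, and 0 falls through to the "broken" case; with no fence-peer handler configured, drbdadm presumably has nothing to run and simply exits 0. The Python sketch below mirrors that mapping. The function name is hypothetical and the descriptions are paraphrased, not verbatim kernel messages (only the "broken" string matches the log above).

```python
# Exit codes a DRBD 8.3 fence-peer handler is expected to return, as we read
# the kernel module source; descriptions are paraphrased.
FENCE_PEER_EXIT = {
    3: "peer is inconsistent or worse",
    4: "peer was fenced (outdated)",
    # Code 5 only marks the peer outdated if the local disk is UpToDate.
    5: "peer is unreachable, assumed to be dead",
    6: "peer is active, so the local node outdates itself",
    7: "peer was stonithed",
}


def interpret_fence_peer(wait_status: int) -> str:
    """Decode a raw helper wait status the way DRBD does: (status >> 8) & 0xff.

    Any exit code not listed above, including 0, is treated as a broken helper.
    """
    code = (wait_status >> 8) & 0xFF
    meaning = FENCE_PEER_EXIT.get(code)
    if meaning is None:
        return f"fence-peer helper broken, returned {code}"
    return meaning


if __name__ == "__main__":
    # The log in this thread shows "exit code 0 (0x0)", i.e. a raw status of 0.
    print(interpret_fence_peer(0x0))  # prints: fence-peer helper broken, returned 0
```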

> We have tried compiling the latest DRBD and Heartbeat and using those,
> but the error is the same.
> 
> Here is our drbd.conf:
> resource r0 {
>         protocol C;
> 
>         startup { wfc-timeout 0; }
> 
>         disk { on-io-error detach;
>                 no-disk-barrier;
>                 no-disk-flushes;
>                 no-md-flushes;
>                 fencing resource-only;

Maybe actually _configure_ a fence-peer handler (in a handlers section)?
If you opt for dopd, you also need to start it...

>         }
> 
>         net {
>                 max-buffers 20000;
>                 max-epoch-size 20000;
>                 sndbuf-size 1M;
>         }
> 
>         syncer { rate 2000M;
>                  al-extents 1201; }
> 
>         on server1 {
>                 device /dev/drbd0;
>                 disk /dev/dm-1;
>                 address 172.16.0.127:7788;
>                 meta-disk internal;
>         }
> 
>         on server2 {
>                 device /dev/drbd0;
>                 disk /dev/dm-1;
>                 address 172.16.0.227:7788;
>                 meta-disk internal;
>         }
> }
> 
> Here is our ha.cf:
> use_logd        yes
> keepalive       1
> deadtime        10
> warntime        10
> initdead        20
> udpport         694
> ucast           bond0.20 10.0.0.127
> auto_failback   off
> node            server1 server2
> 
> uuidfrom        nodename
> respawn hacluster /usr/lib/heartbeat/ipfail
> ping            10.0.0.1
> deadping        20
> 
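On the missing fence-peer handler: the drbd.conf above sets "fencing resource-only;" but has no handlers section, and the ha.cf above respawns ipfail but not dopd. A dopd setup along the lines of the DRBD 8.3 User's Guide adds fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5"; inside a handlers { } section of the resource, plus "respawn hacluster /usr/lib/heartbeat/dopd" and "apiauth dopd gid=haclient uid=hacluster" to ha.cf (the paths may be under /usr/lib64 on 64-bit installs). The rough Python check below, with a hypothetical function name, flags a config that requests fencing without defining a handler. It is a regex sketch under those assumptions, not a drbd.conf parser.

```python
import re


def missing_fence_peer_handler(conf_text: str) -> bool:
    """Return True if a fencing policy is set but no fence-peer handler is defined.

    Rough regex check on drbd.conf text: it does not parse nested sections
    and ignores which resource a setting belongs to.
    """
    # Strip '#' comments so commented-out lines don't count.
    text = re.sub(r"#.*", "", conf_text)
    fencing = re.search(r"\bfencing\s+([\w-]+)\s*;", text)
    if fencing is None or fencing.group(1) == "dont-care":
        return False  # no fencing requested, so no handler is needed
    return re.search(r'\bfence-peer\s+"[^"]+"\s*;', text) is None
```

Running it over the output of drbdadm dump, rather than the raw file, would also cover settings pulled in from included files.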
> 
> How can we solve this problem?
> 
> 
> Best Regards,
> 
> Mikkel R. Jakobsen
> Systems Consultant
> DANSUPPORT A/S

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed