[DRBD-user] Can't become primary when peer goes offline.

Wed Jul 16 17:52:03 CEST 2014

On Mon, Jul 14, 2014 at 05:09:42PM -0400, Michael Monette wrote:
> I have been having this really odd issue and I can't seem to figure it out. I have tried everything I can think of and I have compared it to all my other working DRBD setups and just cannot get this thing to work. 
> 
> node-1 is primary, /dev/drbd1 is mounted at /opt
> node-2 is secondary
> both are UpToDate
> 
> I shut down node-1, try to make node-2 primary, and receive this error:
> 
> 1: State change failed: (-7) Refusing to be Primary while peer is not outdated
> Command 'drbdsetup primary 1' terminated with exit code 11
> 
> Also check out this one as well:
> 
> node-1 is primary, /dev/drbd1 is mounted at /opt
> node-2 is secondary 
> both are UpToDate (same as before)
> 
> This time, I shut down node-2 (secondary). Everything is fine and node-1 continues to run normally. I unmount /dev/drbd1, demote node-1 to secondary, and immediately try to promote it back to primary:
> 
> umount /dev/drbd1
> drbdadm secondary all; drbdadm primary all # I ran these commands on one line so it switches as quickly as possible.
> 1: State change failed: (-7) Refusing to be Primary while peer is not outdated
> Command 'drbdsetup primary 1' terminated with exit code 11
> 
> iptables is off, SELinux is off. The resource was running fine as primary a moment earlier, so why can't I make it secondary and then primary again? Out of the 30+ times I have set this up, I have never encountered this problem.
> 
> When either of the peers goes offline, cat /proc/drbd shows:
> 
> # cat /proc/drbd
> version: 8.4.4 (api:1/proto:86-101)
> GIT-hash: 599f286440bd633d15d5ff985204aff4bccffadd build by phil at Build64R6, 2013-10-14 15:33:06
> 
>  1: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----
>     ns:0 nr:0 dw:0 dr:664 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 
> If I restart DRBD and abort the timeout on the surviving node, it changes to this:
> 
> # cat /proc/drbd
> version: 8.4.4 (api:1/proto:86-101)
> GIT-hash: 599f286440bd633d15d5ff985204aff4bccffadd build by phil at Build64R6, 2013-10-14 15:33:06
> 
>  1: cs:WFConnection ro:Secondary/Unknown ds:Consistent/DUnknown C r-----
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 
> Here is my config:
> 
> ##########
> 
> resource r0 {
> protocol C;
> net {
>         cram-hmac-alg sha1;
>         shared-secret "pazzwurd1";
>         max-epoch-size 512;
>         sndbuf-size 0;
>     }
> startup {
>         wfc-timeout 30;
>         outdated-wfc-timeout 20;
>         degr-wfc-timeout 30;
>     }
> disk {
>         on-io-error detach;
>         fencing resource-only;
>     }
> syncer {
> rate 100M;
> }
> handlers {
>         fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>         after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>     }
> volume 0 {
> device /dev/drbd1;
> disk /dev/mapper/vg_ottppencrzdb1-lv_pgsql;
> meta-disk internal;
> }
> on db-node-1.myco.com {
> address 172.16.99.1:7789;
> }
> on db-node-2.myco.com {
> address 172.16.99.2:7789;
> }
> }
> 
> ##########
> 
> 
> I have tried to remove the fencing handlers and it did not help.
> I haven't even gotten to the pacemaker stage yet anyway.

There. *That* is your problem.
A fence-peer-handler that uses pacemaker
cannot possibly work without pacemaker.

I would think that you should find loud complaints
about that in your system logs.
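
A minimal sketch of one way to look for those complaints. DRBD runs
in the kernel, so its messages normally reach the kernel log and,
from there, the syslog file. The default path /var/log/messages is an
assumption (it is the usual location on RHEL/CentOS 6), and the
function name show_fence_messages is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print log lines from DRBD that mention the
# fence-peer handler.
# Usage: show_fence_messages [logfile]
# The default log path is an assumption; adjust it for your distribution.
show_fence_messages() {
    logfile="${1:-/var/log/messages}"
    # First keep only DRBD lines, then only those mentioning fence-peer.
    # Both matches are case-insensitive.
    grep -i 'drbd' "$logfile" | grep -i 'fence-peer'
}
```

Calling show_fence_messages with no argument searches the assumed
default file. Running "dmesg | grep -i fence-peer" should show the
same kernel messages if they are still in the ring buffer.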

If you tell DRBD to use a fence handler,
that handler has to report success,
or you get the above behavior,
by design and configuration.
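
To illustrate what "report success" means here: DRBD judges the
fence-peer handler by its exit code. The sketch below lists the codes
as DRBD 8.4 is understood to interpret them; treat the wording as an
assumption and verify it against the documentation for your version.
The function name describe_fence_exit is hypothetical and not part of
DRBD.

```shell
#!/bin/sh
# Illustrative helper (not part of DRBD): describe how DRBD 8.4 is
# understood to interpret a fence-peer handler exit code.
# Usage: describe_fence_exit <exit-code>
describe_fence_exit() {
    case "$1" in
        3) echo "peer is inconsistent" ;;
        4) echo "peer is outdated" ;;
        5) echo "peer is unreachable and assumed down" ;;
        6) echo "peer is primary; the local node outdates itself" ;;
        7) echo "peer was stonithed" ;;
        # Any other code counts as a failed or broken handler, and the
        # peer's disk state stays unknown.
        *) echo "handler failed; peer state stays unknown" ;;
    esac
}
```

For example, "describe_fence_exit 1" falls through to the last
branch, which is the situation that leads to "Refusing to be Primary
while peer is not outdated". Removing only the handlers presumably
did not help because "fencing resource-only" still tells DRBD to call
a fence-peer handler. Until Pacemaker is configured, one option is to
set the fencing policy back to "dont-care", the default in 8.4.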

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed