Hi Martyn,

To fix connectivity issues with DRBD, open two SSH sessions, one to each
node, and in each run:

    watch cat /proc/drbd

This lets you monitor the status of the nodes as they attempt to
reconnect. The node that believes it is secondary should show something
like:

    0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown

and the primary should look like this:

    0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r---

If you are using heartbeat to control your DRBD you should stop it first.

On both nodes type (you can use the resource name in place of "all" here
and below if you are running more than one DRBD device and only one is
broken):

    drbdadm down all
    drbdadm up all

Both nodes will probably report that they are in a secondary state. Now
make one node primary (the one that you believe has the latest data, or
the one that previously reported that it was primary):

    drbdadm primary all

and then on both nodes:

    drbdadm connect all

If that does not work you will have to outdate the secondary node. On the
secondary:

    drbdadm outdate all

and then try the connection again on both nodes:

    drbdadm connect all

If this does not work you should invalidate the secondary node
(drbdadm invalidate all) and retry the connection.

If at this point you are still unable to get the nodes to talk to each
other, check for a split-brain situation. Run:

    dmesg | grep drbd

and look along the last few lines for:

    drbd0: Split-Brain detected, dropping connection!

If this is there you will have to sacrifice the data on one of the nodes.
Choose the node that you feel is incorrect (if you followed the above it
is your secondary node) and run:

    drbdadm -- --discard-my-data connect all

and on the primary:

    drbdadm connect all
    drbdadm primary all

You should see that both nodes connect and are syncing again.

If you are using heartbeat you will have to get the cluster back into its
correct config. On both nodes:

    drbdadm down all
    service drbd stop
    service heartbeat start

DRBD will be stopped and restarted by heartbeat. It will take some time
for heartbeat to restart depending on your timeout settings, but once it
comes back up you should see output in your "watch cat /proc/drbd" window
stating that one node has gone primary and is in sync.

The following will make the current DRBD system secondary and ditch the
split-brain data in one go ("remote" has to be added to the hosts file
and a passwordless login should be set up before doing this):

    drbdadm -- --discard-my-data connect storage
    ssh remote "drbdadm connect all"

You can also add the following to your DRBD resource config for automated
split-brain recovery:

    resource <resource> {
      handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        ...
      }
      net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        ...
      }
      ...
    }
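If you would rather script the manual recovery than rely on the automatic
policies, the decision can be wrapped in a few lines of shell. The
following is only a rough sketch, not a supported tool: it assumes it is
run on the node whose data you are prepared to discard, that "peer" is an
SSH alias you have set up for the surviving node, and that the kernel log
still contains the message shown above.

    #!/bin/bash
    # Sketch: after a detected split brain, discard the local data on
    # THIS node and reconnect both sides. "peer" is a hypothetical SSH
    # alias for the surviving node.
    set -e
    if dmesg | grep -q "Split-Brain detected"; then
        drbdadm secondary all                     # victim must not be primary
                                                  # (fails if the device is in use)
        drbdadm -- --discard-my-data connect all  # throw away local changes
        ssh peer "drbdadm connect all"            # reconnect from the survivor
    else
        echo "No split-brain message found; try the outdate/invalidate steps first."
    fi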
It should now be possible to use drbdmanage to do this for you:

    drbdmanage net-options --resource storage \
        --after-sb-0pri discard-zero-changes \
        --after-sb-1pri discard-secondary \
        --after-sb-2pri disconnect
    drbdmanage handlers --resource storage \
        --split-brain /usr/lib/drbd/notify-split-brain.sh

Once you have confirmed that the data is valid you can scrub the
drbdmanage configuration with the "drbdmanage uninit" command. Please
ensure that you have enough valid nodes left in your drbdmanage cluster
to have quorum and to allow the services to start.

I use the following to quickly blow away the local configuration from a
node (a consolidated sketch of this procedure follows after the quoted
message below).

Scrub the DRBD configuration from a node

On the broken node:

    drbdadm down all
    drbdadm down .drbdctrl
    drbdmanage uninit
    # if vgremove gives an error here, reboot the server or check pvscan
    # for additional volumes mapped incorrectly by lvmonitor
    vgremove drbdpool
    vgcreate drbdpool /dev/sdb

On the working node:

    drbdmanage rn nodename.domain.name --force
    drbdmanage an nodename.domain.name 10.x.x.x

Jay

On 2 October 2017 at 11:37, Martyn Spencer
<msdreg_linbit at microdata.co.uk> wrote:
> I am testing a three node DRBD 9.0.9 setup using packages I built for
> CentOS7. I am using the latest drbdmanage and drbd-utils versions. If I
> lose the data on the resources, it is fine (I am only testing) but I
> was wanting to learn how to manage (if possible) the mess that I have
> just caused :)
>
> Two nodes were working fine; let's call them node1 and node2.
>
> When I attempted to add node3, without storage, it failed. This is
> something I will worry about later.
>
> I managed to put node1 into a state where it had pending actions that I
> could not remove, so decided to remove the node and then re-add it.
> Rather naively I did not check and the DRBD resources were all
> role:primary on node1. Now node1 is in a state "pending: remove" and I
> cannot in any way seem to add it back to the cluster. If I use
> list-assignments, I can see that the resources all have pending actions
> "decommission" against node1. I am quite clear that DRBD is doing
> exactly what I asked it to do, and it also looks as though it is
> protecting me from my own mistakes somewhat (since the underlying DRBD
> resources appear to be OK).
>
> I would like to ensure that the data that is in the resources on node1
> is synchronised with node2 before doing anything else. At present, all
> the node1 resources are showing as "UpToDate" and "connecting" and the
> node2 resources are showing as "Outdated" and they are not attempting
> to reconnect to node1.
>
> Is there a way to force them to connect to node1 to resynchronise
> before I continue?
>
> Many thanks,
>
> Martyn Spencer
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user

--
"The only difference between saints and sinners is that every saint has
a past while every sinner has a future." — Oscar Wilde
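As promised above, here is the scrub-and-re-add procedure as a single
script run from the working node. This is a rough sketch under some
assumptions, not an official tool: node1.example.com and 10.0.0.1 are
placeholders for the broken node's name and address, /dev/sdb is assumed
to be the backing device for drbdpool, and root SSH access from the
working node to the broken one is assumed. drbdmanage uninit and vgremove
may prompt for confirmation, hence the -t flag to ssh.

    #!/bin/bash
    # Sketch: scrub drbdmanage state from a broken node and re-add it to
    # the cluster. Run on the working node. All names are placeholders.
    set -e
    BROKEN=node1.example.com   # hypothetical FQDN of the broken node
    BROKEN_IP=10.0.0.1         # hypothetical address of the broken node
    PV=/dev/sdb                # assumed backing device for drbdpool

    # Tear down DRBD and the drbdmanage control volume on the broken
    # node, then recreate an empty drbdpool volume group.
    ssh -t root@"$BROKEN" "drbdadm down all &&
                           drbdadm down .drbdctrl &&
                           drbdmanage uninit &&
                           vgremove drbdpool &&
                           vgcreate drbdpool $PV"

    # Drop the stale node entry and add it back ("rn" and "an" are the
    # short aliases for remove-node and add-node used above).
    drbdmanage rn "$BROKEN" --force
    drbdmanage an "$BROKEN" "$BROKEN_IP"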