Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,

I have been able to get the cluster started successfully, even after one or two reboots, though sometimes ungracefully (the cluster is running and I just sudo reboot). I am doing some testing of how it handles that situation. Sometimes the boot just hangs and I have to manually reset the VM (ESXi); I was wondering if it's actually doing something or just literally hanging. The console shows nothing.

Regardless, I am e-mailing because my cluster finally started and doesn't throw any errors that I can see, but now it just sits with everything Stopped and both nodes online.

Can you take a look at my configs? Are there any logs you might need? What could be the culprit? Much appreciated!
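For reference, here is roughly what I run on each node to collect state (a sketch; happy to send any other logs):

pcs status --full                        # full cluster/resource view
cat /proc/drbd                           # DRBD connection and disk state
journalctl -b -u corosync -u pacemaker   # messages from the current boot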
DRBD Config - NO CLUSTER
sed -i 's/\(^SELINUX=\).*/\1disabled/' /etc/selinux/config && reboot
sestatus
systemctl stop firewalld
systemctl disable firewalld
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
yum install -y kmod-drbd84 drbd84-utils mariadb-server mariadb
systemctl disable mariadb.service
fdisk /dev/sdb    # interactive: create one partition, /dev/sdb1, spanning the disk
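(A non-interactive equivalent of the fdisk step, as a sketch; this assumes the whole disk is dedicated to the single DRBD backing partition /dev/sdb1:)

parted -s /dev/sdb mklabel msdos mkpart primary 1MiB 100%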
cat << EOL > /etc/drbd.d/sql.res
resource sql {
    protocol C;
    meta-disk internal;
    device /dev/drbd0;
    disk /dev/sdb1;
    net {
        allow-two-primaries;
    }
    syncer {
        verify-alg sha1;
    }
    on node1.freesoftwareservers.com {
        address 192.168.1.216:7788;
    }
    on node2.freesoftwareservers.com {
        address 192.168.1.219:7788;
    }
}
EOL
drbdadm create-md sql
modprobe drbd
drbdadm up sql
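Before forcing primary, I check that the two nodes actually see each other (a sketch; Inconsistent/Inconsistent is expected before the first sync):

drbdadm cstate sql    # expect: Connected
drbdadm dstate sql    # expect: Inconsistent/Inconsistent at this point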
# Node1
drbdadm primary --force sql
watch cat /proc/drbd    # wait until the initial sync reports UpToDate/UpToDate
mkfs.xfs /dev/drbd0
mount /dev/drbd0 /mnt
df -h | grep drbd
systemctl start mariadb
mysql_install_db --datadir=/mnt --user=mysql
umount /mnt
systemctl stop mariadb
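As a sanity check before handing the device over to the cluster (a sketch):

cat /proc/drbd      # expect ds:UpToDate/UpToDate once the initial sync is done
drbdadm role sql    # expect: Primary/Secondary on node1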
Cluster Configs - Post DRBD Configuration
yum install -y pcs policycoreutils-python psmisc
echo "passwd" | passwd hacluster --stdin
systemctl start pcsd.service
systemctl enable pcsd.service
# Node1
pcs cluster auth node1 node2 -u hacluster -p passwd
pcs cluster setup --force --name mysql_cluster node1 node2
pcs cluster start --all
# Wait 2 seconds
pcs property set stonith-enabled=false
pcs property set no-quorum-policy=ignore
pcs cluster start --all
# Wait 5 seconds
pcs status
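Before creating resources I also verify the membership corosync sees (a sketch):

pcs status corosync    # membership list
corosync-cfgtool -s    # ring status, should show "no faults"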
pcs resource create sql_drbd_res ocf:linbit:drbd \
drbd_resource=sql \
op monitor interval=30s
pcs resource master SQLClone sql_drbd_res \
master-max=1 master-node-max=1 \
clone-max=2 clone-node-max=1 \
notify=true
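(Given the "expected clone-max -le 2, but found unset" errors in the logs further down, this is how I can dump the master resource to confirm the meta attributes were actually applied; a sketch:)

pcs resource show SQLClone    # should list master-max/clone-max etc. under Meta Attrs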
pcs resource create sql_fs Filesystem \
device="/dev/drbd0" \
directory="/var/lib/mysql" \
fstype="xfs"
pcs resource create sql_service ocf:heartbeat:mysql \
binary="/usr/bin/mysqld_safe" \
config="/etc/my.cnf" \
datadir="/var/lib/mysql" \
pid="/var/lib/mysql/mysql.pid" \
socket="/var/lib/mysql/mysql.sock" \
additional_parameters="--bind-address=0.0.0.0" \
op start timeout=60s \
op stop timeout=60s \
op monitor interval=20s timeout=30s
pcs resource create sql_virtual_ip ocf:heartbeat:IPaddr2 \
ip=192.168.1.215 cidr_netmask=32 nic=eth0 \
op monitor interval=30s
pcs constraint colocation add sql_virtual_ip with sql_service INFINITY
pcs constraint colocation add sql_fs with SQLClone \
INFINITY with-rsc-role=Master
pcs constraint order promote SQLClone then start sql_fs
pcs constraint colocation add sql_service with sql_fs INFINITY
pcs constraint order sql_fs then sql_service
pcs resource group add SQL-Group sql_service sql_fs sql_virtual_ip
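(Since members of a group start in the listed order and stop in reverse, here is how I can dump what pacemaker actually recorded for the group; a sketch:)

pcs resource show SQL-Group    # lists the group members in start order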
pcs cluster start --all
pcs status
[root@node1 ~]# pcs status
Cluster name: mysql_cluster
Last updated: Sun Jun 12 03:04:44 2016          Last change: Sun Jun 12 02:50:16 2016 by root via cibadmin on node1
Stack: corosync
Current DC: node1 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
2 nodes and 5 resources configured

Online: [ node1 node2 ]

Full list of resources:

 Master/Slave Set: SQLClone [sql_drbd_res]
     Masters: [ node1 ]
     Slaves: [ node2 ]
 Resource Group: SQL-Group
     sql_service        (ocf::heartbeat:mysql):         Stopped
     sql_fs             (ocf::heartbeat:Filesystem):    Stopped
     sql_virtual_ip     (ocf::heartbeat:IPaddr2):       Stopped

PCSD Status:
  node1: Online
  node2: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
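In case it helps, here is what I can run next to see why the group stays Stopped (a sketch; debug-start runs the agent outside cluster control, so only where that's safe):

pcs constraint show --full               # all constraints with their ids
pcs resource debug-start sql_fs --full   # run the Filesystem agent by hand, verbose
crm_simulate -sL                         # show placement scores from the live CIB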
[root@node1 ~]# grep -e ERROR -e WARN /var/log/messages
Jun 12 02:07:24 node1 drbd(sql_drbd_res)[3022]: ERROR: meta parameter misconfigured, expected clone-max -le 2, but found unset.
Jun 12 02:07:24 node1 drbd(sql_drbd_res)[3051]: ERROR: meta parameter misconfigured, expected clone-max -le 2, but found unset.
Jun 12 02:07:51 node1 Filesystem(sql_fs)[4094]: ERROR: Couldn't unmount /var/lib/mysql; trying cleanup with TERM
Jun 12 02:12:03 node1 Filesystem(sql_fs)[2019]: WARNING: Couldn't find device [/dev/drbd0]. Expected /dev/??? to exist
Jun 12 02:12:03 node1 drbd(sql_drbd_res)[2214]: ERROR: meta parameter misconfigured, expected clone-max -le 2, but found unset.
Jun 12 02:12:04 node1 Filesystem(sql_fs)[2248]: ERROR: Couldn't find device [/dev/drbd0]. Expected /dev/??? to exist
Jun 12 02:12:04 node1 Filesystem(sql_fs)[2384]: WARNING: Couldn't find device [/dev/drbd0]. Expected /dev/??? to exist
Jun 12 02:38:07 node1 drbd: WARN: stdin/stdout is not a TTY; using /dev/console.
[root@node1 ~]#
Note: I am fairly sure the cluster started at some point after 02:12; that was my debugging time. I just included the above to show the lack of any errors.
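I can also paste a fuller one-shot status with fail counts if that would help (a sketch):

crm_mon -1Arf    # one-shot: node attributes, inactive resources, fail counts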
[root@node1 ~]# tail /var/log/cluster/corosync.log
Jun 12 03:12:31 [2021] node1.freesoftwareservers.com crmd: notice: print_synapse: [Action 43]: Pending pseudo op SQL-Group_running_0 on N/A (priority: 0, waiting: 36 38 40)
Jun 12 03:12:31 [2021] node1.freesoftwareservers.com crmd: notice: print_synapse: [Action 42]: Completed pseudo op SQL-Group_start_0 on N/A (priority: 0, waiting: none)
Jun 12 03:12:31 [2021] node1.freesoftwareservers.com crmd: notice: print_synapse: [Action 37]: Pending rsc op sql_service_monitor_20000 on node1 (priority: 0, waiting: 36)
Jun 12 03:12:31 [2021] node1.freesoftwareservers.com crmd: notice: print_synapse: [Action 36]: Pending rsc op sql_service_start_0 on node1 (priority: 0, waiting: 38)
Jun 12 03:12:31 [2021] node1.freesoftwareservers.com crmd: notice: print_synapse: [Action 39]: Pending rsc op sql_fs_monitor_20000 on node1 (priority: 0, waiting: 38)
Jun 12 03:12:31 [2021] node1.freesoftwareservers.com crmd: notice: print_synapse: [Action 38]: Pending rsc op sql_fs_start_0 on node1 (priority: 0, waiting: 36)
Jun 12 03:12:31 [2021] node1.freesoftwareservers.com crmd: notice: print_synapse: [Action 41]: Pending rsc op sql_virtual_ip_monitor_30000 on node1 (priority: 0, waiting: 40)
Jun 12 03:12:31 [2021] node1.freesoftwareservers.com crmd: notice: print_synapse: [Action 40]: Pending rsc op sql_virtual_ip_start_0 on node1 (priority: 0, waiting: 38)
Jun 12 03:12:31 [2021] node1.freesoftwareservers.com crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Jun 12 03:12:31 [2021] node1.freesoftwareservers.com crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
[root@node1 ~]#
Post Reboot:

#!/bin/bash
# /root/drbdstart.sh
# Run this on both nodes after the post-mortem of an unexpected failure
modprobe drbd
drbdadm up sql

#!/bin/bash
# /root/clusterstart.sh
# Run this on the primary node after the post-mortem of an unexpected failure
drbdadm primary --force sql
cat /proc/drbd
pcs cluster start --all
pcs status
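If it's relevant: I've been thinking about wiring drbdstart.sh into a small unit so it runs by itself after an ungraceful reboot (an untested sketch; the unit name is mine):

cat << EOL > /etc/systemd/system/drbdstart.service
[Unit]
Description=Bring up DRBD resource sql after boot
After=network-online.target

[Service]
Type=oneshot
ExecStart=/root/drbdstart.sh

[Install]
WantedBy=multi-user.target
EOL
systemctl daemon-reload
systemctl enable drbdstart.service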