Hi,

I've updated all drbd packages to the latest versions:

MDA1PFP-S01 11:52:35 2551 0 ~ # yum list "*drbd*"
Loaded plugins: langpacks, product-id, search-disabled-repos, subscription-manager
Installed Packages
drbd.x86_64                  8.9.8-1.el7                  @/drbd-8.9.8-1.el7.x86_64
drbd-bash-completion.x86_64  8.9.8-1.el7                  @/drbd-bash-completion-8.9.8-1.el7.x86_64
drbd-heartbeat.x86_64        8.9.8-1.el7                  @/drbd-heartbeat-8.9.8-1.el7.x86_64
drbd-pacemaker.x86_64        8.9.8-1.el7                  @/drbd-pacemaker-8.9.8-1.el7.x86_64
drbd-udev.x86_64             8.9.8-1.el7                  @/drbd-udev-8.9.8-1.el7.x86_64
drbd-utils.x86_64            8.9.8-1.el7                  installed
drbd-xen.x86_64              8.9.8-1.el7                  @/drbd-xen-8.9.8-1.el7.x86_64
kmod-drbd.x86_64             9.0.4_3.10.0_327.28.3-1.el7  @/kmod-drbd-9.0.4_3.10.0_327.28.3-1.el7.x86_64

but this did not fix the problem. The cluster starts fine, but when I stop the node running the DRBD master, the resource is not promoted on the other node. Here is the test I am conducting:

1. Start the cluster:

MDA1PFP-S01 12:07:00 2566 0 ~ # pcs status
Cluster name: MDA1PFP
Last updated: Tue Sep 20 12:07:24 2016
Last change: Tue Sep 20 12:06:49 2016 by root via cibadmin on MDA1PFP-PCS02
Stack: corosync
Current DC: MDA1PFP-PCS01 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 6 resources configured

Online: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]

Full list of resources:

 mda-ip (ocf::heartbeat:IPaddr2): Started MDA1PFP-PCS01
 Clone Set: ping-clone [ping]
     Started: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]
 ACTIVE (ocf::heartbeat:Dummy): Started MDA1PFP-PCS01
 Master/Slave Set: drbd1_sync [drbd1]
     Masters: [ MDA1PFP-PCS01 ]
     Slaves: [ MDA1PFP-PCS02 ]

PCSD Status:
  MDA1PFP-PCS01: Online
  MDA1PFP-PCS02: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

MDA1PFP-S01 12:06:31 2565 0 ~ # drbd-overview
 1:shared_fs/0 Connected Primary/Secondary UpToDate/UpToDate

2. Stop the active cluster node:

MDA1PFP-S02 12:08:00 1295 0 ~ # pcs status
Cluster name: MDA1PFP
Last updated: Tue Sep 20 12:08:17 2016
Last change: Tue Sep 20 12:08:04 2016 by root via cibadmin on MDA1PFP-PCS02
Stack: corosync
Current DC: MDA1PFP-PCS02 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 6 resources configured

Online: [ MDA1PFP-PCS02 ]
OFFLINE: [ MDA1PFP-PCS01 ]

Full list of resources:

 mda-ip (ocf::heartbeat:IPaddr2): Started MDA1PFP-PCS02
 Clone Set: ping-clone [ping]
     Started: [ MDA1PFP-PCS02 ]
     Stopped: [ MDA1PFP-PCS01 ]
 ACTIVE (ocf::heartbeat:Dummy): Started MDA1PFP-PCS02
 Master/Slave Set: drbd1_sync [drbd1]
     Slaves: [ MDA1PFP-PCS02 ]
     Stopped: [ MDA1PFP-PCS01 ]

PCSD Status:
  MDA1PFP-PCS01: Online
  MDA1PFP-PCS02: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

In the log files I can see that the node actually gets promoted to master but is then demoted immediately, and I don't see the reason for this:

Sep 20 12:08:00 MDA1PFP-S02 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="3224" x-info="http://www.rsyslog.com"] start
Sep 20 12:08:00 MDA1PFP-S02 rsyslogd-2221: module 'imuxsock' already in this config, cannot be added [try http://www.rsyslog.com/e/2221 ]
Sep 20 12:08:00 MDA1PFP-S02 systemd: Stopping System Logging Service...
Sep 20 12:08:00 MDA1PFP-S02 systemd: Starting System Logging Service...
Sep 20 12:08:00 MDA1PFP-S02 systemd: Started System Logging Service.
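For context, the fence-peer activity that shows up further down in the log comes from DRBD's Pacemaker fencing hooks. They are typically wired into the resource configuration roughly like this (a sketch based on the stock scripts shipped by drbd-utils; only the resource name is taken from the log, the rest is assumed and may differ from my actual config):

```
resource shared_fs {
  net {
    # invoke the fence-peer handler when the replication link is lost
    fencing resource-only;
  }
  handlers {
    # places a -INFINITY location constraint on the Master role in the CIB
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # removes that constraint again once the peer has resynced
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```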
Sep 20 12:08:03 MDA1PFP-S02 crmd[2354]: notice: Operation ACTIVE_start_0: ok (node=MDA1PFP-PCS02, call=29, rc=0, cib-update=21, confirmed=true)
Sep 20 12:08:03 MDA1PFP-S02 crmd[2354]: notice: Operation drbd1_notify_0: ok (node=MDA1PFP-PCS02, call=28, rc=0, cib-update=0, confirmed=true)
Sep 20 12:08:04 MDA1PFP-S02 kernel: block drbd1: peer( Primary -> Secondary )
Sep 20 12:08:04 MDA1PFP-S02 IPaddr2(mda-ip)[3528]: INFO: Adding inet address 192.168.120.20/32 with broadcast address 192.168.120.255 to device bond0
Sep 20 12:08:04 MDA1PFP-S02 avahi-daemon[1084]: Registering new address record for 192.168.120.20 on bond0.IPv4.
Sep 20 12:08:04 MDA1PFP-S02 IPaddr2(mda-ip)[3528]: INFO: Bringing device bond0 up
Sep 20 12:08:04 MDA1PFP-S02 IPaddr2(mda-ip)[3528]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-192.168.120.20 bond0 192.168.120.20 auto not_used not_used
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Operation mda-ip_start_0: ok (node=MDA1PFP-PCS02, call=31, rc=0, cib-update=23, confirmed=true)
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Operation drbd1_notify_0: ok (node=MDA1PFP-PCS02, call=32, rc=0, cib-update=0, confirmed=true)
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Operation drbd1_notify_0: ok (node=MDA1PFP-PCS02, call=34, rc=0, cib-update=0, confirmed=true)
Sep 20 12:08:04 MDA1PFP-S02 kernel: drbd shared_fs: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Sep 20 12:08:04 MDA1PFP-S02 kernel: drbd shared_fs: ack_receiver terminated
Sep 20 12:08:04 MDA1PFP-S02 kernel: drbd shared_fs: Terminating drbd_a_shared_f
Sep 20 12:08:04 MDA1PFP-S02 kernel: drbd shared_fs: Connection closed
Sep 20 12:08:04 MDA1PFP-S02 kernel: drbd shared_fs: conn( TearDown -> Unconnected )
Sep 20 12:08:04 MDA1PFP-S02 kernel: drbd shared_fs: receiver terminated
Sep 20 12:08:04 MDA1PFP-S02 kernel: drbd shared_fs: Restarting receiver thread
Sep 20 12:08:04 MDA1PFP-S02 kernel: drbd shared_fs: receiver (re)started
Sep 20 12:08:04 MDA1PFP-S02 kernel: drbd shared_fs: conn( Unconnected -> WFConnection )
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Operation drbd1_notify_0: ok (node=MDA1PFP-PCS02, call=35, rc=0, cib-update=0, confirmed=true)
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Operation drbd1_notify_0: ok (node=MDA1PFP-PCS02, call=36, rc=0, cib-update=0, confirmed=true)
Sep 20 12:08:04 MDA1PFP-S02 kernel: drbd shared_fs: helper command: /sbin/drbdadm fence-peer shared_fs
Sep 20 12:08:04 MDA1PFP-S02 crm-fence-peer.sh[3779]: invoked for shared_fs
Sep 20 12:08:04 MDA1PFP-S02 crm-fence-peer.sh[3779]: INFO peer is not reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-shared_fs-drbd1_sync'
Sep 20 12:08:04 MDA1PFP-S02 kernel: drbd shared_fs: helper command: /sbin/drbdadm fence-peer shared_fs exit code 5 (0x500)
Sep 20 12:08:04 MDA1PFP-S02 kernel: drbd shared_fs: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
Sep 20 12:08:04 MDA1PFP-S02 kernel: drbd shared_fs: pdsk( DUnknown -> Outdated )
Sep 20 12:08:04 MDA1PFP-S02 kernel: block drbd1: role( Secondary -> Primary )
Sep 20 12:08:04 MDA1PFP-S02 kernel: block drbd1: new current UUID 098EF9936C4F4D27:5157BB476E60F5AA:6BC19D97CF96E5D2:6BC09D97CF96E5D2
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: error: pcmkRegisterNode: Triggered assert at xml.c:594 : node->type == XML_ELEMENT_NODE
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Operation drbd1_promote_0: ok (node=MDA1PFP-PCS02, call=37, rc=0, cib-update=25, confirmed=true)
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Operation drbd1_notify_0: ok (node=MDA1PFP-PCS02, call=38, rc=0, cib-update=0, confirmed=true)
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Our peer on the DC (MDA1PFP-PCS01) is dead
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
Sep 20 12:08:04 MDA1PFP-S02 attrd[2351]: notice: crm_update_peer_proc: Node MDA1PFP-PCS01[1] - state is now lost (was member)
Sep 20 12:08:04 MDA1PFP-S02 attrd[2351]: notice: Removing all MDA1PFP-PCS01 attributes for attrd_peer_change_cb
Sep 20 12:08:04 MDA1PFP-S02 attrd[2351]: notice: Lost attribute writer MDA1PFP-PCS01
Sep 20 12:08:04 MDA1PFP-S02 attrd[2351]: notice: Removing MDA1PFP-PCS01/1 from the membership list
Sep 20 12:08:04 MDA1PFP-S02 attrd[2351]: notice: Purged 1 peers with id=1 and/or uname=MDA1PFP-PCS01 from the membership cache
Sep 20 12:08:04 MDA1PFP-S02 stonith-ng[2349]: notice: crm_update_peer_proc: Node MDA1PFP-PCS01[1] - state is now lost (was member)
Sep 20 12:08:04 MDA1PFP-S02 stonith-ng[2349]: notice: Removing MDA1PFP-PCS01/1 from the membership list
Sep 20 12:08:04 MDA1PFP-S02 stonith-ng[2349]: notice: Purged 1 peers with id=1 and/or uname=MDA1PFP-PCS01 from the membership cache
Sep 20 12:08:04 MDA1PFP-S02 cib[2348]: notice: crm_update_peer_proc: Node MDA1PFP-PCS01[1] - state is now lost (was member)
Sep 20 12:08:04 MDA1PFP-S02 cib[2348]: notice: Removing MDA1PFP-PCS01/1 from the membership list
Sep 20 12:08:04 MDA1PFP-S02 cib[2348]: notice: Purged 1 peers with id=1 and/or uname=MDA1PFP-PCS01 from the membership cache
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: warning: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Notifications disabled
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: error: pcmkRegisterNode: Triggered assert at xml.c:594 : node->type == XML_ELEMENT_NODE
Sep 20 12:08:04 MDA1PFP-S02 pengine[2353]: notice: On loss of CCM Quorum: Ignore
Sep 20 12:08:04 MDA1PFP-S02 pengine[2353]: notice: Demote drbd1:0 (Master -> Slave MDA1PFP-PCS02)
Sep 20 12:08:04 MDA1PFP-S02 pengine[2353]: notice: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-1813.bz2
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Initiating action 55: notify drbd1_pre_notify_demote_0 on MDA1PFP-PCS02 (local)
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Operation drbd1_notify_0: ok (node=MDA1PFP-PCS02, call=39, rc=0, cib-update=0, confirmed=true)
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Initiating action 18: demote drbd1_demote_0 on MDA1PFP-PCS02 (local)
Sep 20 12:08:04 MDA1PFP-S02 kernel: block drbd1: role( Primary -> Secondary )
Sep 20 12:08:04 MDA1PFP-S02 kernel: block drbd1: bitmap WRITE of 0 pages took 0 jiffies
Sep 20 12:08:04 MDA1PFP-S02 kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Sep 20 12:08:04 MDA1PFP-S02 systemd-udevd: error: /dev/drbd1: Wrong medium type
Sep 20 12:08:04 MDA1PFP-S02 systemd-udevd: error: /dev/drbd1: Wrong medium type
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: error: pcmkRegisterNode: Triggered assert at xml.c:594 : node->type == XML_ELEMENT_NODE
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Operation drbd1_demote_0: ok (node=MDA1PFP-PCS02, call=40, rc=0, cib-update=48, confirmed=true)
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Initiating action 56: notify drbd1_post_notify_demote_0 on MDA1PFP-PCS02 (local)
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Operation drbd1_notify_0: ok (node=MDA1PFP-PCS02, call=41, rc=0, cib-update=0, confirmed=true)
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Initiating action 20: monitor drbd1_monitor_60000 on MDA1PFP-PCS02 (local)
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: error: pcmkRegisterNode: Triggered assert at xml.c:594 : node->type == XML_ELEMENT_NODE
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: Transition 0 (Complete=10, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1813.bz2): Complete
Sep 20 12:08:04 MDA1PFP-S02 crmd[2354]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
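For reference, the constraint that crm-fence-peer.sh reports placing in the log above normally bans the Master role on every node except the one that won the fencing race. The sketch below writes out what I understand that constraint to look like in the CIB (the ids are taken from the log message; the rule body and the node value are assumptions based on the DRBD user guide, not dumped from my cluster):

```shell
# Hypothetical reconstruction of the location constraint added by
# crm-fence-peer.sh: a -INFINITY score on the Master role for every node
# whose #uname differs from the surviving node's cluster node name.
cat > fence-constraint.xml <<'EOF'
<rsc_location id="drbd-fence-by-handler-shared_fs-drbd1_sync" rsc="drbd1_sync">
  <rule role="Master" score="-INFINITY"
        id="drbd-fence-by-handler-shared_fs-rule-drbd1_sync">
    <expression attribute="#uname" operation="ne" value="MDA1PFP-PCS02"
                id="drbd-fence-by-handler-shared_fs-expr-drbd1_sync"/>
  </rule>
</rsc_location>
EOF

# Show which node the rule still allows to be Master:
grep -o 'value="[^"]*"' fence-constraint.xml   # -> value="MDA1PFP-PCS02"
```

If this shape is right, the things to check would be whether the live constraint (visible via `pcs constraint location --full` or `cibadmin -Q`) carries a value matching the node name the surviving node is actually known by, and why pengine nevertheless chose the demote, e.g. by replaying the recorded transition with `crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-1813.bz2`. A stale constraint can be removed with `pcs constraint remove drbd-fence-by-handler-shared_fs-drbd1_sync`.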
Sep 20 12:08:05 MDA1PFP-S02 crmd[2354]: notice: crm_reap_unseen_nodes: Node MDA1PFP-PCS01[1] - state is now lost (was member)
Sep 20 12:08:05 MDA1PFP-S02 pacemakerd[2335]: notice: crm_reap_unseen_nodes: Node MDA1PFP-PCS01[1] - state is now lost (was member)
Sep 20 12:08:05 MDA1PFP-S02 crmd[2354]: warning: No match for shutdown action on 1
Sep 20 12:08:05 MDA1PFP-S02 crmd[2354]: notice: Stonith/shutdown of MDA1PFP-PCS01 not matched
Sep 20 12:08:05 MDA1PFP-S02 crmd[2354]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 20 12:08:05 MDA1PFP-S02 corosync[2244]: [TOTEM ] A new membership (192.168.121.11:1452) was formed. Members left: 1

Best wishes,
  Jens

--
Jens Auer | CGI | Software Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
jens.auer at cgi.com