Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
> I don't think this has anything to do with DRBD, because:

OK.

> Apparently, something downed the NICs for corosync communication.
> Which then leads to fencing.

No problem with the NICs.

> Maybe you should double check your network configuration,
> and any automagic reconfiguration of the network,
> and only start corosync once your network is "stable"?

Here is another manifestation of a similar problem with dual-Primary DRBD integrated with STONITH-enabled Pacemaker: when server7 goes down, Pacemaker attempts to demote the DRBD resource on the surviving node server4 to Secondary. The demotion fails because DRBD is hosting a GFS2 volume, and Pacemaker flags the failure as an error. When server7 comes back up, server4 is STONITH'd, probably because of that error. So the problem I am facing is that when either node crashes:

1) The surviving node behaves strangely: in this case DRBD is demoted, and the failure to demote is flagged as a Pacemaker error.
2) When the crashed node comes back up, the surviving node is STONITH'd.

Logs on the surviving node server4 when server7 went down:
---------------------------------------------------------
May 11 13:54:55 server4 crmd[4032]: notice: State transition S_IDLE -> S_POLICY_ENGINE
May 11 13:54:55 server4 kernel: drbd vDrbd: helper command: /sbin/drbdadm fence-peer vDrbd exit code 5 (0x500)
May 11 13:54:55 server4 kernel: drbd vDrbd: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
May 11 13:54:55 server4 kernel: drbd vDrbd: pdsk( DUnknown -> Outdated )
May 11 13:54:55 server4 kernel: block drbd0: new current UUID 8D5135C7BB88B6BF:CB0611075C849723:FAA83EE5FF5969E7:0000000000000004
May 11 13:54:55 server4 kernel: drbd vDrbd: susp( 1 -> 0 )
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: jid=1: Looking at journal...
May 11 13:54:55 server4 crm-fence-peer.sh[17468]: INFO peer is not reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-vDrbd-drbd_data_clone'
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: jid=1: Acquiring the transaction lock...
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: jid=1: Replaying journal...
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: jid=1: Replayed 4 of 13 blocks
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: jid=1: Found 4 revoke tags
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: jid=1: Journal replayed in 1s
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: recover jid 1 result success
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: jid=1: Done
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: recover generation 3 done
May 11 13:54:55 server4 pengine[4031]: notice: On loss of CCM Quorum: Ignore
*May 11 13:54:55 server4 pengine[4031]: notice: Demote drbd_data:0#011(Master -> Slave server4ha)*
May 11 13:54:55 server4 pengine[4031]: notice: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-39.bz2
May 11 13:54:55 server4 crmd[4032]: notice: Initiating notify operation drbd_data_pre_notify_demote_0 locally on server4ha
May 11 13:54:55 server4 crmd[4032]: notice: Result of notify operation for drbd_data on server4ha: 0 (ok)
May 11 13:54:55 server4 crmd[4032]: notice: Initiating demote operation drbd_data_demote_0 locally on server4ha
May 11 13:54:55 server4 kernel: block drbd0: State change failed: Device is held open by someone
May 11 13:54:55 server4 kernel: block drbd0: state = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated r----- }
May 11 13:54:55 server4 kernel: block drbd0: wanted = { cs:WFConnection ro:Secondary/Unknown ds:UpToDate/Outdated r----- }
May 11 13:54:55 server4 drbd(drbd_data)[18625]: ERROR: vDrbd: Called drbdadm -c /etc/drbd.conf secondary vDrbd
May 11 13:54:55 server4 drbd(drbd_data)[18625]: ERROR: vDrbd: Exit code 11
May 11 13:54:55 server4 drbd(drbd_data)[18625]: ERROR: vDrbd: Command output:
May 11 13:54:55 server4 drbd(drbd_data)[18625]: ERROR: vDrbd: Called drbdadm -c /etc/drbd.conf secondary vDrbd
May 11 13:54:55 server4 drbd(drbd_data)[18625]: ERROR: vDrbd: Exit code 11
May 11 13:54:55 server4 drbd(drbd_data)[18625]: ERROR: vDrbd: Command output:
May 11 13:54:55 server4 kernel: block drbd0: State change failed: Device is held open by someone
May 11 13:58:06 server4 lrmd[4029]: notice: drbd_data_stop_0:21199:stderr [ Command 'drbdsetup-84 secondary 0' terminated with exit code 11 ]
May 11 13:58:06 server4 lrmd[4029]: notice: drbd_data_stop_0:21199:stderr [ 0: State change failed: (-12) Device is held open by someone ]
May 11 13:58:06 server4 lrmd[4029]: notice: drbd_data_stop_0:21199:stderr [ Command 'drbdsetup-84 secondary 0' terminated with exit code 11 ]
May 11 13:58:06 server4 crmd[4032]: error: Result of stop operation for drbd_data on server4ha: Timed Out
May 11 13:58:06 server4 crmd[4032]: notice: server4ha-drbd_data_stop_0:42 [ 0: State change failed: (-12) Device is held open by someone\nCommand 'drbdsetup-84 secondary 0' terminated with exit code 11\n0: State change failed: (-12) Device is held open by someone\nCommand 'drbdsetup-84 secondary 0' terminated with exit code 11\n0: State change failed: (-12) Device is held open by someone\nCommand 'drbdsetup-84 secondary 0' terminated with exit code 11\n0: State change failed: (-12) Device is held open by someone\nCommand 'drbdsetup-84 seco
*May 11 13:58:06 server4 crmd[4032]: warning: Action 5 (drbd_data_stop_0) on server4ha failed (target: 0 vs. rc: 1): Error*
May 11 13:58:06 server4 crmd[4032]: notice: Transition aborted by operation drbd_data_stop_0 'modify' on server4ha: Event failed
May 11 13:58:06 server4 crmd[4032]: warning: Action 5 (drbd_data_stop_0) on server4ha failed (target: 0 vs. rc: 1): Error
May 11 13:58:06 server4 crmd[4032]: notice: Transition 3 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-40.bz2): Complete
May 11 13:58:06 server4 pengine[4031]: notice: On loss of CCM Quorum: Ignore
May 11 13:58:06 server4 pengine[4031]: warning: Processing failed op stop for drbd_data:0 on server4ha: unknown error (1)
May 11 13:58:06 server4 pengine[4031]: warning: Processing failed op stop for drbd_data:0 on server4ha: unknown error (1)
May 11 13:58:06 server4 pengine[4031]: warning: Node server4ha will be fenced because of resource failure(s)
May 11 13:58:06 server4 pengine[4031]: warning: Forcing drbd_data_clone away from server4ha after 1000000 failures (max=1)
May 11 13:58:06 server4 pengine[4031]: warning: Forcing drbd_data_clone away from server4ha after 1000000 failures (max=1)
May 11 13:58:06 server4 pengine[4031]: warning: Scheduling Node server4ha for STONITH
May 11 13:58:06 server4 pengine[4031]: notice: Stop of failed resource drbd_data:0 is implicit after server4ha is fenced
May 11 13:58:06 server4 pengine[4031]: notice: Stop vCluster-VirtualIP-10.168.10.199#011(server4ha)
May 11 13:58:06 server4 pengine[4031]: notice: Stop vCluster-Stonith-server7ha#011(server4ha)
May 11 13:58:06 server4 pengine[4031]: notice: Stop dlm:0#011(server4ha)
May 11 13:58:06 server4 pengine[4031]: notice: Stop clvmd:0#011(server4ha)
May 11 13:58:06 server4 pengine[4031]: notice: Stop drbd_data:0#011(server4ha)
May 11 13:58:06 server4 pengine[4031]: warning: Calculated transition 4 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-3.bz2
May 11 13:58:06 server4 crmd[4032]: notice: Requesting fencing (reboot) of node server4ha
May 11 13:58:06 server4 stonith-ng[4028]: notice: Client crmd.4032.582f53c4 wants to fence (reboot) 'server4ha' with device '(any)'
May 11 13:58:06 server4 stonith-ng[4028]: notice: Requesting peer fencing (reboot) of server4ha
May 11 14:13:06 server4 stonith-ng[4028]: notice: Client crmd.4032.582f53c4 wants to fence (reboot) 'server4ha' with device '(any)'
May 11 14:13:06 server4 stonith-ng[4028]: notice: Requesting peer fencing (reboot) of server4ha
May 11 14:13:06 server4 stonith-ng[4028]: notice: vCluster-Stonith-server7ha can not fence (reboot) server4ha: static-list
May 11 14:13:06 server4 stonith-ng[4028]: notice: Couldn't find anyone to fence (reboot) server4ha with any device
May 11 14:13:06 server4 stonith-ng[4028]: error: Operation reboot of server4ha by <no-one> for crmd.4032@server4ha.09ace2fb: No such device

After server7 comes back up, server4 requests that it itself be STONITH'd:
--------------------------------------------------------------------------
May 11 14:18:40 server4 corosync[4010]: [TOTEM ] A new membership ( 192.168.11.100:84) was formed. Members joined: 2
May 11 14:18:40 server4 corosync[4010]: [QUORUM] Members[2]: 1 2
May 11 14:18:40 server4 corosync[4010]: [MAIN ] Completed service synchronization, ready to provide service.
May 11 14:18:40 server4 crmd[4032]: notice: Node server7ha state is now member
May 11 14:18:40 server4 pacemakerd[4026]: notice: Node server7ha state is now member
May 11 14:18:43 server4 crmd[4032]: notice: Syncing the Cluster Information Base from server7ha to rest of cluster
May 11 14:18:45 server4 pengine[4031]: warning: Processing failed op stop for drbd_data:0 on server4ha: unknown error (1)
May 11 14:18:45 server4 pengine[4031]: warning: Processing failed op stop for drbd_data:0 on server4ha: unknown error (1)
May 11 14:18:45 server4 pengine[4031]: warning: Node server4ha will be fenced because of resource failure(s)
May 11 14:18:45 server4 pengine[4031]: warning: Forcing drbd_data_clone away from server4ha after 1000000 failures (max=1)
*May 11 14:18:45 server4 pengine[4031]: warning: Scheduling Node server4ha for STONITH*
May 11 14:18:45 server4 pengine[4031]: notice: Stop of failed resource drbd_data:0 is implicit after server4ha is fenced
May 11 14:18:45 server4 pengine[4031]: notice: Move vCluster-VirtualIP-10.168.10.199#011(Started server4ha -> server7ha)
May 11 14:18:45 server4 pengine[4031]: notice: Start vCluster-Stonith-server4ha#011(server7ha)
May 11 14:18:45 server4 pengine[4031]: notice: Stop vCluster-Stonith-server7ha#011(server4ha)
May 11 14:18:45 server4 pengine[4031]: notice: Move dlm:0#011(Started server4ha -> server7ha)
May 11 14:18:45 server4 pengine[4031]: notice: Move clvmd:0#011(Started server4ha -> server7ha)
May 11 14:18:45 server4 pengine[4031]: notice: Stop drbd_data:0#011(server4ha)
May 11 14:18:45 server4 pengine[4031]: notice: Start drbd_data:1#011(server7ha)
May 11 14:18:45 server4 crmd[4032]: notice: Initiating monitor operation vCluster-VirtualIP-10.168.10.199_monitor_0 on server7ha
May 11 14:18:45 server4 crmd[4032]: notice: Initiating monitor operation vCluster-Stonith-server4ha_monitor_0 on server7ha
May 11 14:18:45 server4 crmd[4032]: notice: Initiating monitor operation vCluster-Stonith-server7ha_monitor_0 on server7ha
May 11 14:18:45 server4 crmd[4032]: notice: Initiating monitor operation dlm_monitor_0 on server7ha
May 11 14:18:45 server4 crmd[4032]: notice: Initiating monitor operation clvmd_monitor_0 on server7ha
May 11 14:18:45 server4 crmd[4032]: notice: Initiating monitor operation drbd_data:1_monitor_0 on server7ha
May 11 14:18:45 server4 crmd[4032]: notice: Requesting fencing (reboot) of node server4ha
May 11 14:18:45 server4 stonith-ng[4028]: notice: Client crmd.4032.582f53c4 wants to fence (reboot) 'server4ha' with device '(any)'
May 11 14:18:45 server4 stonith-ng[4028]: notice: Requesting peer fencing (reboot) of server4ha
May 11 14:18:45 server4 crmd[4032]: notice: Initiating start operation vCluster-Stonith-server4ha_start_0 on server7ha
May 11 14:18:45 server4 systemd-logind: Power key pressed.
May 11 14:18:45 server4 systemd-logind: Powering Off...
May 11 14:18:45 server4 systemd-logind: System is powering down.

I am unable to understand why, when DRBD is Primary on both nodes, it gets demoted on the surviving node when the other node goes down. Is it to avoid split-brain? This demotion is causing problems, because I want the surviving node to remain Primary and not be demoted to Secondary: the GFS2 cluster volume on top of DRBD hosts my VM.
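(Side note: the "Device is held open by someone" errors above are presumably due to the GFS2 filesystem still being mounted on drbd0 when the demote is attempted. A quick sanity check on the surviving node, for example

    findmnt -S /dev/drbd0   # show the GFS2 mount point backed by the DRBD device
    fuser -vm /dev/drbd0    # list processes keeping the device / mounted filesystem busy

should show whether anything other than the GFS2 mount is holding the device open.)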
DRBD Pacemaker integration commands used:
--------------------------------------------------------------------------
pcs -f drbd_cfg resource create drbd_data ocf:linbit:drbd drbd_resource=${DRBD_RESOURCE_NAME} op monitor interval=60s
pcs -f drbd_cfg resource master drbd_data_clone drbd_data master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 \
    notify=true interleave=true target-role=Started

(The DRBD-side fencing/handler configuration is sketched in the P.S. at the bottom.)

Behaviour I want:
-------------------------------------
After successfully creating a Pacemaker 2-node cluster with dual-Primary DRBD together with cLVM/DLM/GFS2:

1) If any one node goes down, the other node remains up with no disruption to DRBD or the other resources.
2) After the crashed node comes back up and rejoins the cluster, it should do so seamlessly, with no disruption to any of the resources.

Any ideas on Pacemaker and/or DRBD configuration to achieve this would be helpful.

Thanks,
Raman

On Wed, May 10, 2017 at 7:01 PM, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:
> On Wed, May 10, 2017 at 02:07:45AM +0530, Raman Gupta wrote:
> > Hi,
> >
> > In a Pacemaker 2 node cluster with dual-Primary DRBD(drbd84) with
> > GFS2/DLM/CLVM setup following issue happens:
> >
> > Steps:
> > ---------
> > 1) Successfully created Pacemaker 2 node cluster with DRBD master/slave
> > resources integrated.
> > 2) Cluster nodes: server4 and server7
> > 3) The server4 node is rebooted.
> > 4) When server4 comes Up the server7 is stonith'd and is lost! The node
> > server4 survives.
> >
> > Problem:
> > -----------
> > Problem is #4 above, when server4 comes up why server7 is stonith'd?
> >
> > From surviving node server4 the DRBD logs seems to be OK: DRBD has moved to
> > Connected/UpToDate state. Suddenly server7 is rebooted (stonithd/fenced)
> > between time 00:47:35 <--> 00:47:42 in below logs.
>
> I don't think this has anything to do with DRBD, because:
>
> > /var/log/messages at server4
> > ------------------------------------------------
> > May 10 00:47:41 server4 kernel: tg3 0000:02:00.1 em4: Link is down
> > May 10 00:47:42 server4 kernel: tg3 0000:02:00.0 em3: Link is down
> > May 10 00:47:42 server4 corosync[12570]: [TOTEM ] A processor failed,
> > forming new configuration.
> > May 10 00:47:43 server4 stonith-ng[12593]: notice: Operation 'reboot'
> > [13018] (call 2 from crmd.13562) for host 'server7ha' with device
> > 'vCluster-Stonith-server7ha' returned: 0 (OK)
>
> There.
>
> Apparently, something downed the NICs for corosync communication.
> Which then leads to fencing.
>
> Maybe you should double check your network configuration,
> and any automagic reconfiguration of the network,
> and only start corosync once your network is "stable"?
>
> --
> : Lars Ellenberg
> : LINBIT | Keeping the Digital World Running
> : DRBD -- Heartbeat -- Corosync -- Pacemaker
>
> DRBD® and LINBIT® are registered trademarks of LINBIT
> __
> please don't Cc me, but send to list -- I'm subscribed
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
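P.S. For reference, the crm-fence-peer.sh message in the logs comes from the DRBD-side fencing setup. Roughly, the relevant part of the vDrbd resource definition looks like this (a sketch of the usual drbd84 handler configuration rather than a verbatim copy of my file; device/disk/address sections omitted):

resource vDrbd {
  net {
    protocol C;
    allow-two-primaries yes;        # needed for dual-Primary with GFS2
  }
  disk {
    fencing resource-and-stonith;   # suspend I/O and call fence-peer when the peer is lost
  }
  handlers {
    # places/removes the 'drbd-fence-by-handler-vDrbd-drbd_data_clone' constraint seen in the logs
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}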