Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 01/07/15 02:50 PM, Muhammad Sharfuddin wrote:
> On 07/01/2015 10:05 PM, Digimer wrote:
>> You need to set up proper, working fencing.
>>
>> Configure stonith in pacemaker, test it by crashing each node (echo c >
>> /proc/sysrq-trigger) and watch that they get rebooted. Once that works,
>> configure DRBD to use 'fencing resource-and-stonith;' and use the
>> 'crm-{un,}fence-peer.sh' fence/unfence handlers.
>>
>> digimer
>>
>> On 01/07/15 11:21 AM, Muhammad Sharfuddin wrote:
>>> Hello,
>>> First I set up a working dual-primary setup, then I configured the
>>> pacemaker cluster resource to start the drbd resource. As soon as the
>>> cluster starts the drbd resource, a split-brain occurs; please let me
>>> know what I am doing wrong.
>>>
>>> Here is the drbd configuration:
>>>
>>> global_common.conf:
>>>
>>> global { usage-count no; }
>>> common {
>>>     handlers {
>>>         pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh;
>>>             /usr/lib/drbd/notify-emergency-reboot.sh;
>>>             echo b > /proc/sysrq-trigger ; reboot -f";
>>>         pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh;
>>>             /usr/lib/drbd/notify-emergency-reboot.sh;
>>>             echo b > /proc/sysrq-trigger ; reboot -f";
>>>         local-io-error "/usr/lib/drbd/notify-io-error.sh;
>>>             /usr/lib/drbd/notify-emergency-shutdown.sh;
>>>             echo o > /proc/sysrq-trigger ; halt -f";
>>>         split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>>>     }
>>>
>>>     startup { wfc-timeout 0; degr-wfc-timeout 120; become-primary-on both; }
>>>
>>>     disk { on-io-error detach; al-extents 3389; }
>>>
>>>     net {
>>>         allow-two-primaries; after-sb-0pri discard-zero-changes;
>>>         after-sb-1pri discard-secondary; after-sb-2pri disconnect;
>>>         max-buffers 8000; max-epoch-size 8000;
>>>         sndbuf-size 0; verify-alg md5;
>>>         ping-int 2; ping-timeout 2;
>>>         connect-int 2; timeout 5; ko-count 5;
>>>     }
>>> }
>>>
>>> r0.res:
>>>
>>> resource r0 {
>>>     device /dev/drbd_r0 minor 0;
>>>     meta-disk internal;
>>>     on node1 {
>>>         disk "/dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0-part1";
>>>         address 172.16.241.131:7780;
>>>     }
>>>     on node2 {
>>>         disk "/dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0-part1";
>>>         address 172.16.241.132:7780;
>>>     }
>>>     syncer { rate 100M; }
>>> }
>>>
>>> Below is the cluster drbd resource configuration:
>>>
>>> primitive p-drbd ocf:linbit:drbd \
>>>     params drbd_resource="r0" \
>>>     op monitor interval="50" role="Master" timeout="30" \
>>>     op monitor interval="60" role="Slave" timeout="30" \
>>>     op start interval="0" timeout="240" \
>>>     op stop interval="0" timeout="100"
>>> ms ms-drbd p-drbd \
>>>     meta master-max="2" clone-max="2" notify="true" interleave="true"
>>>
>>> When the cluster starts the drbd resource, /var/log/messages shows:
>>>
>>> Jul  1 19:04:40 node2 cibadmin[4754]: notice: crm_log_args:
>>>     Invoked: cibadmin -p -R -o resources
>>> Jul  1 19:04:41 node2 kernel: [  494.932537] events: mcg drbd: 3
>>> Jul  1 19:04:41 node2 kernel: [  494.943147] drbd: initialized.
>>>     Version: 8.4.3 (api:1/proto:86-101)
>>> Jul  1 19:04:41 node2 kernel: [  494.943151] drbd: GIT-hash:
>>>     89a294209144b68adb3ee85a73221f964d3ee515 build by phil at
>>>     fat-tyre, 2013-02-05 15:35:49
>>> Jul  1 19:04:41 node2 kernel: [  494.943153] drbd: registered as
>>>     block device major 147
>>> Jul  1 19:04:42 node2 kernel: [  495.981244] d-con r0: Starting
>>>     worker thread (from drbdsetup [4801])
>>> Jul  1 19:04:42 node2 kernel: [  495.981560] block drbd0: disk(
>>>     Diskless -> Attaching )
>>> Jul  1 19:04:42 node2 kernel: [  495.982168] d-con r0: Method to
>>>     ensure write ordering: flush
>>> Jul  1 19:04:42 node2 kernel: [  495.982174] block drbd0: max BIO
>>>     size = 1048576
>>> Jul  1 19:04:42 node2 kernel: [  495.982179] block drbd0:
>>>     drbd_bm_resize called with capacity == 4192056
>>> Jul  1 19:04:42 node2 kernel: [  495.982201] block drbd0: resync
>>>     bitmap: bits=524007 words=8188 pages=16
>>> Jul  1 19:04:42 node2 kernel: [  495.982204] block drbd0: size =
>>>     2047 MB (2096028 KB)
>>> Jul  1 19:04:42 node2 kernel: [  495.983736] block drbd0: bitmap
>>>     READ of 16 pages took 1 jiffies
>>> Jul  1 19:04:42 node2 kernel: [  495.983757] block drbd0:
>>>     recounting of set bits took additional 0 jiffies
>>> Jul  1 19:04:42 node2 kernel: [  495.983760] block drbd0: 0 KB
>>>     (0 bits) marked out-of-sync by on disk bit-map.
>>> Jul  1 19:04:42 node2 kernel: [  495.983767] block drbd0: disk(
>>>     Attaching -> UpToDate )
>>> Jul  1 19:04:42 node2 kernel: [  495.983771] block drbd0: attached
>>>     to UUIDs 62EE6E5BA23AC477:37CECFD41B2C30A4:1B8441319CED9865:1B8341319CED9865
>>> Jul  1 19:04:42 node2 attrd[4231]: notice: attrd_trigger_update:
>>>     Sending flush op to all hosts for: master-p-drbd (1000)
>>> Jul  1 19:04:42 node2 attrd[4231]: notice: attrd_perform_update:
>>>     Sent update 24: master-p-drbd=1000
>>> Jul  1 19:04:42 node2 attrd[4231]: notice: attrd_perform_update:
>>>     Sent update 27: master-p-drbd=1000
>>> Jul  1 19:04:42 node2 crmd[4233]: notice: process_lrm_event: LRM
>>>     operation p-drbd_start_0 (call=68, rc=0, cib-update=18,
>>>     confirmed=true) ok
>>> Jul  1 19:04:42 node2 kernel: [  495.993653] d-con r0: conn(
>>>     StandAlone -> Unconnected )
>>> Jul  1 19:04:42 node2 kernel: [  496.044937] d-con r0: Starting
>>>     receiver thread (from drbd_w_r0 [4802])
>>> Jul  1 19:04:42 node2 kernel: [  496.045820] d-con r0: receiver (re)started
>>> Jul  1 19:04:42 node2 kernel: [  496.045830] d-con r0: conn(
>>>     Unconnected -> WFConnection )
>>> Jul  1 19:04:42 node2 crmd[4233]: notice: process_lrm_event: LRM
>>>     operation p-drbd_notify_0 (call=71, rc=0, cib-update=0,
>>>     confirmed=true) ok
>>> Jul  1 19:04:42 node2 crmd[4233]: notice: process_lrm_event: LRM
>>>     operation p-drbd_notify_0 (call=74, rc=0, cib-update=0,
>>>     confirmed=true) ok
>>> Jul  1 19:04:42 node2 crmd[4233]: notice: process_lrm_event: LRM
>>>     operation p-drbd_promote_0 (call=77, rc=0, cib-update=19,
>>>     confirmed=true) ok
>>> Jul  1 19:04:42 node2 kernel: [  496.197480] block drbd0: role(
>>>     Secondary -> Primary )
>>> Jul  1 19:04:42 node2 attrd[4231]: notice: attrd_trigger_update:
>>>     Sending flush op to all hosts for: master-p-drbd (10000)
>>> Jul  1 19:04:42 node2 attrd[4231]: notice: attrd_perform_update:
>>>     Sent update 31: master-p-drbd=10000
>>> Jul  1 19:04:42 node2 crmd[4233]: notice: process_lrm_event: LRM
>>>     operation p-drbd_notify_0 (call=80, rc=0, cib-update=0,
>>>     confirmed=true) ok
>>> Jul  1 19:04:42 node2 crmd[4233]: notice: process_lrm_event: LRM
>>>     operation p-drbd_monitor_50000 (call=83, rc=8, cib-update=20,
>>>     confirmed=false) master
>>> Jul  1 19:04:42 node2 crmd[4233]: notice: process_lrm_event:
>>>     node2-p-drbd_monitor_50000:83 [ ]
>>> Jul  1 19:04:42 node2 kernel: [  496.342704] d-con r0: Handshake
>>>     successful: Agreed network protocol version 101
>>> Jul  1 19:04:42 node2 kernel: [  496.342890] d-con r0: conn(
>>>     WFConnection -> WFReportParams )
>>> Jul  1 19:04:42 node2 kernel: [  496.342893] d-con r0: Starting
>>>     asender thread (from drbd_r_r0 [4821])
>>> Jul  1 19:04:42 node2 kernel: [  496.356028] block drbd0:
>>>     drbd_sync_handshake:
>>> Jul  1 19:04:42 node2 kernel: [  496.356033] block drbd0: self
>>>     62EE6E5BA23AC477:37CECFD41B2C30A4:1B8441319CED9865:1B8341319CED9865
>>>     bits:0 flags:0
>>> Jul  1 19:04:42 node2 kernel: [  496.356035] block drbd0: peer
>>>     20FA2D65F94F24B7:37CECFD41B2C30A5:1B8441319CED9865:1B8341319CED9865
>>>     bits:0 flags:0
>>> Jul  1 19:04:42 node2 kernel: [  496.356038] block drbd0:
>>>     uuid_compare()=100 by rule 90
>>> Jul  1 19:04:42 node2 kernel: [  496.356041] block drbd0: helper
>>>     command: /sbin/drbdadm initial-split-brain minor-0
>>> Jul  1 19:04:42 node2 kernel: [  496.358760] block drbd0: helper
>>>     command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
>>> Jul  1 19:04:42 node2 kernel: [  496.358776] block drbd0:
>>>     Split-Brain detected but unresolved, dropping connection!
>>> Jul  1 19:04:42 node2 kernel: [  496.358811] block drbd0: helper
>>>     command: /sbin/drbdadm split-brain minor-0
>>> Jul  1 19:04:42 node2 notify-split-brain.sh[4966]: invoked for r0/0 (drbd0)
>>> Jul  1 19:04:42 node2 kernel: [  496.385210] d-con r0: meta
>>>     connection shut down by peer.
>>> Jul  1 19:04:42 node2 kernel: [  496.385225] d-con r0: conn(
>>>     WFReportParams -> NetworkFailure )
>>> Jul  1 19:04:42 node2 kernel: [  496.385228] d-con r0: asender terminated
>>> Jul  1 19:04:42 node2 kernel: [  496.385229] d-con r0: Terminating drbd_a_r0
>>> Jul  1 19:04:42 node2 kernel: [  496.389939] block drbd0: helper
>>>     command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
>>> Jul  1 19:04:42 node2 kernel: [  496.389961] d-con r0: conn(
>>>     NetworkFailure -> Disconnecting )
>>> Jul  1 19:04:42 node2 kernel: [  496.389964] d-con r0: error
>>>     receiving ReportState, e: -5 l: 0!
>>> Jul  1 19:04:42 node2 kernel: [  496.390147] d-con r0: Connection closed
>>> Jul  1 19:04:42 node2 kernel: [  496.390174] d-con r0: conn(
>>>     Disconnecting -> StandAlone )
>>> Jul  1 19:04:42 node2 kernel: [  496.390176] d-con r0: receiver terminated
>>> Jul  1 19:04:42 node2 kernel: [  496.390177] d-con r0: Terminating drbd_r_r0
>>>
>>> --
>>> Regards,
>>>
>>> Muhammad Sharfuddin
>>>
>>> _______________________________________________
>>> drbd-user mailing list
>>> drbd-user at lists.linbit.com
>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>
> Thanks for the quick help. sbd stonith was already configured properly,
> and per your recommendation I googled for the handlers and found the
> following working perfectly:
>
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>
> Thanks a lot ;-)
>
> --
> Regards,
>
> Muhammad Sharfuddin

Glad it helped, cheers

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
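
Putting the resolution together: Digimer's 'fencing resource-and-stonith;'
advice and the two handlers Muhammad quotes are the standard resource-level
fencing setup from the DRBD 8.4 user guide. Assembled from the thread, the
relevant additions look roughly like this (a sketch, not Muhammad's exact
final config; the handler paths are the stock drbd-utils scripts and may
differ on your installation):

    resource r0 {
        disk {
            # on loss of the replication link, suspend I/O and call
            # the fence-peer handler before resuming
            fencing resource-and-stonith;
        }
        handlers {
            # adds a pacemaker location constraint that blocks
            # promotion of the disconnected/outdated peer
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # removes that constraint once the peer has resynced
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        # ... existing device/on/syncer sections as quoted above ...
    }

Note that in 8.4 'fencing' lives in the disk section (it was a net option
in 8.3). It is also worth remembering the user guide's recommendation to
leave promotion entirely to the cluster manager, which makes
'become-primary-on both;' in the startup section above redundant at best
when pacemaker owns the resource.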
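And Digimer's fencing test, written out as a hypothetical session
(assumes a crmsh-managed cluster with sbd stonith, as in the thread;
log locations vary by distribution):

    # one-time: make sure pacemaker is allowed to fence at all
    crm configure property stonith-enabled=true

    # on the node under test: hard-crash the kernel; a correctly
    # fenced node gets power-cycled by its peer, not left hanging
    echo c > /proc/sysrq-trigger

    # on the surviving node: confirm the fence fired and the peer reboots
    crm_mon -1
    grep -iE 'stonith|fenc' /var/log/messages

The ordering in the advice matters: get stonith demonstrably working on
each node first, and only then layer DRBD's 'fencing
resource-and-stonith;' on top of it.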