Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 01/07/15 02:50 PM, Muhammad Sharfuddin wrote:
> On 07/01/2015 10:05 PM, Digimer wrote:
>> You need to set up proper, working fencing.
>>
>> Configure stonith in pacemaker, test it by crashing each node (echo c >
>> /proc/sysrq-trigger) and watch that they get rebooted. Once that works,
>> configure DRBD to use 'fencing resource-and-stonith;' and use the
>> 'crm-{un,}fence-peer.sh' fence/unfence handlers.
>>
>> digimer
>>
>> On 01/07/15 11:21 AM, Muhammad Sharfuddin wrote:
>>> Hello,
>>> First I set up a working dual-primary setup, then I configured the
>>> pacemaker cluster resource to start the drbd resource. As soon as the
>>> cluster starts the drbd resource, a split-brain occurs; please let me
>>> know what I am doing wrong.
>>>
>>> Here is the drbd configuration:
>>>
>>> global_common.conf:
>>>
>>> global { usage-count no; }
>>> common {
>>>     handlers {
>>>         pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh;
>>>             /usr/lib/drbd/notify-emergency-reboot.sh;
>>>             echo b > /proc/sysrq-trigger ; reboot -f";
>>>         pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh;
>>>             /usr/lib/drbd/notify-emergency-reboot.sh;
>>>             echo b > /proc/sysrq-trigger ; reboot -f";
>>>         local-io-error "/usr/lib/drbd/notify-io-error.sh;
>>>             /usr/lib/drbd/notify-emergency-shutdown.sh;
>>>             echo o > /proc/sysrq-trigger ; halt -f";
>>>         split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>>>     }
>>>
>>>     startup { wfc-timeout 0; degr-wfc-timeout 120; become-primary-on both; }
>>>
>>>     disk { on-io-error detach; al-extents 3389; }
>>>
>>>     net {
>>>         allow-two-primaries; after-sb-0pri discard-zero-changes;
>>>         after-sb-1pri discard-secondary; after-sb-2pri disconnect;
>>>         max-buffers 8000; max-epoch-size 8000;
>>>         sndbuf-size 0; verify-alg md5;
>>>         ping-int 2; ping-timeout 2;
>>>         connect-int 2; timeout 5; ko-count 5;
>>>     }
>>> }
>>>
>>> r0.res:
>>>
>>> resource r0 {
>>>     device /dev/drbd_r0 minor 0;
>>>     meta-disk internal;
>>>     on node1 {
>>>         disk "/dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0-part1";
>>>         address 172.16.241.131:7780;
>>>     }
>>>     on node2 {
>>>         disk "/dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0-part1";
>>>         address 172.16.241.132:7780;
>>>     }
>>>     syncer { rate 100M; }
>>> }
>>>
>>> Below is the cluster drbd resource configuration:
>>>
>>> primitive p-drbd ocf:linbit:drbd \
>>>     params drbd_resource="r0" \
>>>     op monitor interval="50" role="Master" timeout="30" \
>>>     op monitor interval="60" role="Slave" timeout="30" \
>>>     op start interval="0" timeout="240" \
>>>     op stop interval="0" timeout="100"
>>> ms ms-drbd p-drbd \
>>>     meta master-max="2" clone-max="2" notify="true" interleave="true"
>>>
>>> When the cluster starts the drbd resource, /var/log/messages shows:
>>>
>>> Jul  1 19:04:40 node2 cibadmin[4754]: notice: crm_log_args:
>>>     Invoked: cibadmin -p -R -o resources
>>> Jul  1 19:04:41 node2 kernel: [  494.932537] events: mcg drbd: 3
>>> Jul  1 19:04:41 node2 kernel: [  494.943147] drbd: initialized.
>>>     Version: 8.4.3 (api:1/proto:86-101)
>>> Jul  1 19:04:41 node2 kernel: [  494.943151] drbd: GIT-hash:
>>>     89a294209144b68adb3ee85a73221f964d3ee515 build by phil at
>>>     fat-tyre, 2013-02-05 15:35:49
>>> Jul  1 19:04:41 node2 kernel: [  494.943153] drbd: registered as
>>>     block device major 147
>>> Jul  1 19:04:42 node2 kernel: [  495.981244] d-con r0: Starting
>>>     worker thread (from drbdsetup [4801])
>>> Jul  1 19:04:42 node2 kernel: [  495.981560] block drbd0: disk(
>>>     Diskless -> Attaching )
>>> Jul  1 19:04:42 node2 kernel: [  495.982168] d-con r0: Method to
>>>     ensure write ordering: flush
>>> Jul  1 19:04:42 node2 kernel: [  495.982174] block drbd0: max BIO
>>>     size = 1048576
>>> Jul  1 19:04:42 node2 kernel: [  495.982179] block drbd0:
>>>     drbd_bm_resize called with capacity == 4192056
>>> Jul  1 19:04:42 node2 kernel: [  495.982201] block drbd0: resync
>>>     bitmap: bits=524007 words=8188 pages=16
>>> Jul  1 19:04:42 node2 kernel: [  495.982204] block drbd0: size =
>>>     2047 MB (2096028 KB)
>>> Jul  1 19:04:42 node2 kernel: [  495.983736] block drbd0: bitmap
>>>     READ of 16 pages took 1 jiffies
>>> Jul  1 19:04:42 node2 kernel: [  495.983757] block drbd0:
>>>     recounting of set bits took additional 0 jiffies
>>> Jul  1 19:04:42 node2 kernel: [  495.983760] block drbd0: 0 KB
>>>     (0 bits) marked out-of-sync by on disk bit-map.
>>> Jul  1 19:04:42 node2 kernel: [  495.983767] block drbd0: disk(
>>>     Attaching -> UpToDate )
>>> Jul  1 19:04:42 node2 kernel: [  495.983771] block drbd0: attached
>>>     to UUIDs 62EE6E5BA23AC477:37CECFD41B2C30A4:1B8441319CED9865:1B8341319CED9865
>>> Jul  1 19:04:42 node2 attrd[4231]: notice: attrd_trigger_update:
>>>     Sending flush op to all hosts for: master-p-drbd (1000)
>>> Jul  1 19:04:42 node2 attrd[4231]: notice: attrd_perform_update:
>>>     Sent update 24: master-p-drbd=1000
>>> Jul  1 19:04:42 node2 attrd[4231]: notice: attrd_perform_update:
>>>     Sent update 27: master-p-drbd=1000
>>> Jul  1 19:04:42 node2 crmd[4233]: notice: process_lrm_event: LRM
>>>     operation p-drbd_start_0 (call=68, rc=0, cib-update=18,
>>>     confirmed=true) ok
>>> Jul  1 19:04:42 node2 kernel: [  495.993653] d-con r0: conn(
>>>     StandAlone -> Unconnected )
>>> Jul  1 19:04:42 node2 kernel: [  496.044937] d-con r0: Starting
>>>     receiver thread (from drbd_w_r0 [4802])
>>> Jul  1 19:04:42 node2 kernel: [  496.045820] d-con r0: receiver (re)started
>>> Jul  1 19:04:42 node2 kernel: [  496.045830] d-con r0: conn(
>>>     Unconnected -> WFConnection )
>>> Jul  1 19:04:42 node2 crmd[4233]: notice: process_lrm_event: LRM
>>>     operation p-drbd_notify_0 (call=71, rc=0, cib-update=0,
>>>     confirmed=true) ok
>>> Jul  1 19:04:42 node2 crmd[4233]: notice: process_lrm_event: LRM
>>>     operation p-drbd_notify_0 (call=74, rc=0, cib-update=0,
>>>     confirmed=true) ok
>>> Jul  1 19:04:42 node2 crmd[4233]: notice: process_lrm_event: LRM
>>>     operation p-drbd_promote_0 (call=77, rc=0, cib-update=19,
>>>     confirmed=true) ok
>>> Jul  1 19:04:42 node2 kernel: [  496.197480] block drbd0: role(
>>>     Secondary -> Primary )
>>> Jul  1 19:04:42 node2 attrd[4231]: notice: attrd_trigger_update:
>>>     Sending flush op to all hosts for: master-p-drbd (10000)
>>> Jul  1 19:04:42 node2 attrd[4231]: notice: attrd_perform_update:
>>>     Sent update 31: master-p-drbd=10000
>>> Jul  1 19:04:42 node2 crmd[4233]: notice: process_lrm_event: LRM
>>>     operation p-drbd_notify_0 (call=80, rc=0, cib-update=0,
>>>     confirmed=true) ok
>>> Jul  1 19:04:42 node2 crmd[4233]: notice: process_lrm_event: LRM
>>>     operation p-drbd_monitor_50000 (call=83, rc=8, cib-update=20,
>>>     confirmed=false) master
>>> Jul  1 19:04:42 node2 crmd[4233]: notice: process_lrm_event:
>>>     node2-p-drbd_monitor_50000:83 [ ]
>>> Jul  1 19:04:42 node2 kernel: [  496.342704] d-con r0: Handshake
>>>     successful: Agreed network protocol version 101
>>> Jul  1 19:04:42 node2 kernel: [  496.342890] d-con r0: conn(
>>>     WFConnection -> WFReportParams )
>>> Jul  1 19:04:42 node2 kernel: [  496.342893] d-con r0: Starting
>>>     asender thread (from drbd_r_r0 [4821])
>>> Jul  1 19:04:42 node2 kernel: [  496.356028] block drbd0:
>>>     drbd_sync_handshake:
>>> Jul  1 19:04:42 node2 kernel: [  496.356033] block drbd0: self
>>>     62EE6E5BA23AC477:37CECFD41B2C30A4:1B8441319CED9865:1B8341319CED9865
>>>     bits:0 flags:0
>>> Jul  1 19:04:42 node2 kernel: [  496.356035] block drbd0: peer
>>>     20FA2D65F94F24B7:37CECFD41B2C30A5:1B8441319CED9865:1B8341319CED9865
>>>     bits:0 flags:0
>>> Jul  1 19:04:42 node2 kernel: [  496.356038] block drbd0:
>>>     uuid_compare()=100 by rule 90
>>> Jul  1 19:04:42 node2 kernel: [  496.356041] block drbd0: helper
>>>     command: /sbin/drbdadm initial-split-brain minor-0
>>> Jul  1 19:04:42 node2 kernel: [  496.358760] block drbd0: helper
>>>     command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
>>> Jul  1 19:04:42 node2 kernel: [  496.358776] block drbd0:
>>>     Split-Brain detected but unresolved, dropping connection!
>>> Jul  1 19:04:42 node2 kernel: [  496.358811] block drbd0: helper
>>>     command: /sbin/drbdadm split-brain minor-0
>>> Jul  1 19:04:42 node2 notify-split-brain.sh[4966]: invoked for r0/0 (drbd0)
>>> Jul  1 19:04:42 node2 kernel: [  496.385210] d-con r0: meta
>>>     connection shut down by peer.
>>> Jul  1 19:04:42 node2 kernel: [  496.385225] d-con r0: conn(
>>>     WFReportParams -> NetworkFailure )
>>> Jul  1 19:04:42 node2 kernel: [  496.385228] d-con r0: asender terminated
>>> Jul  1 19:04:42 node2 kernel: [  496.385229] d-con r0: Terminating drbd_a_r0
>>> Jul  1 19:04:42 node2 kernel: [  496.389939] block drbd0: helper
>>>     command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
>>> Jul  1 19:04:42 node2 kernel: [  496.389961] d-con r0: conn(
>>>     NetworkFailure -> Disconnecting )
>>> Jul  1 19:04:42 node2 kernel: [  496.389964] d-con r0: error
>>>     receiving ReportState, e: -5 l: 0!
>>> Jul  1 19:04:42 node2 kernel: [  496.390147] d-con r0: Connection closed
>>> Jul  1 19:04:42 node2 kernel: [  496.390174] d-con r0: conn(
>>>     Disconnecting -> StandAlone )
>>> Jul  1 19:04:42 node2 kernel: [  496.390176] d-con r0: receiver terminated
>>> Jul  1 19:04:42 node2 kernel: [  496.390177] d-con r0: Terminating drbd_r_r0
>>>
>>> --
>>> Regards,
>>>
>>> Muhammad Sharfuddin
>>>
>>> _______________________________________________
>>> drbd-user mailing list
>>> drbd-user at lists.linbit.com
>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>
> Thanks for the quick help. sbd stonith was already configured properly,
> and per your recommendation I googled for the handlers and found the
> following working perfectly:
>
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>
> Thanks a lot ;-)
>
> --
> Regards,
>
> Muhammad Sharfuddin

Glad it helped, cheers

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
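
Putting the resolution together: Digimer's 'fencing resource-and-stonith;'
advice and the two handlers Muhammad quotes are the standard resource-level
fencing setup from the DRBD 8.4 user guide. Assembled from the thread, the
relevant additions look roughly like this (a sketch, not Muhammad's exact
final config; the handler paths are the stock drbd-utils scripts and may
differ on your installation):

    resource r0 {
        disk {
            # on loss of the replication link, suspend I/O and call
            # the fence-peer handler before resuming
            fencing resource-and-stonith;
        }
        handlers {
            # adds a pacemaker location constraint that blocks
            # promotion of the disconnected/outdated peer
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # removes that constraint once the peer has resynced
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        # ... existing device/on/syncer sections as quoted above ...
    }

Note that in 8.4 'fencing' lives in the disk section (it was a net option
in 8.3). It is also worth remembering the user guide's recommendation to
leave promotion entirely to the cluster manager, which makes
'become-primary-on both;' in the startup section above redundant at best
when pacemaker owns the resource.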
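And Digimer's fencing test, written out as a hypothetical session
(assumes a crmsh-managed cluster with sbd stonith, as in the thread;
log locations vary by distribution):

    # one-time: make sure pacemaker is allowed to fence at all
    crm configure property stonith-enabled=true

    # on the node under test: hard-crash the kernel; a correctly
    # fenced node gets power-cycled by its peer, not left hanging
    echo c > /proc/sysrq-trigger

    # on the surviving node: confirm the fence fired and the peer reboots
    crm_mon -1
    grep -iE 'stonith|fenc' /var/log/messages

The ordering in the advice matters: get stonith demonstrably working on
each node first, and only then layer DRBD's 'fencing
resource-and-stonith;' on top of it.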