Hi all,

Does anyone know a way to freeze the RAID controller or block its I/O?
I'd like to reproduce our problem to check whether ko-count fixes it.

Thanks for your help,
Matthieu

On 10/03/14 09:44, Matthieu Lejeune wrote:
> Hi,
>
> Thanks for your reply.
>
> If I modify the configuration like this in global_common.conf:
>
> global {
>     usage-count yes;
>     # minor-count dialog-refresh disable-ip-verification
> }
> common {
>     protocol C;
>     handlers {
>         # The following 3 handlers were disabled due to #576511.
>         # Please check the DRBD manual and enable them, if they make sense in your setup.
>         # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>         # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>         # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>         # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>         # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>         # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>         # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
>         # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
>     }
>     startup {
>         # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
>     }
>     disk {
>         # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
>         # no-disk-drain no-md-flushes max-bio-bvecs
>     }
>     net {
>         ko-count 2;
>         timeout 50;
>         # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
>         # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
>         # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
>     }
>
>     syncer {
>         # rate after al-extents use-rle cpu-mask verify-alg csums-alg
>     }
> }
>
> If I apply this config on the secondary node, will the slave be
> disconnected cleanly when we have hardware problems like the one in my
> previous post?
>
> Thanks
>
> Matthieu Lejeune
>
>
> On 5/03/14 11:32, Philip Gaw wrote:
>>> Hi Matthieu,
>>>
>>> On 05/03/2014 07:29, Matthieu Lejeune wrote:
>>>> Hi all,
>>>>
>>>> I had a problem last night with a DRBD primary/slave pair.
>>>>
>>>> The slave experienced a hardware issue (the LSI controller froze).
>>>> It seems the master held I/O, waiting for the slave to respond,
>>>> until the timeout.
>>>>
>>>> This caused all targets exported through InfiniBand to be
>>>> disconnected from the master.
>>>>
>>>> So, in practice, the master stopped responding because of a failure
>>>> on the slave.
>>>>
>>>> I had to hard reboot (power cycle) the slave because udev wasn't
>>>> responding and did not allow a normal reboot.
>>>> After the slave rebooted, DRBD did reconnect. Its status was pri/sec
>>>> uptodate/uptodate.
>>>> But the LSI controller immediately timed out, causing the same issue
>>>> a second time.
>>>>
>>>> How can we prevent an issue on the slave from impacting the master?
>>>>
>>> have a look at ko-count
>>>
>>> ko-count number
>>>
>>>     In case the secondary node fails to complete a single write
>>>     request for count times the timeout, it is expelled from
>>>     the cluster. (I.e. the primary node goes into StandAlone mode.)
>>>     The default value is 0, which disables this feature.
>>>
>>> http://www.drbd.org/users-guide/re-drbdconf.html
>>>
>>>>
>>>> Thank you.
>>>> Matthieu Lejeune
>>>>
>>>>
>>>> drbd8-utils: 2:8.3.13-2 amd64 (RAID 1 over tcp/ip for Linux utilities)
>>>> Debian:
>>>> root@ifprdstor8a:~/trunk# cat /proc/version
>>>> Linux version 3.2.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.51-1
>>>> root@ifprdstor8a:~/trunk#
>>>>
>>>> We are using scst/srpt, trunk version of 7 January 2014.
>>>>
>>>> Here is the config.
>>>>
>>>> DRBD global:
>>>>
>>>> root@ifprdstor8a:/etc/drbd.d# cat global_common.conf
>>>> global {
>>>>     usage-count yes;
>>>>     # minor-count dialog-refresh disable-ip-verification
>>>> }
>>>>
>>>> common {
>>>>     protocol C;
>>>>
>>>>     handlers {
>>>>         # The following 3 handlers were disabled due to #576511.
>>>>         # Please check the DRBD manual and enable them, if they make sense in your setup.
>>>>         # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>>>>         # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>>>>         # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>>>>
>>>>         # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>>>>         # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>>>>         # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>>>>         # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
>>>>         # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
>>>>     }
>>>>
>>>>     startup {
>>>>         # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
>>>>     }
>>>>
>>>>     disk {
>>>>         # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
>>>>         # no-disk-drain no-md-flushes max-bio-bvecs
>>>>     }
>>>>
>>>>     net {
>>>>         # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
>>>>         # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
>>>>         # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
>>>>     }
>>>>
>>>>     syncer {
>>>>         # rate after al-extents use-rle cpu-mask verify-alg csums-alg
>>>>     }
>>>> }
>>>>
>>>> Resource configuration:
>>>>
>>>> root@ifprdstor8a:/etc/drbd.d# cat DSA801.res
>>>> resource DSA801 {
>>>>     protocol C;
>>>>
>>>>     startup {
>>>>         wfc-timeout 0;
>>>>     }
>>>>
>>>>     disk {
>>>>         on-io-error detach;
>>>>     }
>>>>
>>>>     syncer {
>>>>         rate 400M;
>>>>         verify-alg md5;
>>>>     }
>>>>
>>>>     on ifprdstor8a {
>>>>         device /dev/drbd1;
>>>>         disk /dev/sda;
>>>>         address 10.13.1.5:7788;
>>>>         meta-disk internal;
>>>>     }
>>>>
>>>>     on ifprdstor8b {
>>>>         device /dev/drbd1;
>>>>         disk /dev/sda;
>>>>         address 10.13.1.6:7788;
>>>>         meta-disk internal;
>>>>     }
>>>> }
>>>>
>>
>> _______________________________________________
>> drbd-user mailing list
>> drbd-user at lists.linbit.com
>> http://lists.linbit.com/mailman/listinfo/drbd-user
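
For reference, the ko-count description quoted in this thread says the peer is expelled after failing to complete a write for count times the timeout. Here is a rough sketch of that arithmetic in Python, assuming the timeout value is expressed in tenths of a second as the drbd.conf man page describes. The function name expel_window_seconds is made up for illustration and is not part of DRBD.

```python
def expel_window_seconds(ko_count, timeout_tenths):
    """Approximate time before the primary gives up on a stalled peer.

    ko_count       -- the ko-count value from the net section
    timeout_tenths -- the timeout value, assumed to be in units of 0.1 s

    Returns None when ko_count is 0, since the quoted documentation
    says that value disables the feature.
    """
    if ko_count == 0:
        return None
    return ko_count * timeout_tenths / 10.0


# Values from the net section proposed in this thread.
print(expel_window_seconds(2, 50))
```

With ko-count 2 and timeout 50, that comes to roughly 10 seconds of stalled writes before the primary would drop the peer, assuming nothing else (such as the iSCSI/SRP initiators' own timeouts) gives up first.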