[DRBD-user] DRBD Master crash on slave HW problem

Tue Mar 11 11:02:29 CET 2014

Hi all,

Does anyone know a trick to freeze the RAID controller or to block 
the I/O?

I'd like to reproduce our problem to check whether ko-count fixes it.

Thanks for your help

Matthieu

On 10/03/14 09:44, Matthieu Lejeune wrote:
> Hi,
>
> Thanks for your reply.
>
> If I modify global_common.conf like this:
>
> global {
>         usage-count yes;
>         # minor-count dialog-refresh disable-ip-verification
> }
> common {
>         protocol C;
>         handlers {
>                 # The following 3 handlers were disabled due to #576511.
>                 # Please check the DRBD manual and enable them, if they make sense in your setup.
>                 # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>                 # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>                 # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>                 # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>                 # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>                 # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>                 # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
>                 # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
>         }
>         startup {
>                 # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
>         }
>         disk {
>                 # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
>                 # no-disk-drain no-md-flushes max-bio-bvecs
>         }
>         net {
>                 ko-count 2;
>                 timeout 50;
>                 # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
>                 # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
>                 # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
>         }
>
>         syncer {
>                 # rate after al-extents use-rle cpu-mask verify-alg csums-alg
>         }
> }
>
> If I apply this config on the secondary node, will the slave be 
> disconnected cleanly when we have hardware problems like the one in my 
> previous post?
>
> Thanks
>
> Matthieu Lejeune
>
>
>
>
> On 5/03/14 11:32, Philip Gaw wrote:
>>> Hi Matthieu,
>>>
>>> On 05/03/2014 07:29, Matthieu Lejeune wrote:
>>>> Hi all,
>>>>
>>>> I had a problem last night with a DRBD primary/slave pair.
>>>>
>>>>
>>>> The slave experienced a hardware issue (the LSI controller froze).
>>>> It seems the master held I/O while waiting for the slave to respond, 
>>>> until the timeout.
>>>>
>>>>
>>>> This caused all targets exported through InfiniBand to be 
>>>> disconnected from the master.
>>>>
>>>>
>>>> So, in practice, the master stopped responding because of a failure 
>>>> on the slave.
>>>>
>>>> I had to hard reboot (power cycle) the slave because udev wasn't 
>>>> responding and did not allow a normal reboot.
>>>> After the slave rebooted, DRBD reconnected. Its status was pri/sec 
>>>> uptodate/uptodate.
>>>> But the LSI controller immediately timed out, causing the same issue 
>>>> a second time.
>>>>
>>>>
>>>> How can we prevent an issue on the slave from impacting the master?
>>>>
>>> have a look at ko-count
>>>
>>> ko-count number
>>>
>>>     In case the secondary node fails to complete a single write
>>>     request for count times the timeout, it is expelled from
>>>     the cluster. (I.e. the primary node goes into StandAlone mode.)
>>>     The default value is 0, which disables this feature.
>>>
>>>
>>> http://www.drbd.org/users-guide/re-drbdconf.html
>>>
>>>>
>>>> Thank you.
>>>> Matthieu Lejeune
>>>>
>>>>
>>>> drbd8-utils : 2:8.3.13-2    amd64    RAID 1 over tcp/ip for Linux utilities
>>>> Debian :
>>>> root@ifprdstor8a:~/trunk# cat /proc/version
>>>> Linux version 3.2.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.51-1
>>>> root@ifprdstor8a:~/trunk#
>>>>
>>>> We are using SCST/srpt, trunk version of 7 January 2014.
>>>>
>>>> Here is the config:
>>>> DRBD global configuration:
>>>>
>>>> root@ifprdstor8a:/etc/drbd.d# cat global_common.conf
>>>> global {
>>>>     usage-count yes;
>>>>     # minor-count dialog-refresh disable-ip-verification
>>>> }
>>>>
>>>> common {
>>>>     protocol C;
>>>>
>>>>     handlers {
>>>>         # The following 3 handlers were disabled due to #576511.
>>>>         # Please check the DRBD manual and enable them, if they make sense in your setup.
>>>>         # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>>>>         # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>>>>         # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>>>>
>>>>         # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>>>>         # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>>>>         # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>>>>         # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
>>>>         # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
>>>>     }
>>>>
>>>>     startup {
>>>>         # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
>>>>     }
>>>>
>>>>     disk {
>>>>         # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
>>>>         # no-disk-drain no-md-flushes max-bio-bvecs
>>>>     }
>>>>
>>>>     net {
>>>>         # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
>>>>         # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
>>>>         # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
>>>>     }
>>>>
>>>>     syncer {
>>>>         # rate after al-extents use-rle cpu-mask verify-alg csums-alg
>>>>     }
>>>> }
>>>>
>>>> Resource configuration:
>>>>
>>>> root@ifprdstor8a:/etc/drbd.d# cat DSA801.res
>>>> resource DSA801 {
>>>>   protocol C;
>>>>
>>>>   startup {
>>>>     wfc-timeout 0;
>>>>   }
>>>>
>>>>   disk {
>>>>     on-io-error detach;
>>>>   }
>>>>
>>>>   syncer {
>>>>     rate 400M;
>>>>     verify-alg md5;
>>>>   }
>>>>
>>>>   on ifprdstor8a {
>>>>     device    /dev/drbd1;
>>>>     disk      /dev/sda;
>>>>     address   10.13.1.5:7788;
>>>>     meta-disk internal;
>>>>   }
>>>>
>>>>   on ifprdstor8b {
>>>>     device    /dev/drbd1;
>>>>     disk      /dev/sda;
>>>>     address   10.13.1.6:7788;
>>>>     meta-disk internal;
>>>>   }
>>>> }
>>>>
>>
>>
>>
>> _______________________________________________
>> drbd-user mailing list
>> drbd-user at lists.linbit.com
>> http://lists.linbit.com/mailman/listinfo/drbd-user

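A rough worked example of the ko-count arithmetic for the net settings proposed in this thread (ko-count 2, timeout 50). It assumes the drbd.conf convention that timeout is given in units of 0.1 seconds. The variable names are only illustrative, and the result is an approximate window, not a guaranteed failover time.

```shell
#!/bin/sh
# Values taken from the proposed net section quoted in this thread.
ko_count=2
timeout_tenths=50   # assumption: drbd.conf "timeout" is in units of 0.1 s

# Approximate time a single write may stay unacknowledged by the
# secondary before the primary expels it: ko-count times the timeout.
window_tenths=$((ko_count * timeout_tenths))

# Format tenths of a second as seconds with one decimal, using integer math only.
window="$((window_tenths / 10)).$((window_tenths % 10))"

echo "peer expelled after roughly ${window} s without a write acknowledgement"
```

With these values the primary should drop the peer roughly 10 seconds after a write stops being acknowledged. The resulting connection state can then be checked on the primary with cat /proc/drbd.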