[DRBD-user] DRBD Master crash on slave HW problem

Mon Mar 10 09:44:59 CET 2014

Hi,

Thanks for your reply.

If I modify global_common.conf like this:

global {
         usage-count yes;
         # minor-count dialog-refresh disable-ip-verification
}
common {
         protocol C;
         handlers {
                 # The following 3 handlers were disabled due to #576511.
                 # Please check the DRBD manual and enable them, if they make sense in your setup.
                 # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                 # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                 # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
                 # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                 # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
                 # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
                 # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
                 # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
         }
         startup {
                 # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
         }
         disk {
                 # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
                 # no-disk-drain no-md-flushes max-bio-bvecs
         }
         net {
                 ko-count 2;
                 timeout 50;
                 # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
                 # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
                 # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
         }

         syncer {
                 # rate after al-extents use-rle cpu-mask verify-alg csums-alg
         }
}
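
To make the two net values concrete, here is a rough sketch of the arithmetic. It assumes that timeout is expressed in tenths of a second, as the drbd.conf man page describes, and that the peer is expelled after ko-count times the timeout on a single write request. The function name is only illustrative and is not part of any DRBD tool.

```python
# Rough estimate of how long the primary waits on a stalled write before
# expelling the peer. Assumption: "timeout" is in tenths of a second and the
# peer is dropped after ko-count times the timeout (per the drbd.conf man page).

def seconds_until_expelled(ko_count: int, timeout_tenths: int) -> float:
    """Return the approximate wait in seconds for the given net settings."""
    if ko_count == 0:
        # ko-count 0 disables the feature: the peer is never expelled this way.
        return float("inf")
    return ko_count * timeout_tenths / 10.0


# Values from the net section above: ko-count 2, timeout 50.
print(seconds_until_expelled(2, 50))  # prints 10.0
```

So with these settings a write that the secondary never completes should block the primary for roughly 10 seconds, if I read the man page correctly.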

If I apply this config on the secondary node as well, will I get a proper 
disconnection of the slave when we have HW problems like the one in my 
previous post?

Thanks

Matthieu Lejeune

On 5/03/14 11:32, Philip Gaw wrote:
>> Hi Matthieu,
>>
>> On 05/03/2014 07:29, Matthieu Lejeune wrote:
>>> Hi all,
>>>
>>> I had a problem this night with a DRBD Primary/Slave.
>>>
>>>
>>> The slave experienced a hardware issue (the LSI controller froze).
>>> It seems the master held I/O, waiting for the slave to respond, until 
>>> the timeout.
>>>
>>>
>>> This caused all targets exported through InfiniBand to be 
>>> disconnected from the master.
>>>
>>>
>>> So, practically, the master stopped responding due to a failure on the 
>>> slave.
>>>
>>> I had to hard reboot (power cycle) the slave because udev wasn't 
>>> responding and did not allow a normal reboot.
>>> After the slave rebooted, DRBD did reconnect. It was in status pri/sec 
>>> uptodate/uptodate.
>>> But the LSI controller immediately timed out, causing the same issue a 
>>> second time.
>>>
>>>
>>> How can we prevent an issue on the slave from impacting the master?
>>>
>> have a look at ko-count
>>
>> ko-count number
>>
>>     In case the secondary node fails to complete a single write
>>     request for count times the timeout, it is expelled from the
>>     cluster. (I.e. the primary node goes into StandAlone mode.) The
>>     default value is 0, which disables this feature.
>>
>>
>> http://www.drbd.org/users-guide/re-drbdconf.html
>>
>>>
>>> Thank you.
>>> Matthieu Lejeune
>>>
>>>
>>> drbd8-utils : 2:8.3.13-2 amd64 RAID 1 over tcp/ip for Linux utilities
>>> Debian :
>>> root@ifprdstor8a:~/trunk# cat /proc/version
>>> Linux version 3.2.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.51-1
>>> root@ifprdstor8a:~/trunk#
>>>
>>> We are using scst/srpt from trunk as of 7 January 2014.
>>>
>>> Here is the config:
>>> drbd global:
>>>
>>> root@ifprdstor8a:/etc/drbd.d# cat global_common.conf
>>> global {
>>>     usage-count yes;
>>>     # minor-count dialog-refresh disable-ip-verification
>>> }
>>>
>>> common {
>>>     protocol C;
>>>
>>>     handlers {
>>>         # The following 3 handlers were disabled due to #576511.
>>>         # Please check the DRBD manual and enable them, if they make sense in your setup.
>>>         # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>>>         # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>>>         # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>>>
>>>         # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>>>         # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>>>         # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>>>         # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
>>>         # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
>>>     }
>>>
>>>     startup {
>>>         # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
>>>     }
>>>
>>>     disk {
>>>         # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
>>>         # no-disk-drain no-md-flushes max-bio-bvecs
>>>     }
>>>
>>>     net {
>>>         # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
>>>         # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
>>>         # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
>>>     }
>>>
>>>     syncer {
>>>         # rate after al-extents use-rle cpu-mask verify-alg csums-alg
>>>     }
>>> }
>>>
>>> Resource configuration:
>>>
>>> root@ifprdstor8a:/etc/drbd.d# cat DSA801.res
>>> resource DSA801 {
>>>   protocol C;
>>>
>>>   startup {
>>>     wfc-timeout 0;
>>>   }
>>>
>>>   disk {
>>>     on-io-error detach;
>>>   }
>>>
>>>   syncer {
>>>     rate 400M;
>>>     verify-alg md5;
>>>   }
>>>
>>>   on ifprdstor8a {
>>>     device    /dev/drbd1;
>>>     disk      /dev/sda;
>>>     address   10.13.1.5:7788;
>>>     meta-disk internal;
>>>   }
>>>
>>>   on ifprdstor8b {
>>>     device    /dev/drbd1;
>>>     disk      /dev/sda;
>>>     address   10.13.1.6:7788;
>>>     meta-disk internal;
>>>   }
>>> }
>>>