Hello all again.
Following up on the issue described below: with the integrity check
enabled, I used to get a crash at least once every 24 hours.
Now, with the integrity check disabled, the cluster has been running
without crashes for the last 9 days.
Could someone kindly provide some hints about the possible reasons for
this behavior?
Off-loading is disabled on both dedicated gigabit NICs.
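For reference, this is roughly how I checked it. A sketch with ethtool
(eth1 is a placeholder for the dedicated DRBD NIC; the guard lets the
snippet run harmlessly on a box without ethtool or that interface):

```shell
IF=eth1   # placeholder: substitute the dedicated DRBD NIC
if command -v ethtool >/dev/null 2>&1; then
    # Show current offload settings (checksum offload, TSO, ...)
    ethtool -k "$IF" || true
    # Disable TX/RX checksum offload and TSO; buggy offload engines on
    # some NICs are a known source of "Digest integrity check FAILED"
    # when data-integrity-alg is enabled
    ethtool -K "$IF" tx off rx off tso off || true
else
    echo "ethtool not installed"
fi
echo done
```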
Also, is the integrity check really needed (I have read the
documentation :) ) if it keeps breaking the cluster?
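If the per-packet check stays off, a lighter-weight alternative I am
considering (a sketch, assuming DRBD 8.3 syntax as in the config below)
is periodic online verification instead of data-integrity-alg:

```
# drbd.conf sketch: verify the devices out-of-band rather than
# checksumming every replicated packet
common {
        syncer {
                verify-alg sha1;   # enables "drbdadm verify <resource>"
        }
}
```

A verify run could then be scheduled from cron (e.g. weekly
"/sbin/drbdadm verify r2"); out-of-sync blocks are reported in the
kernel log and get resynced on the next disconnect/reconnect.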
Thank you all for your time.
Theophanis Kontogiannis
On Tue, 2009-10-20 at 20:31 +0300, Theophanis Kontogiannis wrote:
> Hello all,
>
> Eventually I managed to capture a log during a DRBD crash.
>
> I have a two-node RHEL 5.3 cluster running kernel 2.6.18-164.el5xen
> and self-compiled drbd-8.3.1-3.
>
> Both nodes have a dedicated 1G Ethernet back-to-back connection over
> RTL8169sb/8110sb cards.
>
> When I ran applications that constantly read or write to the disks
> (active/active configuration), DRBD kept crashing.
>
> Now I have logs showing the reason:
>
>
> ______________________
> ON TWEETY1
>
> Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check
> FAILED.
> Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check
> FAILED.
> Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!
> Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!
> Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown )
> conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
> susp( 0 -> 1 )
> Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown )
> conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
> susp( 0 -> 1 )
> Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
> Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
> Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
> Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
> Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
> Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
> Oct 20 15:46:52 localhost clurgmgrd: [4161]: <info>
> Executing /etc/init.d/drbd status
> Oct 20 15:46:52 localhost clurgmgrd: [4161]: <info>
> Executing /etc/init.d/drbd status
> Oct 20 15:46:52 localhost kernel: drbd2: Connection closed
> Oct 20 15:46:52 localhost kernel: drbd2: Connection closed
>
> ___________________________
>
> ON TWEETY2
>
>
> Oct 20 15:46:52 localhost kernel: drbd2: sock was reset by peer
> Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown )
> conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0
> -> 1 )
> Oct 20 15:46:52 localhost kernel: drbd2: short read expecting header
> on sock: r=-104
> Oct 20 15:46:52 localhost kernel: drbd2: meta connection shut down by
> peer.
> Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
> Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
> Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
> Oct 20 15:46:52 localhost kernel: drbd2: Connection closed
> Oct 20 15:46:52 localhost kernel: drbd2: helper command: /sbin/drbdadm
> fence-peer minor-2
>
> ____________________
>
>
> DRBD.CONF
>
>
> #
> # drbd.conf
> #
>
>
> global {
>
> usage-count yes;
> }
>
>
> common {
>
> protocol C;
>
> syncer {
>
> rate 100M;
>
> al-extents 257;
> }
>
>
> handlers {
>
> pri-on-incon-degr "echo b > /proc/sysrq-trigger ; reboot -f";
>
> pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
>
> local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
>
> outdate-peer "/sbin/obliterate";
>
>
> pri-lost "echo pri-lost. Have a look at the log files. | mail -s
> 'DRBD Alert' root; echo b > /proc/sysrq-trigger ; reboot -f";
>
> split-brain "echo split-brain. drbdadm -- --discard-my-data
> connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
>
> }
>
> startup {
>
> wfc-timeout 60;
>
>
> degr-wfc-timeout 60; # 1 minute.
>
>
> become-primary-on both;
>
> }
>
> disk {
>
> fencing resource-and-stonith;
>
>
> }
>
> net {
>
> sndbuf-size 512k;
>
> timeout 60; # 6 seconds (unit = 0.1 seconds)
> connect-int 10; # 10 seconds (unit = 1 second)
> ping-int 10; # 10 seconds (unit = 1 second)
> ping-timeout 50; # 5 seconds (unit = 0.1 seconds)
>
> max-buffers 2048;
>
> max-epoch-size 2048;
>
> ko-count 10;
>
>
> allow-two-primaries;
>
>
> cram-hmac-alg "sha1";
> shared-secret "*****";
>
>
> after-sb-0pri discard-least-changes;
>
> after-sb-1pri violently-as0p;
>
>
> after-sb-2pri violently-as0p;
>
>
> rr-conflict call-pri-lost;
>
>
> data-integrity-alg "crc32c";
>
> }
>
>
> }
>
>
> resource r0 {
>
> device /dev/drbd0;
> disk /dev/hda4;
> meta-disk internal;
>
> on tweety-1 { address 10.254.254.253:7788; }
>
> on tweety-2 { address 10.254.254.254:7788; }
>
> }
>
> resource r1 {
>
> device /dev/drbd1;
> disk /dev/hdb4;
> meta-disk internal;
>
> on tweety-1 { address 10.254.254.253:7789; }
>
> on tweety-2 { address 10.254.254.254:7789; }
> }
>
> resource r2 {
>
> device /dev/drbd2;
> disk /dev/sda1;
> meta-disk internal;
>
> on tweety-1 { address 10.254.254.253:7790; }
>
> on tweety-2 { address 10.254.254.254:7790; }
> }
>
> _________
>
> Also available at http://pastebin.ca/1633173
>
>
> How can I solve this?
>
> Thank you all for your time.
>
>