[DRBD-user] DRBD crash on two nodes cluster. Some help please?

Theophanis Kontogiannis theophanis_kontogiannis at yahoo.com
Tue Oct 20 19:31:44 CEST 2009



Hello all,

I finally managed to capture a log during a DRBD crash.

I have a two-node RHEL 5.3 cluster running kernel 2.6.18-164.el5xen and
a self-compiled drbd-8.3.1-3.

The nodes are connected by a dedicated back-to-back 1 Gbit Ethernet link
over RTL8169sb/8110sb cards.

When I run applications that constantly read from or write to the disks
(active/active configuration), DRBD keeps crashing.

I now have logs that show the reason:


______________________
ON TWEETY1

Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check FAILED.
Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check FAILED.
Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!
Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
Oct 20 15:46:52 localhost clurgmgrd: [4161]: <info> Executing /etc/init.d/drbd status
Oct 20 15:46:52 localhost clurgmgrd: [4161]: <info> Executing /etc/init.d/drbd status
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed

___________________________

ON TWEETY2


Oct 20 15:46:52 localhost kernel: drbd2: sock was reset by peer
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Oct 20 15:46:52 localhost kernel: drbd2: short read expecting header on sock: r=-104
Oct 20 15:46:52 localhost kernel: drbd2: meta connection shut down by peer.
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed
Oct 20 15:46:52 localhost kernel: drbd2: helper command: /sbin/drbdadm fence-peer minor-2
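
To compare the two nodes more easily, the state changes in the kernel lines above can be pulled apart mechanically. The following is only a minimal sketch, not part of DRBD: the names TRANSITION and parse_transitions are made up for illustration, and the pattern assumes the "name( Old -> New )" layout seen in these excerpts.

```python
import re

# Matches one "name( Old -> New )" group as printed in the log excerpts above.
# The pattern is an assumption based on these excerpts only.
TRANSITION = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def parse_transitions(line):
    """Return {field: (old, new)} for every state change found in a log line."""
    return {name: (old, new) for name, old, new in TRANSITION.findall(line)}


# The state-change lines from the two nodes, with the mail line wraps removed.
tweety1 = ("drbd2: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) "
           "pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )")
tweety2 = ("drbd2: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) "
           "pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )")

changes1 = parse_transitions(tweety1)
changes2 = parse_transitions(tweety2)

# Print only the fields where the two nodes report different transitions.
for field in changes1:
    if changes1[field] != changes2.get(field):
        print(field, changes1[field], changes2.get(field))
```

With the two lines above, the only field that differs is conn: TWEETY1 goes to ProtocolError (it saw the failed digest check), while TWEETY2 goes to BrokenPipe (it only saw the socket being reset).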

____________________


DRBD.CONF


#
# drbd.conf
#


global {

    usage-count yes;
}


common {

  protocol C;

  syncer {

    rate 100M;

    al-extents 257;
  }

  
 handlers {
    
    pri-on-incon-degr "echo b > /proc/sysrq-trigger ; reboot -f";

    pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";

    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";

    outdate-peer "/sbin/obliterate";


    pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root; echo b > /proc/sysrq-trigger ; reboot -f";

    split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";

  }

  startup {

     wfc-timeout  60;


    degr-wfc-timeout 60;    # 1 minute


    become-primary-on both;

  }

  disk {

    fencing resource-and-stonith;


  }

  net {
    
     sndbuf-size 512k;

     timeout       60;    #  6 seconds  (unit = 0.1 seconds)
     connect-int   10;    # 10 seconds  (unit = 1 second)
     ping-int      10;    # 10 seconds  (unit = 1 second)
     ping-timeout  50;    #  5 seconds  (unit = 0.1 seconds)

     max-buffers     2048;

     max-epoch-size  2048;

     ko-count 10;


    allow-two-primaries;


      cram-hmac-alg "sha1";
      shared-secret "*****";


    after-sb-0pri discard-least-changes;

    after-sb-1pri violently-as0p;


    after-sb-2pri violently-as0p;


    rr-conflict call-pri-lost;


    data-integrity-alg "crc32c";

  }


}


resource r0 {

        device          /dev/drbd0;
        disk            /dev/hda4;
        meta-disk       internal;

 on tweety-1 { address   10.254.254.253:7788; }

 on tweety-2 { address   10.254.254.254:7788; }

}

resource r1 {

        device        /dev/drbd1;
        disk          /dev/hdb4;
        meta-disk     internal;

  on tweety-1 { address  10.254.254.253:7789; }

  on tweety-2 { address  10.254.254.254:7789; }
}

resource r2 {

	device		/dev/drbd2;
	disk		/dev/sda1;
	meta-disk	internal;

  on tweety-1 { address  10.254.254.253:7790; }

  on tweety-2 { address  10.254.254.254:7790; }
}
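
Because the net options mix two units, here is a small sketch that converts the values from the net section above into seconds. It is only an illustration: TICKS_PER_SECOND and to_seconds are hypothetical names, and the units are the ones given in the comments of that section (tenths of a second for timeout and ping-timeout, whole seconds for connect-int and ping-int).

```python
# Number of configured units per second for each option.
# timeout and ping-timeout are given in tenths of a second,
# connect-int and ping-int in whole seconds.
TICKS_PER_SECOND = {
    "timeout": 10,
    "ping-timeout": 10,
    "connect-int": 1,
    "ping-int": 1,
}


def to_seconds(option, value):
    """Convert a configured value to seconds using the option's unit."""
    return value / TICKS_PER_SECOND[option]


# Values copied from the net section of the drbd.conf above.
net = {"timeout": 60, "connect-int": 10, "ping-int": 10, "ping-timeout": 50}

effective = {name: to_seconds(name, value) for name, value in net.items()}
for name, seconds in effective.items():
    print(f"{name:13s} {seconds:g} s")

# As far as I understand the drbd.conf man page, timeout should be lower
# than both connect-int and ping-int; this checks that for the values above.
timeout_ok = (effective["timeout"] < effective["connect-int"]
              and effective["timeout"] < effective["ping-int"])
print("timeout below connect-int and ping-int:", timeout_ok)
```

For this configuration that gives 6 s for timeout, 10 s each for connect-int and ping-int, and 5 s for ping-timeout.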

_________

Also available at http://pastebin.ca/1633173


How can I solve this?

Thank you all for your time.

