Hello,

I have a couple of Dell 2950 III servers, both running CentOS 5.3, Xen, DRBD 8.2 and Cluster Suite.

Hardware: 32 GB RAM, RAID 5 with 6 SAS disks (one hot spare) on a PERC/6 controller.

I configured DRBD to use the main network interfaces (bnx2 driver), with bonding and crossover cables to have a direct link. The normal network traffic uses two different network cards. There are two DRBD resources, for a total of a little less than 1 TB.

When the two hosts are in sync, if I activate more than a few (six or seven) Xen guests, the master server crashes spectacularly and reboots. I've seen a kernel dump over the serial console, but the machine restarts immediately, so I didn't have the chance to write it down.

Unfortunately I cannot experiment, because I have production services on those machines (and they work fine until I start DRBD on the slave).

The DRBD configuration is attached.

I've read some threads about a similar problem, but in my case the hosts aren't losing the connection. The network cards used by DRBD are:

Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet

and the offload settings (defaults) are:

Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off

Are the suggested workarounds valid for my case too?

Thanks in advance,
Andrea

--
Andrea Dell'Amico - <http://www.link.it/>

-------------- next part --------------
#
# At most ONE global section is allowed.
# It must precede any resource section.
#
global {
    # minor-count 64;
    # dialog-refresh 5; # 5 seconds
    # disable-ip-verification;
    usage-count no;
}

common {
    syncer { rate 50M; }
}

#
# this need not be r#, you may use phony resource names,
# like "resource web" or "resource mail", too
#
resource virtual1 {
    protocol C;

    handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        #outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
        #pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root";
        #split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
        #out-of-sync "echo out-of-sync. drbdadm down $DRBD_RESOURCE. drbdadm ::::0 set-gi $DRBD_RESOURCE. drbdadm up $DRBD_RESOURCE. | mail -s 'DRBD Alert' root";
        #before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
        #after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
    }

    startup {
        wfc-timeout 120;
        degr-wfc-timeout 120; # 2 minutes.
        # wait-after-sb;
    }

    disk {
        on-io-error pass_on;
        fencing resource-and-stonith;
        # ONLY USE THIS OPTION IF YOU KNOW WHAT YOU ARE DOING.
        # no-disk-flushes;
        # no-md-flushes;
        max-bio-bvecs 1;
    }

    net {
        # max-buffers 2048;
        # unplug-watermark 128;
        # max-epoch-size 2048;
        # ko-count 4;
        cram-hmac-alg "sha1";
        shared-secret "bah";
        after-sb-0pri discard-younger-primary;
        after-sb-1pri call-pri-lost-after-sb;
        after-sb-2pri call-pri-lost-after-sb;
        rr-conflict disconnect;
        # data-integrity-alg "md5";
    }

    syncer {
        rate 50M;
        #after "r2";
        al-extents 257;
        # cpu-mask 15;
    }

    on server-1 {
        device /dev/drbd1;
        disk /dev/sda2;
        address 192.168.2.1:7788;
        meta-disk internal;
    }

    on server-2 {
        device /dev/drbd1;
        disk /dev/sda4;
        address 192.168.2.2:7788;
        meta-disk internal;
    }
}

resource servizi0 {
    #**********
    #
    protocol C;

    handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        #outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
        #pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root";
        # Notify someone in case DRBD split brained.
        #split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
        # Notify someone in case an online verify run found the backing devices out of sync.
        #out-of-sync "echo out-of-sync. drbdadm down $DRBD_RESOURCE. drbdadm ::::0 set-gi $DRBD_RESOURCE. drbdadm up $DRBD_RESOURCE. | mail -s 'DRBD Alert' root";
        #
        #before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
        #after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
    }

    startup {
        wfc-timeout 120;
        degr-wfc-timeout 120; # 2 minutes.
        # wait-after-sb;
        # become-primary-on both;
    }

    disk {
        on-io-error pass_on;
        fencing resource-and-stonith;
        # size 10G;
        # no-disk-flushes;
        # no-md-flushes;
        max-bio-bvecs 1;
    }

    net {
        # sndbuf-size 512k;
        # timeout 60; # 6 seconds (unit = 0.1 seconds)
        # connect-int 10; # 10 seconds (unit = 1 second)
        # ping-int 10; # 10 seconds (unit = 1 second)
        # ping-timeout 5; # 500 ms (unit = 0.1 seconds)
        # max-buffers 2048;
        # unplug-watermark 128;
        # max-epoch-size 2048;
        # ko-count 4;
        # allow-two-primaries;
        cram-hmac-alg "sha1";
        shared-secret "serviziDRBDcosasegretissima";
        after-sb-0pri discard-younger-primary;
        after-sb-1pri call-pri-lost-after-sb;
        after-sb-2pri call-pri-lost-after-sb;
        rr-conflict disconnect;
        # data-integrity-alg "md5";
    }

    syncer {
        rate 50M;
        #after "r2";
        al-extents 257;
        # cpu-mask 15;
    }

    on server-1 {
        device /dev/drbd0;
        disk /dev/sda5;
        address 192.168.2.1:7789;
        meta-disk internal;
    }

    on server-2 {
        device /dev/drbd0;
        disk /dev/sda3;
        address 192.168.2.2:7789;
        meta-disk internal;
    }
}

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 307 bytes
Desc: This is a digitally signed message part
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20090730/456cf47c/attachment.pgp>
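
For anyone comparing offload settings across hosts, here is a minimal Python sketch that parses the `ethtool -k` report quoted in the message and prints (without running it) an `ethtool -K` command for the offloads one might want to switch off. It assumes that the "suggested workarounds" from the other threads involve disabling TCP segmentation offload and related features; the post itself does not say what they are. The helper names `parse_offloads` and `disable_command` are hypothetical and not part of any DRBD or ethtool tooling.

```python
# Report text copied from the message above (output of `ethtool -k eth0`).
OFFLOAD_REPORT = """\
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
"""

# Long names printed by `ethtool -k`, mapped to the short names
# that `ethtool -K` accepts on the command line.
SHORT_NAMES = {
    "rx-checksumming": "rx",
    "tx-checksumming": "tx",
    "scatter-gather": "sg",
    "tcp segmentation offload": "tso",
    "udp fragmentation offload": "ufo",
    "generic segmentation offload": "gso",
}


def parse_offloads(report):
    """Return {long name: True/False} for every 'name: on|off' line."""
    settings = {}
    for line in report.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Header and error lines do not end in on/off, so they are skipped.
        if sep and value in ("on", "off"):
            settings[name.strip()] = value == "on"
    return settings


def disable_command(device, settings, flags):
    """Build an `ethtool -K` command for the requested flags that are still on.

    Returns None when none of the requested offloads is currently enabled.
    """
    enabled = {SHORT_NAMES[n] for n, on in settings.items()
               if on and n in SHORT_NAMES}
    to_disable = [flag for flag in flags if flag in enabled]
    if not to_disable:
        return None
    return "ethtool -K " + device + " " + " ".join(f"{f} off" for f in to_disable)


settings = parse_offloads(OFFLOAD_REPORT)
# Assumption: TSO and GSO are the offloads under suspicion.
print(disable_command("eth0", settings, ["tso", "gso"]))
```

The script only prints a candidate command; whether turning any offload off actually helps with the crash described here is untested, and on a bonded setup it would presumably have to be applied to each slave interface.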