Hi,

I am having a weird problem with two DRBD machines. These machines are identical in hardware and software; both run Debian Etch and DRBD 8.0.14. They run Heartbeat and the IETD iSCSI target (latest stable version) in an active/passive setup. The primary node is "stor1", the secondary "stor2". They have a direct Gigabit connection between them, dedicated to DRBD.

The problem is that Xen VMs running from this iSCSI target crash under high disk load, presumably because the iSCSI sessions time out.

This only occurs when:
- The two machines ("stor1" and "stor2") are both online, so stor1 is the active node, syncing to stor2.
- stor1 is offline, so stor2 is the active node (and there is no DRBD syncing).

This does not occur when:
- stor2 is offline, so stor1 is the active node (there is no DRBD syncing between the nodes).

Sometimes, a few seconds before one of these crashes, the following error appears in the syslog, on stor1 only. Note that this does not happen every time:

drbd0: [drbd0_worker/26335] sock_sendmsg time expired, ko = 3

I am currently running from stor1 alone, and have had no crashes. From this I conclude that the problem must lie somewhere in the stor2 machine. I just can't find out where.

Both machines run on A-brand hardware RAID, so the disks can't be the problem. Both machines have Intel Gigabit Ethernet ports as well as on-board nVidia Gigabit Ethernet ports. Letting DRBD sync over the other brand of network interface does not make any difference. I am running the latest Intel e1000 drivers for these interfaces, so I do not think the network is the problem.

Following is my drbd.conf, which is pretty much the default, but just in case. If any more information is needed, I'd be glad to supply it.

Thanks,
Rik

### /etc/drbd.conf ###
global {
    usage-count yes;
}

common {
}

resource resource0 {
    protocol C;

    handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    }

    startup {
        degr-wfc-timeout 120; # 2 minutes.
    }

    disk {
        on-io-error detach;
    }

    net {
        ko-count 4;
        after-sb-0pri disconnect;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;
        rr-conflict disconnect;
    }

    syncer {
        rate 90M;
        al-extents 257;
    }

    on stor1 {
        device /dev/drbd0;
        disk /dev/sda9;
        address 192.168.3.121:7788;
        meta-disk /dev/sda8 [0];
    }

    on stor2 {
        device /dev/drbd0;
        disk /dev/sda9;
        address 192.168.3.122:7788;
        meta-disk /dev/sda8 [0];
    }
}
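
For context on the log line quoted earlier: as far as I understand it, the "ko = 3" value is the ko-count 4 from the net section counting down. Each time the peer fails to respond within the net timeout, the counter drops by one, and the connection is dropped when it reaches zero. The fragment below is only a sketch of how the two settings relate. The timeout line is an assumption: it is not in my file, and I have written out what I believe is the DRBD 8.0 default (60, in units of 0.1 seconds).

```
net {
    # Assumption: not set in the config above; 60 is believed to be the
    # DRBD 8.0 default. The unit is 0.1 s, so this means 6 seconds.
    timeout 60;

    # As in the config above. The peer is given up on after roughly
    # ko-count * timeout = 4 * 6 s = 24 s without completing a write.
    ko-count 4;
}
```

If that reading is right, a single "ko = 3" message means one timeout interval (about 6 seconds) passed with stor2 not acknowledging a write, which would fit the iSCSI sessions timing out shortly afterwards.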