[DRBD-user] Server reboot on DRBD heavy load problem

Proskurin Kirill proskurin-kv at fxclub.org
Mon Sep 20 12:01:36 CEST 2010

Hello all.

I have been fighting a strange problem for more than three weeks.

What we have:
2x Dell 2950 running Debian 5.0 with kernel 2.6.32-bpo.5-amd64 from
backports (DRBD included), plus OCFS2.

I generate a heavy load with iozone on the OCFS2 partition:
iozone -RK -t 4 -s 10g -i 0 -i 1 -i 2 -b /tmp/`hostname`.xls
on both nodes.
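
For reference, the load command above can be wrapped in a small shell
function with its flags annotated. This is only a sketch: the function
name iozone_cmd is hypothetical, and the flag descriptions are my reading
of the iozone usage text, so check them against your iozone version.

```shell
#!/bin/sh
# Hypothetical helper: prints the iozone command line used for the load test.
# Flags (as described in the iozone usage text):
#   -R        generate an Excel-compatible report
#   -K        mix some random accesses into the normal tests
#   -t 4      throughput mode with 4 threads/processes
#   -s 10g    file size of 10 GB
#   -i 0/1/2  write/rewrite, read/re-read, random read/write
#   -b FILE   write the spreadsheet output to FILE
iozone_cmd() {
    host="$1"
    printf 'iozone -RK -t 4 -s 10g -i 0 -i 1 -i 2 -b /tmp/%s.xls\n' "$host"
}

# Print the command for this node; it is meant to be run from a
# directory on the OCFS2 mount.
iozone_cmd "$(hostname)"
```

The function only prints the command, so it can be reviewed before it is
started on both nodes.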

After 1-3 hours both servers reboot. It must be DRBD or OCFS2 related,
because it does not happen on a normal partition. The OCFS2 developers
looked at the stack trace I caught (in the attachment) and said that it
is not an OCFS2 problem.

I started to think it is the hardware or the system. I tried the 2.6.26
kernel and upgrading to testing; neither helped at all.

So it is either the hardware or DRBD. Could you please help me find out
where the problem is?
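
To narrow this down, the DRBD state could be logged periodically during
the load, so that the last snapshot before a reboot survives on disk. A
minimal sketch, assuming DRBD 8.x exposes its state in /proc/drbd; the
function name snapshot_state and the log path are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append a timestamped copy of a state file to a log.
# $1 = state file (e.g. /proc/drbd)
# $2 = log file, ideally on a partition that is not on DRBD
snapshot_state() {
    src="$1"
    log="$2"
    {
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        cat "$src"
    } >> "$log"
    # Flush to disk so the entry has a chance to survive a sudden reboot.
    sync
}

# Example loop (commented out so this block only defines the function):
# while sleep 10; do snapshot_state /proc/drbd /var/log/drbd-state.log; done
```

After a reboot, the tail of the log shows the connection and disk state
from shortly before the crash.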

Configs below:

mail01:~# cat /etc/drbd.d/drbd0.res
resource drbd0 {

on mail01.fxclub.org {
device /dev/drbd0;
disk /dev/sda9;
meta-disk internal;
}

on mail02.fxclub.org {
device /dev/drbd0;
disk /dev/sda9;
meta-disk internal;
}

}

mail01:~# cat /etc/drbd.d/global_common.conf
global {
	usage-count yes;
	# minor-count dialog-refresh disable-ip-verification
}

common {
	protocol C;

	handlers {
		pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
		pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
		local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
		outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
		# fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
		split-brain "/usr/lib/drbd/notify-split-brain.sh root";
		# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
		# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
		# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
	}

	startup {
		wfc-timeout 60;
		degr-wfc-timeout 30;
		outdated-wfc-timeout 15;
		become-primary-on both;
		# wait-after-sb;
	}

	disk {
		fencing resource-and-stonith;
		# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
		# no-disk-drain no-md-flushes max-bio-bvecs
	}

	net {
		cram-hmac-alg sha1;
		shared-secret "password";
		ping-timeout 20;
		after-sb-0pri discard-zero-changes;
		after-sb-1pri discard-secondary;
		after-sb-2pri disconnect;
		data-integrity-alg sha1;
		# Tuning
		max-buffers 8000;
		max-epoch-size 8000;
		sndbuf-size 0;
		# snd.buf-size rcvbuf-size timeout connect-int ping-int ping-timeout
		# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
		# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
	}

	syncer {
		rate 60M;
		al-extents 3389;
		# rate after al-extents use-rle cpu-mask verify-alg csums-alg
	}
}

P.S. I started to think it could be the handlers and commented them out,
but that did not help.

Best regards,
Proskurin Kirill
