[DRBD-user] Primary not disconnecting Secondary with IO problems

Fri Nov 27 14:48:36 CET 2009

Hi list,

I'm using DRBD and NFS to provide high availability (HA) for virtual machine images across pairs of storage servers.

The systems run RHEL 5.4 (kernel 2.6.18-164.el5) with DRBD 8.3 from CentOS Extras.

We've been having an issue where disk I/O problems on the DRBD Secondary also stop all I/O on the
Primary. DRBD doesn't seem to recognise these disk I/O problems: the Secondary isn't disconnected
automatically, and everything just hangs.

While the systems are in this state:
If I run "drbdadm disconnect all" on the Primary, the command hangs.
If I run it on the Secondary, the command eventually completes, and NFS I/O on the Primary returns
to normal.

I've tried the following things to fix this:

1) Adding a custom local-io-error handler that hard-resets the problem node.

This never triggers, just as the default "detach" action never triggers.
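For context, a minimal sketch of how such a handler is wired up in DRBD 8.3. The resource name "r0" and the script path are hypothetical placeholders, and the "on <host>" sections and other settings are omitted:

	resource r0 {
		disk {
			# run the local-io-error handler on a lower-level I/O error
			on-io-error call-local-io-error;
		}
		handlers {
			# hypothetical script that hard-resets this node
			local-io-error "/usr/local/sbin/hard-reset.sh";
		}
	}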

2) Changing the net connection parameters to:

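	net {
		# expel the peer if a single write is not completed within ko-count x timeout
		ko-count 2;
		# unit is tenths of a second, so 20 = 2 s (ko-count 2 gives about 4 s)
		timeout 20;
	}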

Again, this never triggers.

3) Changing the replication protocol from C to B.

This has no effect on the issue, and I'd prefer to use C anyway.
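For reference, the protocol is set per resource in DRBD 8.3; a minimal fragment, again assuming a hypothetical resource named "r0":

	resource r0 {
		# C: a write completes once it is on both the local and the peer disk
		# B: a write completes once it is on the local disk and has reached the peer
		protocol C;
	}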

Any further ideas on how to track this issue down and fix it?

Thanks,

James Masson