[DRBD-user] DRBD suitable for Fast Failover ?

Thu May 28 08:56:03 CEST 2015

Hi

During my tests with DRBD in Dual Primary Mode (see thread 'Dual Primary Mode: Shared Directory blocked after node crash until reboot') I asked myself whether DRBD is suitable for fast failover.

My test configuration looks like this:

- 2-node Red Hat 7 cluster (Pacemaker, Corosync)
- iLO 4 fencing devices
- DRBD 8.4 in Dual Primary Mode
- GFS2 Shared Filesystem

Test case: node crash simulation, either a hard reset using 'echo b > /proc/sysrq-trigger' or pulling the power plug.

After a simulated node crash, writes to the shared directory on the surviving node are blocked for about 15 seconds due to fencing.
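To put a number on the stall, a small probe along these lines could be run on the surviving node while the other one is reset. This is only a sketch: the file name 'probe.tmp' and the function names are made up for illustration, and it assumes the shared directory is mounted at a path passed in by the caller.

```python
import os
import time


def timed_write(path, payload):
    """Write payload to path, fsync it, and return the elapsed seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # do not return before the data has been flushed
    finally:
        os.close(fd)
    return time.monotonic() - start


def probe(directory, rounds, interval=0.1):
    """Return the longest single write+fsync duration over `rounds` writes."""
    path = os.path.join(directory, "probe.tmp")  # hypothetical file name
    worst = 0.0
    for i in range(rounds):
        worst = max(worst, timed_write(path, str(i).encode()))
        time.sleep(interval)
    return worst
```

Calling probe() on the shared directory for a minute or so around the reset should return a worst-case value close to the blocking time observed above.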

Suppose you have an application running on each node. Its internal state (an incrementing counter variable) is saved (and replicated) into the shared directory.
On one node the application works as 'Main' and on the other node as 'Standby' (achieved by a 0-byte lock file in the shared directory).
If the Main node fails, the Standby node should switch to mode 'Main' and take over the work. BUT with the configuration described above, the Standby node can resume the work only after 15 seconds, because writing the counter variable is blocked until fencing completes.
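The application pattern described above can be sketched roughly as follows. This is not the real application, only a minimal illustration: the function names are hypothetical, the counter is assumed to be stored as plain text in a single file, and how the Standby node learns that the Main node has died is left out.

```python
import os


def try_become_main(lock_path):
    """Atomically create the 0-byte lock file; True means this node is Main."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False  # another node already holds the Main role
    os.close(fd)
    return True


def load_counter(state_path):
    """Read the saved counter, or 0 if nothing has been saved yet."""
    try:
        with open(state_path) as f:
            return int(f.read())
    except FileNotFoundError:
        return 0


def save_counter(state_path, value):
    """Write to a temp file, fsync, then rename over the old state."""
    tmp = state_path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(value))
        f.flush()
        os.fsync(f.fileno())  # this is the call that stalls during fencing
    os.replace(tmp, state_path)


def step(state_path):
    """One unit of work on the Main node: increment and persist the counter."""
    value = load_counter(state_path) + 1
    save_counter(state_path, value)
    return value
```

After a takeover the Standby node would call step() on the same state file, and it is that first write that cannot complete until fencing has finished.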

My goal is to achieve a failover time of about 1-2 seconds.

Maybe I should switch to a DRBD Master-Slave (single-primary) configuration, in which only the Main node writes to the shared directory?

Any other suggestions?

Thanks