[DRBD-user] Default Split Brain Behaviour

Thu Jan 27 03:24:31 CET 2011

Thanks again Felix,

> > common {
> > 	protocol A;
> >
> > 	handlers {
> > 		pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh";
> > 		pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh";
> > 		echo o > /proc/sysrq-trigger ; halt -f";
> 
> The above looks..."funny" to me. What's wrong here? Copy/Paste error?
> 
> Did you modify any notify-* scripts?
Ah, I see what you mean; it was just a cut-and-paste error that I missed (apologies, a stupid mistake). It should have read:
common {
        protocol A;

        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh";
                local-io-error "/usr/lib/drbd/notify-io-error.sh";
                split-brain "/usr/lib/drbd/notify-split-brain.sh root";
                out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
        }
From memory, I think I commented out the original line in the default global config:
#local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
and replaced it with the line shown above:

local-io-error "/usr/lib/drbd/notify-io-error.sh";

I didn't want a situation where an I/O error induced by extreme load would trigger an emergency shutdown, as it's a virtualization server.
I did want to be notified, though.

The date stamps on the notify-* scripts are all identical (predating the system build), and I don't recall modifying them at all.

From the logs, I'm curious about the lines...
Jan 23 15:07:16 emlsurit-v4 kernel: [   15.044910] block drbd9: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 23 15:07:16 emlsurit-v4 kernel: [   15.044929] block drbd9: Marked additional 508 MB as out-of-sync based on AL.

...then a little further down
Jan 23 15:53:01 emlsurit-v4 kernel: [ 2756.121108] block drbd9: role( Secondary -> Primary )

These entries were seen only on the rebooted node that was primary. Other than this, the log entries for the two nodes were ostensibly the same.
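
A quick sanity check on the 508 MB figure in the log lines above. This is only a sketch under two assumptions that I have not confirmed for this resource: that al-extents is left at the DRBD 8.3 default of 127, and that each activity-log (AL) extent covers 4 MiB. If both hold, 508 MB is simply the full AL being marked for resync after the unclean restart of a primary, not a measured amount of changed data. The names below are illustrative only.

```python
# Assumptions (not confirmed for this resource): al-extents is left at the
# DRBD 8.3 default of 127, and each activity-log extent covers 4 MiB.
AL_EXTENTS = 127
EXTENT_MIB = 4


def al_covered_mib(extents: int = AL_EXTENTS, extent_mib: int = EXTENT_MIB) -> int:
    """Return how much backing storage the activity log can cover, in MiB."""
    return extents * extent_mib


if __name__ == "__main__":
    # With the assumed defaults this matches the figure in the kernel log.
    print(al_covered_mib())  # 508
```

If al-extents has been changed in the resource section, the product will differ and this explanation would not apply.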

This node was always primary, and the KVM virtual machine running off it does not even exist on the other node; yet the resource roles (primary vs secondary) have been reversed.

The nodes of the resource were in a disconnected state prior to the reboot of the primary node.
The secondary (disconnected) node remained on, and there is no HA setup associated with either node on any resource.

I did note a clock skew of 3 minutes between the nodes, due to an incorrect ntp source.

On both nodes, I also noticed ... block drbd9: helper command: /sbin/drbdadm split-brain minor-9 exit code 127 (0x7f00)
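
To make sense of that status, here is a minimal sketch that decodes the raw value the same way the log message appears to (exit code in bits 8 to 15 of a wait-style status word) and then checks whether the handler scripts named in the config are present and executable. By shell convention, 127 means "command not found"; whether that is what happened here is an assumption on my part. The function names are hypothetical, and the paths are the ones from the handlers section above.

```python
import os


def helper_exit_code(status: int) -> int:
    """Extract the exit code from a wait(2)-style status word (bits 8-15)."""
    return (status >> 8) & 0xFF


def missing_handlers(paths):
    """Return the paths that do not exist or are not executable."""
    return [p for p in paths if not os.access(p, os.X_OK)]


if __name__ == "__main__":
    print(helper_exit_code(0x7F00))  # 127

    # Script paths taken from the handlers section of the config.
    handlers = [
        "/usr/lib/drbd/notify-pri-on-incon-degr.sh",
        "/usr/lib/drbd/notify-pri-lost-after-sb.sh",
        "/usr/lib/drbd/notify-io-error.sh",
        "/usr/lib/drbd/notify-split-brain.sh",
        "/usr/lib/drbd/notify-out-of-sync.sh",
    ]
    for path in missing_handlers(handlers):
        print("missing or not executable:", path)
```

Run on each node, an empty list from missing_handlers would rule out the "script not found" reading and point elsewhere, for example at something the script itself calls.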

Somehow the data (508 MB?) has rolled back, and while I'm sad that I've likely lost it, I can't afford to release this system to production until I'm content it won't happen again.

The userland tools are ...

drbdadm --version
DRBDADM_BUILDTAG=GIT-hash:\ ea9e28dbff98e331a62bcbcc63a6135808fe2917\ build\ by\ buildd at yellow\,\ 2010-06-01\ 11:06:12
DRBDADM_API_VERSION=88
DRBD_KERNEL_VERSION_CODE=0x080307
DRBDADM_VERSION_CODE=0x080307
DRBDADM_VERSION=8.3.7

Any assistance to help me dig a little deeper here will be greatly appreciated.

Cheers,

Lew