Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi All, This topic has been covered many times, on a good few mailing lists that I've already found. To quickly recap the issue in question (that I'm also suffering with): 2 hosts - Tara, and Inertia, using Linux HA in an active/passive configuration. Inertia is the primary, aka node 1. I pull the plug on the single network cable connecting Inertia to the switch. Drbd notices the dropped link. Tara is now Primary/Unknown. Inertia is Secondary/Unknown. The serial connection is still up between the nodes, HB negotiates with ping nodes, and fails over to Inertia. The HA failover scripts change Tara to Secondary/Unknown, and Inertia to Primary/Unknown. Great. All is working as designed so far, and service continues....good cluster. *pet pet* Now, I plug the cable back into tara, drbd notices, and prints: Sep 30 00:08:06 tara kernel: drbd0: Handshake successful: DRBD Network Protocol version 74 Sep 30 00:08:06 tara kernel: drbd0: Connection established. Sep 30 00:08:06 tara kernel: drbd0: I am(P): 1:00000002:00000001:00000060:0000001c:10 Sep 30 00:08:06 tara kernel: drbd0: Peer(S): 1:00000002:00000001:00000061:0000001b:00 Sep 30 00:08:06 tara kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption. So, once inertia has network access restored, it is unable to resync. Now, I have already researched this and the drbd developers have explained that this is not drbd's fault. The reason is that BOTH sides have changed. I assume the simple act of umounting the disk (as part of the failover) on inertia is enough to count as a write, incrementing drbd's counters on inertia. There have been various suggestions along the lines of stonithing inertia, and that if inertia were to be restarted manually, the problem will go away. Neither of these sit well with me. I also understand from the road map that drbd 0.8 will have some options to deal with exactly this situation. With some exploring I found that if the Primary drbd host runs a simple 'drbd connect r0', the two will resync successfully. However if the secondary runs the same command this it won't work. So I wrote this script (be kind, this is my first ever bash script...) #!/bin/bash drbdadm=/sbin/drbdadm if grep -q Unknown /proc/drbd then echo "we have a broken drbd connection" $drbdadm connect r0 fi exit This was then added to cron on both machines, and set to run every 10 minutes, offset by 5 per server. This allows the system to work with either host running as the active node For me... this has fixed my split brain problem. This still isn't sophisticated enough to allow for a HA auto-fallback, but at least I have data synced on both disks increasing my redundancy until a) I switch back manually, or b) another failure takes out the failed over node, in which case this will have saved my bacon. Now, I might be misunderstanding exactly what drbd connect does. At face value this appears to simply be initiating a connection with it's peer, and it seems to me that this is something that drbd should be able to take care of itself internally. I have my connect-int set to 10 seconds, but from what I'm seeing here, they try *ONCE* and give up. A simple retry (in the "other" direction) should allow for a successful resync? As an intermediate step before 0.8, is this something that could be implemented in 0.7x? I know this doesn't cover all the situations of splitbrain, but I'm sure this would help others with this situation? Or am I missing something fundemental? Thanks, Jonathan Wheeler