On Tue, 19 Feb 2008, Florian Haas wrote:

>> I'm seeing the above error if I try to dual-mount the DRBD resource while
>> the local resource is still syncing off the remote resource.
>> I have a primary-primary setup with GFS.
>> Once the above error is encountered, GFS aborts joining the cluster.
> Logs? Configuration details? You didn't really give us much to work with...

This appears to be a strange compound problem.

Using DRBD 8.0.8, the performance is quite acceptable in the single-node case.
Unfortunately, when an attempt is made to dual-mount the GFS file system before 
the secondary is fully up to date (but is connected and syncing), the 2nd node 
to join notices an inconsistency, and withdraws from the cluster. In the 
process, GFS gets corrupted, and the only way to get it to mount again on 
either node is to repair it with fsck.
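
For what it's worth, a guard along these lines could hold off the GFS mount 
until the resource is connected and both disks are up to date. This is only a 
sketch: the helper name safe_to_mount is made up, and the status line layout 
(cs:, st:, ds: fields) is an assumption about what /proc/drbd prints on 8.0.x.

```python
import re

# Assumed layout of a DRBD 8.0 status line in /proc/drbd, for example:
#  1: cs:SyncTarget st:Primary/Primary ds:Inconsistent/UpToDate C r---
STATUS_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)\s+\S+\s+ds:(\S+)/(\S+)")


def safe_to_mount(proc_drbd_text, minor):
    """Return True only if the minor is Connected and both disks are UpToDate."""
    for line in proc_drbd_text.splitlines():
        m = STATUS_RE.match(line)
        if m and int(m.group(1)) == minor:
            cs, local_ds, peer_ds = m.group(2), m.group(3), m.group(4)
            return (cs == "Connected"
                    and local_ds == "UpToDate"
                    and peer_ds == "UpToDate")
    # Minor not listed at all: treat as unsafe rather than guess.
    return False
```

An init script could poll the contents of /proc/drbd with minor 1 (for 
/dev/drbd1) and only proceed to the mount once this returns True.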

If the nodes are in sync by the time GFS tries to mount, the mount succeeds, 
but everything grinds to a halt shortly afterwards - so much so that the only 
way to get things moving again is to hard-reset one of the nodes, preferably 
the 2nd one to join.

Here is where the second thing that seems wrong happens - the first node 
doesn't just lock up at this point, as one might expect (when a connected node 
disappears, e.g. due to a hard reset, the cluster is supposed to try to fence it 
until it cleanly rejoins - and it can't possibly fence the other node since I 
haven't configured any fencing devices yet). This doesn't seem to happen. The 
first node seems to continue like nothing happened. This is possibly connected 
to the fact that by this point, GFS is corrupted and has to be fsck-ed at next 
boot. This part may be a cluster setup issue, so I'll raise that on the cluster 
list, although it seems to be a DRBD-specific peculiarity - using a SAN doesn't 
have this issue with a nearly identical cluster.conf (only difference being the 
block device specification).

I tried 8.0.11, and the performance of that seems to be very poor. So poor that 
I've not yet had the patience to wait for the first node to boot. (I'm using 
Open Shared Root on this, so my DRBD file system is actually the root file 
system.) So, for now, I'm back to using 8.0.8.

I am using binary RPMs for drbd and kmod-drbd from the CentOS extras 
repository.

The nodes' cluster interface is cross-over connected node-to-node on a Gb 
ethernet interface, so performance there shouldn't be an issue. Syncing the 
block device seems to happen at approximately 30MB/s, which I suspect to be 
disk bound (2x 36GB RAID0 U160 SCSI).
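
As a rough sanity check on the "not network bound" reasoning, the arithmetic 
can be sketched as below. The function name is made up for illustration, and 
the link figure is the raw gigabit line rate, ignoring protocol overhead.

```python
GIGABIT_LINK_MB_S = 1000 / 8   # 1 Gb/s as MB/s, before any protocol overhead
OBSERVED_SYNC_MB_S = 30        # approximate sync rate reported above


def link_utilisation(observed_mb_s, link_mb_s):
    """Fraction of the raw link rate that the observed sync rate uses."""
    return observed_mb_s / link_mb_s


if __name__ == "__main__":
    share = link_utilisation(OBSERVED_SYNC_MB_S, GIGABIT_LINK_MB_S)
    print(f"sync uses about {share:.0%} of the raw link rate")
```

At roughly a quarter of the raw link rate, and below the 64M syncer rate set 
in drbd.conf, neither the wire nor the configured cap looks like the limit.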

The drbd.conf file is as follows:

global {
         usage-count no;
}

common {
         syncer {
                 rate 64M;
         }
}

resource drbd1 {
         protocol C;

         net {
                 after-sb-0pri   discard-younger-primary;
                 after-sb-1pri   discard-secondary;
                 after-sb-2pri   call-pri-lost-after-sb;

                 cram-hmac-alg   sha1;
                 shared-secret   "secret";
         }

         startup {
                 wfc-timeout             60;
                 degr-wfc-timeout        60;
                 become-primary-on       both;
         }

         on sentinel1c {
                 device          /dev/drbd1;
                 disk            /dev/sda6;
                 meta-disk       internal;
         }

         on sentinel2c {
                 device          /dev/drbd1;
                 disk            /dev/sda6;
                 meta-disk       internal;
         }
}

Another thing that seems to happen is that the node ends up somehow losing 
access to its file system. After some arbitrary period of being idle, the 
whole thing just blocks. The node pings, ssh responds, but everything times out 
when an attempt to access the file system is made. Even logging in on the 
console doesn't work. Getting to the logs can be a bit difficult with OSR (they 
get reset on reboot), so I don't have those at the moment.

Has anyone seen anything like this problem before? I'm assuming for now that 
I'm misconfiguring something, so upgrading to 8.2.5 wouldn't really help.



