Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Tue, 19 Feb 2008, Florian Haas wrote:
>> I'm seeing the above error if I try to dual-mount the DRBD resource while
>> the local resource is still syncing off the remote resource.
>>
>> I have a primary-primary setup with GFS.
>>
>> Once the above error is encountered, GFS aborts joining the cluster.
>
> Logs? Configuration details? You didn't really give us much to work with...

This appears to be a strange compound problem.

Using DRBD 8.0.8, performance is quite acceptable in the single-node case. Unfortunately, when I try to dual-mount the GFS file system before the secondary is fully up to date (but is connected and syncing), the second node to join notices an inconsistency and withdraws from the cluster. In the process GFS gets corrupted, and the only way to get it to mount again on either node is to repair it with fsck.

If the nodes are in sync by the time GFS tries to mount, the mount succeeds, but everything grinds to a halt shortly afterwards - so much so that the only way to get things moving again is to hard-reset one of the nodes, preferably the second one to join.

Here is where the second thing that seems wrong happens: the first node doesn't just lock up at this point, as one might expect. When a connected node disappears (e.g. due to a hard reset), the cluster is supposed to keep trying to fence it until it cleanly rejoins - and it can't possibly fence the other node, since I haven't configured any fencing devices yet (see the sketch at the end of this message for the kind of cluster.conf fencing stanza I have in mind). That doesn't seem to happen; the first node just carries on as if nothing had happened. This is possibly connected to the fact that by this point GFS is corrupted and has to be fsck-ed at the next boot.

This part may be a cluster setup issue, so I'll raise it on the cluster list, although it looks like a DRBD-specific peculiarity - a SAN-backed setup doesn't show this problem with a nearly identical cluster.conf (the only difference being the block device specification).

I tried 8.0.11, and its performance seems to be very poor - so poor that I've not yet had the patience to wait for the first node to boot. (I'm using Open Shared Root on this, so my DRBD file system is actually the root file system.) So, for now, I'm back to using 8.0.8.

I am using the binary RPMs for drbd and kmod-drbd from the CentOS extras repository.

The nodes' cluster interface is a direct crossover connection between the two nodes on Gigabit Ethernet, so performance there shouldn't be an issue. Syncing the block device happens at approximately 30 MB/s, which I suspect is disk-bound (2x 36 GB U160 SCSI in RAID0).

The drbd.conf file is as follows:

global {
    usage-count no;
}

common {
    syncer {
        rate 64M;
    }
}

resource drbd1 {
    protocol C;

    net {
        allow-two-primaries;
        after-sb-0pri discard-younger-primary;
        after-sb-1pri discard-secondary;
        after-sb-2pri call-pri-lost-after-sb;
        cram-hmac-alg sha1;
        shared-secret "secret";
    }

    startup {
        wfc-timeout       60;
        degr-wfc-timeout  60;
        become-primary-on both;
    }

    on sentinel1c {
        device    /dev/drbd1;
        disk      /dev/sda6;
        address   10.0.0.1:7789;
        meta-disk internal;
    }

    on sentinel2c {
        device    /dev/drbd1;
        disk      /dev/sda6;
        address   10.0.0.2:7789;
        meta-disk internal;
    }
}

Another thing that seems to happen is that a node ends up somehow losing access to its file system. After some arbitrary period of being idle, the whole thing just blocks: the node pings and ssh responds, but anything that touches the file system times out. Even logging in on the console doesn't work. Getting at the logs is a bit difficult with OSR (they get reset on reboot), so I don't have them at the moment.
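For reference, this is roughly the manual-fencing cluster.conf stanza I have in mind once fencing is set up - a minimal sketch only, with the node names taken from drbd.conf above; the cluster name, node IDs and the fence_manual agent are placeholders, not what is actually deployed:

<?xml version="1.0"?>
<cluster name="sentinel" config_version="1">
  <!-- two_node mode so a lone node can stay quorate (placeholder values) -->
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="sentinel1c" nodeid="1">
      <fence>
        <method name="single">
          <device name="manual" nodename="sentinel1c"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="sentinel2c" nodeid="2">
      <fence>
        <method name="single">
          <device name="manual" nodename="sentinel2c"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <!-- fence_manual stands in until a real fence device is available -->
    <fencedevice name="manual" agent="fence_manual"/>
  </fencedevices>
</cluster>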
Has anyone seen anything like this problem before? I'm assuming for now that I'm misconfiguring something, so upgrading to 8.2.5 wouldn't really help.

TIA.

Gordan