On 05/18/2011 03:19 AM, Felix Frank wrote:
> On 05/17/2011 09:24 PM, listslut at outofoptions.net wrote:
>> I inherited a broken cluster.  With the help of a national vendor I am
>> worse off and 'the good node' is a tad hosed.  I upgraded to the latest
>> kernel and got everything back to this point. (Note, the good node was
>> shut out of the cluster so the other one is still up and working.  That
>> one hangs on an 'ls' command that is why I have my doubts about it).
>> There was data on this node this morning.  I think.  I'd prefer not to
>> hose the data on this node in case the other node has a problem.  I'm
>> hoping I can just bring it up and it syncs and life is good.
>>
>> [root at julius init.d]# df
>> Filesystem           1K-blocks      Used Available Use% Mounted on
>> /dev/cciss/c0d0p1     36562540  10644068  24031240  31% /
>> tmpfs                  6147644         0   6147644   0% /dev/shm
>> [root at julius init.d]# mount /srv/vmdata/
>> /sbin/mount.gfs2: can't open /dev/drbd0: Wrong medium type
>> [root at julius init.d]# service drbd stop
>> Stopping all DRBD resources.
>> [root at julius init.d]# /sbin/drbdadm create-md drbd0
>> md_offset 986671665152
>> al_offset 986671632384
>> bm_offset 986641518592
>>
>> Found some data
>>   ==>  This might destroy existing data!<==
>>
>> Do you want to proceed?
>> [need to type 'yes' to confirm] no
> Good choice.
>
> Activate the DRBD service again. Examine the contents of the device
> using "file -sL /dev/drbd0" (should recognize the gfs2?).
> I believe that if you want to get data back, you may want to run some
> sort of fsck against drbd0.
[root at julius ~]# file -sL /dev/drbd0
/dev/drbd0: GFS2 Filesystem (blocksize 4096, lockproto lock_dlm)
[root at julius ~]#
> Speaking of cluster file systems - is this a dual-primary setup? Is the
> "good node" Primary?
This is a primary/primary setup.
net {
         timeout 50;
         connect-int 10;
         ping-int 10;
         allow-two-primaries;
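
The config only says two primaries are allowed; the roles each node currently holds can be read from /proc/drbd. A rough sketch, assuming the 8.3-style /proc/drbd line layout shown in the comment (the function name drbd_roles is made up for illustration):

```shell
# Print "local peer" roles for each DRBD device line read from stdin.
# Assumes the 8.3-style /proc/drbd layout, for example:
#   0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
drbd_roles() {
    # Keep only lines carrying an ro:<local>/<peer> field and print both roles.
    sed -n 's/.*ro:\([A-Za-z]*\)\/\([A-Za-z]*\).*/\1 \2/p'
}

# Usage on a live node (requires the drbd module to be loaded):
#   drbd_roles < /proc/drbd
```

If the locked-out node reports its peer as Unknown, the two nodes are not connected, whatever the config allows.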
> In this case, you will most likely face split brain either way, so
> syncing back up won't be easy. It may then be your best shot (and most
> simple solution) to heal your "good node" (find out what's blocking it)
> and use that as sync source.
It was set up so all the VMs were on one node and the other was for
failover, so split brain shouldn't really be an issue.
> On the "good node", could it be that the dlm is biting you since the
> peer node is in trouble?
>
I get this on the node that is locked out, so I guess you may be right:
[root at julius ~]# mount /srv/vmdata/
/sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
<repeats>
/sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
/sbin/mount.gfs2: gfs_controld not running
/sbin/mount.gfs2: error mounting lockproto lock_dlm
There is no error recovery at all specified in the config file.
Would running this from the "good node" be the next step?
drbdadm -- --overwrite-data-of-peer primary resource
(resource would be drbd0 from the config file)
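
On the missing error recovery noted above: DRBD 8.3 lets automatic split-brain policies be set in the net section. A minimal sketch, assuming the resource is named drbd0 as above; the policy values are illustrative assumptions, not taken from this cluster's config, and automatic recovery can discard one node's changes, so they would need review before use:

```
resource drbd0 {
    net {
        allow-two-primaries;
        # Illustrative policies only, not from the real config:
        after-sb-0pri discard-zero-changes;  # neither node Primary at split brain
        after-sb-1pri discard-secondary;     # one node Primary at split brain
        after-sb-2pri disconnect;            # both Primary: no automatic resolution
    }
}
```

With after-sb-2pri set to disconnect, a split brain between two primaries still has to be resolved by hand.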
I greatly appreciate your taking the time to help.
Thank You