[DRBD-user] Trouble getting node to re-join two node cluster (OCFS2/DRBD Primary/Primary)

Thu Sep 22 17:29:12 CEST 2011

Mike, 

One issue in your CIB (though it may not be the cause of this problem) is the order statement that specifies promote: 
order ordDRBDDLM inf: msDRBD:promote cloneDLM 
If you explicitly name an action (promote) for one resource, that action is applied to every resource in the statement that has no action of its own. So it should be: 
order ordDRBDDLM inf: msDRBD:promote cloneDLM:start 
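
To make the defaulting concrete, this is how I would expect the two forms to be read. It is a sketch of the rule described above, assuming cloneDLM is an ordinary clone as its name suggests; it is not output from your cluster, and the "#" lines are annotations only:

```
# As written: cloneDLM has no action of its own, so it inherits "promote",
# which is not a meaningful action for an ordinary (non master/slave) clone.
order ordDRBDDLM inf: msDRBD:promote cloneDLM:promote

# Intended: start the DLM clone only after DRBD has been promoted.
order ordDRBDDLM inf: msDRBD:promote cloneDLM:start
```

The same reasoning applies to any other order statement in the CIB that names an action on only one side.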

Have you tried just rebooting the offending node? I know that's not the greatest answer, but the node isn't serving anything right now anyway. 

Also, could you attach the logs from both nodes covering the time the disconnect happened? 

Jake 

----- Original Message -----

> From: "Mike Reid" <MBReid at thepei.com>
> To: drbd-user at lists.linbit.com
> Sent: Thursday, September 15, 2011 4:50:44 PM
> Subject: [DRBD-user] Trouble getting node to re-join two node cluster
> (OCFS2/DRBD Primary/Primary)

> Hello all,

> ** I have also posted this in the OCFS2/pacemaker list, but one response
> indicated it may be more specific to DRBD? **

> We have a two-node cluster still in development that has been running fine
> for weeks (little to no traffic). I made some updates to our CIB recently,
> and everything seemed just fine.

> Yesterday I attempted to untar ~1.5GB to the OCFS2/DRBD volume, and once it
> was complete one of the nodes had become completely disconnected and I
> haven't been able to reconnect since.

> DRBD is working fine, everything is UpToDate and I can get both nodes in
> Primary/Primary, but when it comes down to starting OCFS2 and mounting the
> volume, I'm left with:

> > resFS:0_start_0 (node=node1, call=21, rc=1, status=complete): unknown error

> I am using "pcmk" as the cluster_stack, and letting Pacemaker control
> everything...

> The last time this happened the only way I was able to resolve it was to
> reformat the device (via mkfs.ocfs2 -F). I don't think I should have to do
> this, underlying blocks seem fine, and one of the nodes is running just
> fine. The (currently) unmounted node is staying in sync as far as DRBD is
> concerned.

> Here's some detail that hopefully will help, please let me know if there's
> anything else I can provide to help know the best way to get this node back
> "online":

> Ubuntu 10.10 / Kernel 2.6.35

> Pacemaker 1.0.9.1
> Corosync 1.2.1
> Cluster Agents 1.0.3 (Heartbeat)
> Cluster Glue 1.0.6
> OpenAIS 1.1.2

> DRBD 8.3.10
> OCFS2 1.5.0

> cat /sys/fs/ocfs2/cluster_stack = pcmk

> node1: mounted.ocfs2 -d

> Device      FS     UUID                                  Label
> /dev/sda3   ocfs2  fe4273e1-f866-4541-bbcf-66c5dfd496d6

> node2: mounted.ocfs2 -d

> Device      FS     UUID                                  Label
> /dev/sda3   ocfs2  d6f7cc6d-21d1-46d3-9792-bc650736a5ef
> /dev/drbd0  ocfs2  d6f7cc6d-21d1-46d3-9792-bc650736a5ef

> * NOTES:
> - Both nodes are identical, in fact one node is a direct mirror (hdd clone)
> - I have attached the CIB (crm configure edit contents) and mount trace

> ------ End of Forwarded Message
