[DRBD-user] GFS2 freezes

Mon Oct 29 22:46:34 CET 2012

On 10/29/12 9:43 AM, Maurits van de Lande wrote:
> Hello,
>
> When one node shuts down unexpectedly, dlm blocks until quorum is
> regained AND the faulty node has been fenced. Only then can the
> surviving node take over the cluster resources.
>
> I assume that you have set the "two_node" flag in cluster.conf.
>
>> Oct 29 08:05:59 node1 fenced[1401]: fence node2 dev 0.0 agent fence_ack_manual result: error from agent
>> Oct 29 08:05:59 node1 fenced[1401]: fence node2 failed
>
> I think that adding the following option to the dlm section in
> cluster.conf
>
> enable_fencing="0"
>
> might solve this problem (I have not tested this). This will disable
> fencing.
>
> Or you can set up fencing.
>
> Best regards,
>
> Maurits van de Lande
> 
> *From:* drbd-user-bounces at lists.linbit.com
> [mailto:drbd-user-bounces at lists.linbit.com] *On behalf of* Zohair Raza
> *Sent:* Monday, 29 October 2012 11:03
> *To:* drbd-user at lists.linbit.com
> *Subject:* [DRBD-user] GFS2 freezes
> 
> Hi,
>
> I have set up a Primary/Primary cluster with GFS2.
>
> Everything works fine if I shut down either node cleanly, but when I
> unplug the power of either node, GFS2 freezes and I cannot access the
> device.
>
> I tried to use http://people.redhat.com/lhh/obliterate
>
> This is what I see in the logs:
>
> Oct 29 08:05:41 node1 kernel: d-con res0: PingAck did not arrive in time.
> Oct 29 08:05:41 node1 kernel: d-con res0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
> Oct 29 08:05:41 node1 kernel: d-con res0: asender terminated
> Oct 29 08:05:41 node1 kernel: d-con res0: Terminating asender thread
> Oct 29 08:05:41 node1 kernel: d-con res0: Connection closed
> Oct 29 08:05:41 node1 kernel: d-con res0: conn( NetworkFailure -> Unconnected )
> Oct 29 08:05:41 node1 kernel: d-con res0: receiver terminated
> Oct 29 08:05:41 node1 kernel: d-con res0: Restarting receiver thread
> Oct 29 08:05:41 node1 kernel: d-con res0: receiver (re)started
> Oct 29 08:05:41 node1 kernel: d-con res0: conn( Unconnected -> WFConnection )
> Oct 29 08:05:41 node1 kernel: d-con res0: helper command: /sbin/drbdadm fence-peer res0
> Oct 29 08:05:41 node1 fence_node[1912]: fence node2 failed
> Oct 29 08:05:41 node1 kernel: d-con res0: helper command: /sbin/drbdadm fence-peer res0 exit code 1 (0x100)
> Oct 29 08:05:41 node1 kernel: d-con res0: fence-peer helper broken, returned 1
> Oct 29 08:05:48 node1 corosync[1346]:   [TOTEM ] A processor failed, forming new configuration.
> Oct 29 08:05:53 node1 corosync[1346]:   [QUORUM] Members[1]: 1
> Oct 29 08:05:53 node1 corosync[1346]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Oct 29 08:05:53 node1 corosync[1346]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.23.128) ; members(old:2 left:1)
> Oct 29 08:05:53 node1 corosync[1346]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Oct 29 08:05:53 node1 kernel: dlm: closing connection to node 2
> Oct 29 08:05:53 node1 fenced[1401]: fencing node node2
> Oct 29 08:05:53 node1 kernel: GFS2: fsid=cluster-setup:res0.0: jid=1: Trying to acquire journal lock...
> Oct 29 08:05:53 node1 fenced[1401]: fence node2 dev 0.0 agent fence_ack_manual result: error from agent
> Oct 29 08:05:53 node1 fenced[1401]: fence node2 failed
> Oct 29 08:05:56 node1 fenced[1401]: fencing node node2
> Oct 29 08:05:56 node1 fenced[1401]: fence node2 dev 0.0 agent fence_ack_manual result: error from agent
> Oct 29 08:05:56 node1 fenced[1401]: fence node2 failed
> Oct 29 08:05:59 node1 fenced[1401]: fencing node node2
> Oct 29 08:05:59 node1 fenced[1401]: fence node2 dev 0.0 agent fence_ack_manual result: error from agent
> Oct 29 08:05:59 node1 fenced[1401]: fence node2 failed
> 
> 
> Regards,
> Zohair Raza

I had a similar problem. It turned out to have nothing directly to do
with DRBD or DLM; the cause was that I had put clvmd under
pacemaker/corosync control. The fix was to start cman and clvmd with
the usual Linux init scripts, and to use pacemaker only to mount the
individual GFS2 filesystems.
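
To check whether a freeze like the one in the quoted log is stuck on
fencing, it can help to count how often fenced reports a failed attempt.
The following is only a minimal sketch in Python: the function name
count_fence_failures and the regular expression are illustrative
assumptions based on the log lines quoted above, not part of any cluster
tool.

```python
import re
from collections import Counter

# Matches fenced lines of the form seen in the quoted log, e.g.
#   "Oct 29 08:05:53 node1 fenced[1401]: fence node2 failed"
# Assumption: other daemons (such as fence_node) are deliberately ignored.
FENCE_FAILED = re.compile(r"fenced\[\d+\]: fence (\S+) failed")


def count_fence_failures(lines):
    """Return a Counter mapping node name to failed fence attempts."""
    failures = Counter()
    for line in lines:
        match = FENCE_FAILED.search(line)
        if match:
            failures[match.group(1)] += 1
    return failures


if __name__ == "__main__":
    import sys

    # Read syslog-style text from standard input and report per node.
    for node, count in count_fence_failures(sys.stdin).items():
        print(f"{node}: {count} failed fence attempt(s)")
```

Fed a syslog file on standard input, the script prints one line per
node. A count that keeps growing while GFS2 is frozen suggests the
cluster is still waiting for a successful fence.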

I hope this helps.

-- 
William Seligman          | http://www.nevis.columbia.edu/~seligman/
Nevis Labs, Columbia Univ |
PO Box 137                |
Irvington NY 10533  USA   | Phone: (914) 591-2823
