[DRBD-user] GFS2 freezes

Tue Oct 30 19:03:09 CET 2012

fence_node <peer> doesn't work for me

fence_node node2 says

fence node2 failed
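
For comparison, here is a rough sketch of what the same cluster.conf might look like with a power-based fence agent in place of fence_manual. It assumes both nodes have IPMI-capable management boards, which nothing in this thread confirms; fence_ipmilan is only one of several possible agents, and the device names, BMC addresses, and credentials are placeholders, not values from this setup. The config_version is bumped from 3 to 4 on the assumption that this would replace the file quoted below.

```xml
<?xml version="1.0"?>
<!-- Sketch only: assumes IPMI-capable nodes. Device names, BMC addresses
     and credentials are hypothetical placeholders. -->
<cluster name="cluster1" config_version="4">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="node1" votes="1" nodeid="1">
      <fence>
        <method name="ipmi">
          <!-- Refers to the fencedevice that can power-cycle node1 -->
          <device name="ipmi_node1" action="reboot"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" votes="1" nodeid="2">
      <fence>
        <method name="ipmi">
          <!-- Refers to the fencedevice that can power-cycle node2 -->
          <device name="ipmi_node2" action="reboot"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <fencedevices>
    <!-- One device per node, each pointing at that node's management board -->
    <fencedevice name="ipmi_node1" agent="fence_ipmilan" ipaddr="BMC_ADDRESS_NODE1" login="BMC_USER" passwd="BMC_PASSWORD"/>
    <fencedevice name="ipmi_node2" agent="fence_ipmilan" ipaddr="BMC_ADDRESS_NODE2" login="BMC_USER" passwd="BMC_PASSWORD"/>
  </fencedevices>
</cluster>
```

With an agent that can actually reset the peer, running fence_node node2 from node1 should power-cycle node2, which is the test Digimer describes in the quoted reply below.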

Regards,
Zohair Raza

On Tue, Oct 30, 2012 at 7:58 PM, Digimer <lists at alteeve.ca> wrote:

> Manual fencing is not in any way supported. You must be able to call
> 'fence_node <peer>' and have the remote node reset. If this doesn't
> happen, your fencing is not sufficient.
>
> On 10/30/2012 05:43 AM, Zohair Raza wrote:
> > I have rebuilt the setup and enabled fencing.
> >
> > Manual fencing (fence_ack_manual) works okay when I fence a dead
> > node from the command line, but it does not happen automatically.
> >
> > Logs:
> > Oct 30 12:05:52 node1 kernel: dlm: closing connection to node 2
> > Oct 30 12:05:52 node1 fenced[1414]: fencing node node2
> > Oct 30 12:05:52 node1 kernel: GFS2: fsid=cluster1:gfs.0: jid=1: Trying to acquire journal lock...
> > Oct 30 12:05:52 node1 fenced[1414]: fence node2 dev 0.0 agent fence_manual result: error from agent
> > Oct 30 12:05:52 node1 fenced[1414]: fence node2 failed
> > Oct 30 12:05:55 node1 fenced[1414]: fencing node node2
> >
> > Cluster.conf:
> >
> > <?xml version="1.0"?>
> > <cluster name="cluster1" config_version="3">
> > <cman two_node="1" expected_votes="1"/>
> > <clusternodes>
> > <clusternode name="node1" votes="1" nodeid="1">
> >         <fence>
> >                 <method name="single">
> >                         <device name="manual" ipaddr="192.168.23.128"/>
> >                 </method>
> >         </fence>
> > </clusternode>
> > <clusternode name="node2" votes="1" nodeid="2">
> >         <fence>
> >                 <method name="single">
> >                         <device name="manual" ipaddr="192.168.23.129"/>
> >                 </method>
> >         </fence>
> > </clusternode>
> > </clusternodes>
> > <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
> > <fencedevices>
> >         <fencedevice name="manual" agent="fence_manual"/>
> > </fencedevices>
> > </cluster>
> >
> > drbd.conf:
> >
> > resource res0 {
> >   protocol C;
> >   startup {
> >     wfc-timeout 20;
> >     degr-wfc-timeout 10;
> >     # we will keep this commented until tested successfully:
> >      become-primary-on both;
> >   }
> >   net {
> >     # the encryption part can be omitted when using a dedicated link for DRBD only:
> >     # cram-hmac-alg sha1;
> >     # shared-secret anysecrethere123;
> >     allow-two-primaries;
> >   }
> >   on node1 {
> >     device /dev/drbd0;
> >     disk /dev/sdb1;
> >     address 192.168.23.128:7789;
> >     meta-disk internal;
> >   }
> >   on node2 {
> >     device /dev/drbd0;
> >     disk /dev/sdb1;
> >     address 192.168.23.129:7789;
> >     meta-disk internal;
> >   }
> > }
> >
> > Regards,
> > Zohair Raza
> >
> >
> > On Tue, Oct 30, 2012 at 12:39 PM, Zohair Raza
> > <engineerzuhairraza at gmail.com> wrote:
> >
> >     Hi,
> >
> >     Thanks for the explanation.
> >
> >     On Mon, Oct 29, 2012 at 9:26 PM, Digimer <lists at alteeve.ca> wrote:
> >
> >         When a node stops responding, it cannot be assumed to be dead. It has
> >         to be put into a known state, and that is what fencing does. Disabling
> >         fencing is like driving without a seatbelt.
> >
> >         Ya, it'll save you a bit of time at first, but the first time you get a
> >         split-brain, you are going right through the windshield. Will you think
> >         it was worth it then?
> >
> >         A split-brain is when neither node fails, but they can't communicate
> >         anymore. If each assumes the other is gone and begins using the shared
> >         storage without coordinating with its peer, your data will be corrupted
> >         very, very quickly.
> >
> >
> >     In my scenario, I have two Samba servers in two different locations,
> >     so the chances of this happening are high.
> >
> >
> >         Heck, even if all you had was a floating IP, disabling fencing means
> >         that both nodes would try to use that IP. Ask what your
> >         clients/switches/routers think of that.
> >
> >     I don't need a floating IP, as I am not looking for high availability
> >     but for two Samba servers synced with each other, so roaming employees
> >     can have faster access to their files at both locations. I skipped
> >     fencing as per Mautris's suggestion, but I still can't figure out why
> >     the fencing daemon was not able to fence the other node.
> >
> >
> >         Please read this for more details:
> >
> >         https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing
> >
> >
> >     Will have a look
> >
> >
> >         digimer
> >
> >         On 10/29/2012 03:03 AM, Zohair Raza wrote:
> >         > Hi,
> >         >
> >         > I have set up a Primary/Primary cluster with GFS2.
> >         >
> >         > Everything works fine if I shut down a node cleanly, but when I unplug
> >         > the power of either node, GFS2 freezes and I cannot access the device.
> >         >
> >         > I tried to use http://people.redhat.com/lhh/obliterate
> >         >
> >         > This is what I see in the logs:
> >         >
> >         > Oct 29 08:05:41 node1 kernel: d-con res0: PingAck did not arrive in time.
> >         > Oct 29 08:05:41 node1 kernel: d-con res0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
> >         > Oct 29 08:05:41 node1 kernel: d-con res0: asender terminated
> >         > Oct 29 08:05:41 node1 kernel: d-con res0: Terminating asender thread
> >         > Oct 29 08:05:41 node1 kernel: d-con res0: Connection closed
> >         > Oct 29 08:05:41 node1 kernel: d-con res0: conn( NetworkFailure -> Unconnected )
> >         > Oct 29 08:05:41 node1 kernel: d-con res0: receiver terminated
> >         > Oct 29 08:05:41 node1 kernel: d-con res0: Restarting receiver thread
> >         > Oct 29 08:05:41 node1 kernel: d-con res0: receiver (re)started
> >         > Oct 29 08:05:41 node1 kernel: d-con res0: conn( Unconnected -> WFConnection )
> >         > Oct 29 08:05:41 node1 kernel: d-con res0: helper command: /sbin/drbdadm fence-peer res0
> >         > Oct 29 08:05:41 node1 fence_node[1912]: fence node2 failed
> >         > Oct 29 08:05:41 node1 kernel: d-con res0: helper command: /sbin/drbdadm fence-peer res0 exit code 1 (0x100)
> >         > Oct 29 08:05:41 node1 kernel: d-con res0: fence-peer helper broken, returned 1
> >         > Oct 29 08:05:48 node1 corosync[1346]:   [TOTEM ] A processor failed, forming new configuration.
> >         > Oct 29 08:05:53 node1 corosync[1346]:   [QUORUM] Members[1]: 1
> >         > Oct 29 08:05:53 node1 corosync[1346]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> >         > Oct 29 08:05:53 node1 corosync[1346]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.23.128) ; members(old:2 left:1)
> >         > Oct 29 08:05:53 node1 corosync[1346]:   [MAIN  ] Completed service synchronization, ready to provide service.
> >         > Oct 29 08:05:53 node1 kernel: dlm: closing connection to node 2
> >         > Oct 29 08:05:53 node1 fenced[1401]: fencing node node2
> >         > Oct 29 08:05:53 node1 kernel: GFS2: fsid=cluster-setup:res0.0: jid=1: Trying to acquire journal lock...
> >         > Oct 29 08:05:53 node1 fenced[1401]: fence node2 dev 0.0 agent fence_ack_manual result: error from agent
> >         > Oct 29 08:05:53 node1 fenced[1401]: fence node2 failed
> >         > Oct 29 08:05:56 node1 fenced[1401]: fencing node node2
> >         > Oct 29 08:05:56 node1 fenced[1401]: fence node2 dev 0.0 agent fence_ack_manual result: error from agent
> >         > Oct 29 08:05:56 node1 fenced[1401]: fence node2 failed
> >         > Oct 29 08:05:59 node1 fenced[1401]: fencing node node2
> >         > Oct 29 08:05:59 node1 fenced[1401]: fence node2 dev 0.0 agent fence_ack_manual result: error from agent
> >         > Oct 29 08:05:59 node1 fenced[1401]: fence node2 failed
> >         >
> >         > Regards,
> >         > Zohair Raza
> >         >
> >         >
> >         >
> >         > _______________________________________________
> >         > drbd-user mailing list
> >         > drbd-user at lists.linbit.com
> >         > http://lists.linbit.com/mailman/listinfo/drbd-user
> >         >
> >
> >
> >         --
> >         Digimer
> >         Papers and Projects: https://alteeve.ca/w/
> >         What if the cure for cancer is trapped in the mind of a person without
> >         access to education?
> >
> >
> >
> >
> >
> >
>
>
>