[DRBD-user] GFS2 freezes

Zohair Raza engineerzuhairraza at gmail.com
Tue Oct 30 13:43:43 CET 2012



I have rebuilt the setup and enabled fencing.

Manual fencing (fence_ack_manual) works okay when I fence a dead node from
the command line, but it does not happen automatically.
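
(For reference, the manual acknowledgement mentioned above is fence_ack_manual;
the exact syntax varies with the cluster version, for example:

    fence_ack_manual node2         # current releases
    fence_ack_manual -n node2      # older releases

fence_manual itself cannot reset anything, so fenced keeps retrying, as in the
log below, until the dead node is acknowledged by hand or the device is
replaced with an agent that can actually power-cycle the node.)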

Logs:
Oct 30 12:05:52 node1 kernel: dlm: closing connection to node 2
Oct 30 12:05:52 node1 fenced[1414]: fencing node node2
Oct 30 12:05:52 node1 kernel: GFS2: fsid=cluster1:gfs.0: jid=1: Trying to acquire journal lock...
Oct 30 12:05:52 node1 fenced[1414]: fence node2 dev 0.0 agent fence_manual result: error from agent
Oct 30 12:05:52 node1 fenced[1414]: fence node2 failed
Oct 30 12:05:55 node1 fenced[1414]: fencing node node2

Cluster.conf:

<?xml version="1.0"?>
<cluster name="cluster1" config_version="3">
<cman two_node="1" expected_votes="1"/>
<clusternodes>
<clusternode name="node1" votes="1" nodeid="1">
        <fence>
                <method name="single">
                        <device name="manual" ipaddr="192.168.23.128"/>
                </method>
        </fence>
</clusternode>
<clusternode name="node2" votes="1" nodeid="2">
        <fence>
                <method name="single">
                        <device name="manual" ipaddr="192.168.23.129"/>
                </method>
        </fence>
</clusternode>
</clusternodes>
<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
<fencedevices>
        <fencedevice name="manual" agent="fence_manual"/>
</fencedevices>
</cluster>
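
For fencing to happen without operator intervention, the manual device above
would have to be replaced with an agent that can actually power-cycle the
nodes. A rough sketch with fence_ipmilan, assuming each node has an
out-of-band management interface (the addresses and credentials below are
placeholders, not taken from this setup):

<clusternode name="node1" votes="1" nodeid="1">
        <fence>
                <method name="single">
                        <device name="ipmi_node1"/>
                </method>
        </fence>
</clusternode>
...
<fencedevices>
        <fencedevice name="ipmi_node1" agent="fence_ipmilan" ipaddr="10.0.0.1" login="admin" passwd="secret"/>
        <fencedevice name="ipmi_node2" agent="fence_ipmilan" ipaddr="10.0.0.2" login="admin" passwd="secret"/>
</fencedevices>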

drbd.conf:

resource res0 {
  protocol C;
  startup {
    wfc-timeout 20;
    degr-wfc-timeout 10;
    # we will keep this commented until tested successfully:
     become-primary-on both;
  }
  net {
    # the encryption part can be omitted when using a dedicated link for DRBD only:
    # cram-hmac-alg sha1;
    # shared-secret anysecrethere123;
    allow-two-primaries;
  }
  on node1 {
    device /dev/drbd0;
    disk /dev/sdb1;
    address 192.168.23.128:7789;
    meta-disk internal;
  }
  on node2 {
    device /dev/drbd0;
    disk /dev/sdb1;
    address 192.168.23.129:7789;
    meta-disk internal;
  }
}
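
The resource above shows no fencing policy and no fence-peer handler, yet the
earlier log has DRBD invoking /sbin/drbdadm fence-peer, so that part of the
configuration presumably lives in global_common.conf. For a dual-primary GFS2
setup it usually looks roughly like the sketch below; the handler choice and
path are assumptions on my side (the obliterate script mentioned in the quoted
mail further down is one such handler, fencing the peer by calling fence_node):

resource res0 {
  disk {
    # suspend I/O and call the fence-peer handler when the peer is lost
    fencing resource-and-stonith;
  }
  handlers {
    # a return code of 7 tells DRBD the peer was fenced and I/O may resume;
    # adjust the path to wherever the script is actually installed
    fence-peer "/usr/lib/drbd/obliterate-peer.sh";
  }
  ...
}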

Regards,
Zohair Raza


On Tue, Oct 30, 2012 at 12:39 PM, Zohair Raza
<engineerzuhairraza at gmail.com> wrote:

> Hi,
>
> thanks for explanation
>
> On Mon, Oct 29, 2012 at 9:26 PM, Digimer <lists at alteeve.ca> wrote:
>
>> When a node stops responding, it cannot be assumed to be dead. It has
>> to be put into a known state, and that is what fencing does. Disabling
>> fencing is like driving without a seatbelt.
>>
>> Ya, it'll save you a bit of time at first, but the first time you get a
>> split-brain, you are going right through the windshield. Will you think
>> it was worth it then?
>>
>> A split-brain is when neither node fails, but they can't communicate
>> anymore. If each assumes the other is gone and begins using the shared
>> storage without coordinating with its peer, your data will be corrupted
>> very, very quickly.
>>
>
> In my scenario, I have two Samba servers in two different locations, so
> the chances of this are obvious
>
>>
>> Heck, even if all you had was a floating IP, disabling fencing means
>> that both nodes would try to use that IP. Ask what your
>> clients/switches/routers think of that.
>>
> I don't need a floating IP, as I am not looking for high availability but
> for two Samba servers synced with each other, so roaming employees can have
> faster access to their files at both locations. I skipped fencing as per
> Mautris's suggestion, but I still can't figure out why the fencing daemon
> was not able to fence the other node.
>
>
>> Please read this for more details;
>>
>>
>> https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing
>>
>>
> Will have a look
>
>
>> digimer
>>
>> On 10/29/2012 03:03 AM, Zohair Raza wrote:
>> > Hi,
>> >
>> > I have setup a Primary/Primary cluster with GFS2.
>> >
>> > All works well if I shut down either node cleanly, but when I unplug
>> > the power of a node, GFS2 freezes and I cannot access the device.
>> >
>> > Tried to use http://people.redhat.com/lhh/obliterate
>> >
>> > this is what I see in logs
>> >
>> > Oct 29 08:05:41 node1 kernel: d-con res0: PingAck did not arrive in
>> time.
>> > Oct 29 08:05:41 node1 kernel: d-con res0: peer( Primary -> Unknown )
>> > conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0
>> > -> 1 )
>> > Oct 29 08:05:41 node1 kernel: d-con res0: asender terminated
>> > Oct 29 08:05:41 node1 kernel: d-con res0: Terminating asender thread
>> > Oct 29 08:05:41 node1 kernel: d-con res0: Connection closed
>> > Oct 29 08:05:41 node1 kernel: d-con res0: conn( NetworkFailure ->
>> > Unconnected )
>> > Oct 29 08:05:41 node1 kernel: d-con res0: receiver terminated
>> > Oct 29 08:05:41 node1 kernel: d-con res0: Restarting receiver thread
>> > Oct 29 08:05:41 node1 kernel: d-con res0: receiver (re)started
>> > Oct 29 08:05:41 node1 kernel: d-con res0: conn( Unconnected ->
>> > WFConnection )
>> > Oct 29 08:05:41 node1 kernel: d-con res0: helper command: /sbin/drbdadm
>> > fence-peer res0
>> > Oct 29 08:05:41 node1 fence_node[1912]: fence node2 failed
>> > Oct 29 08:05:41 node1 kernel: d-con res0: helper command: /sbin/drbdadm
>> > fence-peer res0 exit code 1 (0x100)
>> > Oct 29 08:05:41 node1 kernel: d-con res0: fence-peer helper broken,
>> > returned 1
>> > Oct 29 08:05:48 node1 corosync[1346]:   [TOTEM ] A processor failed,
>> > forming new configuration.
>> > Oct 29 08:05:53 node1 corosync[1346]:   [QUORUM] Members[1]: 1
>> > Oct 29 08:05:53 node1 corosync[1346]:   [TOTEM ] A processor joined or
>> > left the membership and a new membership was formed.
>> > Oct 29 08:05:53 node1 corosync[1346]:   [CPG   ] chosen downlist: sender
>> > r(0) ip(192.168.23.128) ; members(old:2 left:1)
>> > Oct 29 08:05:53 node1 corosync[1346]:   [MAIN  ] Completed service
>> > synchronization, ready to provide service.
>> > Oct 29 08:05:53 node1 kernel: dlm: closing connection to node 2
>> > Oct 29 08:05:53 node1 fenced[1401]: fencing node node2
>> > Oct 29 08:05:53 node1 kernel: GFS2: fsid=cluster-setup:res0.0: jid=1:
>> > Trying to acquire journal lock...
>> > Oct 29 08:05:53 node1 fenced[1401]: fence node2 dev 0.0 agent
>> > fence_ack_manual result: error from agent
>> > Oct 29 08:05:53 node1 fenced[1401]: fence node2 failed
>> > Oct 29 08:05:56 node1 fenced[1401]: fencing node node2
>> > Oct 29 08:05:56 node1 fenced[1401]: fence node2 dev 0.0 agent
>> > fence_ack_manual result: error from agent
>> > Oct 29 08:05:56 node1 fenced[1401]: fence node2 failed
>> > Oct 29 08:05:59 node1 fenced[1401]: fencing node node2
>> > Oct 29 08:05:59 node1 fenced[1401]: fence node2 dev 0.0 agent
>> > fence_ack_manual result: error from agent
>> > Oct 29 08:05:59 node1 fenced[1401]: fence node2 failed
>> >
>> > Regards,
>> > Zohair Raza
>> >
>> >
>> >
>>
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.ca/w/
>> What if the cure for cancer is trapped in the mind of a person without
>> access to education?
>>
>
>