[DRBD-user] DOPD problem and Heartbeat

Wed Apr 18 21:41:53 CEST 2012

> On Wed, Apr 18, 2012 at 07:55:32PM +0200, aluno3 at poczta.onet.pl wrote:
> > Hello
> > 
> > We are testing the DOPD mechanism and reviewing the source of dopd.c
> > (http://hg.linux-ha.org/lha-2.1/file/1d5b54f0a2e0/contrib/drbd-outdate-peer/dopd.c).
> 
> Would you please use heartbeat 3
> (and pacemaker, unless you use the haresources mode of heartbeat).
> 
> When using pacemaker, use the drbd crm-fence-peer.sh.
> It covers all the cases dopd would cover, and in fact even a couple more
> corner cases in multiple failure scenarios.

We would like to use heartbeat 3 with the newer CRM, but our front end has not been adapted yet...

> 
> > 
> > Is it OK that the loop in check_drbd_peer first checks each node's
> > status and, if the node is dead, returns FALSE even when that node
> > is a ping node? The next part of the code checks whether the node is
> > a 'normal' node, but by then it is too late.
> 
> Then I guess we have to fix that.

Maybe the fix should look like this:

--- ./heartbeat/contrib/drbd-outdate-peer/dopd.c  2008-08-18 14:32:19.000000000 +0200
+++ ./heartbeat-dopdfix/contrib/drbd-outdate-peer/dopd.c  2012-04-18 20:10:41.000000000 +0200
@@ -226,7 +226,7 @@ check_drbd_peer(const char *drbd_peer)
        }
        while((node = dopd_cluster_conn->llc_ops->nextnode(dopd_cluster_conn)) != NULL) {
                const char *status = dopd_cluster_conn->llc_ops->node_status(dopd_cluster_conn, node);
-               if (!strcmp(status, "dead")) {
+               if (!strcmp(status, "dead") && !strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))) {
                        cl_log(LOG_WARNING, "Cluster node: %s: status: %s",
                               node, status);
                        return FALSE;
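
To show the intent outside of heartbeat, here is a minimal standalone sketch. It assumes a mock node table in place of the real nextnode()/node_status()/node_type() calls; struct mock_node, both function names and the "active" status string are made up for illustration and are not part of dopd or the heartbeat API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one cluster membership entry. */
struct mock_node {
    const char *name;
    const char *type;   /* "normal" or "ping" */
    const char *status; /* "dead", or anything else for a live node */
};

/* Current behaviour: any dead node, even a ping node, aborts the check. */
int all_nodes_ok_old(const struct mock_node *nodes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!strcmp(nodes[i].status, "dead"))
            return 0;
    }
    return 1;
}

/* Proposed behaviour: only a dead "normal" node aborts the check. */
int all_nodes_ok_fixed(const struct mock_node *nodes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!strcmp(nodes[i].status, "dead") &&
            !strcmp(nodes[i].type, "normal"))
            return 0;
    }
    return 1;
}
```

With a table holding a dead ping node and a live normal peer, the old check gives up while the proposed one carries on. A dead normal node still stops both.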

> 
> > Suppose you have:
> > - a configured ping node,
> > - timeouts: ping-int 10, deadping 10, deadtime 30
> > 
> > and both the replication link and the ping node go down, so dopd
> > starts working. check_drbd_peer finds a dead node (the ping node is
> > dead, the remote 'normal' node is fine), returns FALSE, and therefore
> > does not mark the remote volumes as outdated over the other,
> > auxiliary path. Unfortunately, we hit exactly this problem in testing.
> > 
> > We know that DRBD timeouts have to be lower than heartbeat timeouts,
> > but when dopd has to mark a lot of remote resources, it cannot do
> > that in time. It is easy to hit a race.
> 
> -- 
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com