On 18.04.2012 21:49, Lars Ellenberg wrote:
> On Wed, Apr 18, 2012 at 09:41:53PM +0200, aluno3 wrote:
>>> On Wed, Apr 18, 2012 at 07:55:32PM +0200, aluno3 at poczta.onet.pl wrote:
>>>> Hello
>>>>
>>>> We are testing DOPD mechanism and reviewing source of the dopd file
>>>> (http://hg.linux-ha.org/lha-2.1/file/1d5b54f0a2e0/contrib/drbd-outdate-peer/dopd.c).
>>> Would you please use heartbeat 3
>>> (and pacemaker, unless you use the haresource mode of heartbeat)
>>>
>>> When using pacemaker, use the drbd crm-fence-peer.sh.
>>> It covers all the cases dopd would cover, and in fact even a couple more
>>> corner cases in multiple failure scenarios.
>> We would like to use heartbeat 3 with newer crm but our front end is not adapted yet...
>>
>>>> Is it ok that in function check_drbd_peer, during loop, at the
>>>> beginning is checking status of the node and in case if node is dead
>>>> then function is finishing with returning FALSE even if node is ping
>>>> node? Next part of the code checks if node is 'normal' node, but it
>>>> is too late.
>>> Then I guess we have to fix that.
>> Maybe fix should look like:
>>
>> --- ./heartbeat/contrib/drbd-outdate-peer/dopd.c 2008-08-18 14:32:19.000000000 +0200
>> +++ ./heartbeat-dopdfix/contrib/drbd-outdate-peer/dopd.c 2012-04-18 20:10:41.000000000 +0200
>> @@ -226,7 +226,7 @@ check_drbd_peer(const char *drbd_peer)
>>      }
>>      while((node = dopd_cluster_conn->llc_ops->nextnode(dopd_cluster_conn)) != NULL) {
>>          const char *status = dopd_cluster_conn->llc_ops->node_status(dopd_cluster_conn, node);
>> -        if (!strcmp(status, "dead")) {
>> +        if (!strcmp(status, "dead")&& !strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))) {
>>              cl_log(LOG_WARNING, "Cluster node: %s: status: %s",
>>                  node, status);
>>              return FALSE;
> I'd say, it should rather look like (against heartbeat 3 source, so it
> may or may not directly apply on your tree; probably best to just copy
> over all of contrib/drbd-outdate-peer from 3):
>
> diff --git a/contrib/drbd-outdate-peer/dopd.c b/contrib/drbd-outdate-peer/dopd.c
> --- a/contrib/drbd-outdate-peer/dopd.c
> +++ b/contrib/drbd-outdate-peer/dopd.c
> @@ -226,19 +226,26 @@ check_drbd_peer(const char *drbd_peer)
>      }
>      while((node = dopd_cluster_conn->llc_ops->nextnode(dopd_cluster_conn)) != NULL) {
>          const char *status = dopd_cluster_conn->llc_ops->node_status(dopd_cluster_conn, node);
> +
> +        /* Look for the peer */
> +        if (strcasecmp(node, drbd_peer))
> +            continue;
> +
> +        if (strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))) {
> +            cl_log(LOG_WARNING, "Cluster node: %s: status: %s is not a normal node",
> +                node, status);
> +            break;
> +        }
> +
>          if (!strcmp(status, "dead")) {
>              cl_log(LOG_WARNING, "Cluster node: %s: status: %s",
>                  node, status);
> -            return FALSE;
> +            break;
>          }
>
> -        /* Look for the peer */
> -        if (!strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))
> -                && !strcasecmp(node, drbd_peer)) {
> -            cl_log(LOG_DEBUG, "node %s found\n", node);
> -            found = TRUE;
> -            break;
> -        }
> +        cl_log(LOG_DEBUG, "node %s found with status %s\n", node, status);
> +        found = TRUE;
> +        break;
>      }
>      if (dopd_cluster_conn->llc_ops->end_nodewalk(dopd_cluster_conn) != HA_OK) {
>          cl_log(LOG_INFO, "Cannot end node walk");
>
>
> Not even compile tested, but I think this is what it should look like.
>

After fast test, looks like fix is working. Thanks for help.

>>>> In case when you have:
>>>> -configured ping node,
>>>> -timeouts: ping-int 10, deadping 10, deadtime 30
>>>>
>>>> and link from replication, ping node down, dopd starts working. Function
>>>> check_drbd_peer checks if status of the node is dead (ping node is
>>>> dead, remote/normal node is ok) and if yes, ends with returning
>>>> FALSE and does not mark remote volumes as outdated with using other
>>>> auxiliary path. Unfortunately during test such problem occurred.
>>>>
>>>> We know that DRBD timeouts have to be lower than heartbeat timeouts, but
>>>> in case when dopd has to mark a lot of remote resources, it cannot do
>>>> that in time. It is easy to race.
>>> --
>>> : Lars Ellenberg
>>> : LINBIT | Your Way to High Availability
>>> : DRBD/HA support and consulting http://www.linbit.com
>>>
>>> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>>> __
>>> please don't Cc me, but send to list -- I'm subscribed
>>> _______________________________________________
>>> drbd-user mailing list
>>> drbd-user at lists.linbit.com
>>> http://lists.linbit.com/mailman/listinfo/drbd-user
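
For readers who want to try the reordered check outside of heartbeat, here is a
minimal standalone sketch of the decision logic in the second patch above. It is
not the dopd code: struct node_info and peer_is_usable() are hypothetical names,
a plain array stands in for the nextnode() walk, and the "ping" type and
"active" status strings are assumptions used only for illustration. The idea it
shows is the one from the patch: skip every node that is not the DRBD peer
first, so a dead ping node can no longer end the search early.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical stand-in for one node returned by the cluster node walk. */
struct node_info {
    const char *name;
    const char *type;    /* "normal" for cluster members; "ping" assumed for ping nodes */
    const char *status;  /* "dead", or anything else (e.g. "active", assumed) */
};

/*
 * Mirrors the order of checks in the second patch:
 *   1. ignore every node that is not the DRBD peer,
 *   2. give up if the peer is not a "normal" node,
 *   3. give up if the peer is "dead",
 *   4. otherwise the peer was found and may be asked to outdate itself.
 */
bool peer_is_usable(const struct node_info *nodes, size_t count,
                    const char *drbd_peer)
{
    for (size_t i = 0; i < count; i++) {
        /* Look for the peer; other nodes (including dead ping nodes) are skipped. */
        if (strcasecmp(nodes[i].name, drbd_peer) != 0)
            continue;

        if (strcmp(nodes[i].type, "normal") != 0)
            break;      /* name matched, but it is not a normal node */

        if (strcmp(nodes[i].status, "dead") == 0)
            break;      /* the peer itself is dead */

        return true;    /* peer found and not dead */
    }
    return false;       /* peer missing, not normal, or dead */
}
```

With a node list holding a dead ping node followed by a live normal peer, this
returns true for the peer, whereas the old ordering (status check before the
name check) would have bailed out on the dead ping node.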