[DRBD-user] DOPD problem and Heartbeat

Wed Apr 18 22:40:24 CEST 2012

On 18.04.2012 21:49, Lars Ellenberg wrote:
> On Wed, Apr 18, 2012 at 09:41:53PM +0200, aluno3 wrote:
>>> On Wed, Apr 18, 2012 at 07:55:32PM +0200, aluno3 at poczta.onet.pl wrote:
>>>> Hello
>>>>
>>>> We are testing the DOPD mechanism and reviewing the source of dopd.c
>>>> (http://hg.linux-ha.org/lha-2.1/file/1d5b54f0a2e0/contrib/drbd-outdate-peer/dopd.c).
>>> Would you please use heartbeat 3
>>> (and pacemaker, unless you use the haresources mode of heartbeat)
>>>
>>> When using pacemaker, use the drbd crm-fence-peer.sh.
>>> It covers all the cases dopd would cover, and in fact even a couple more
>>> corner cases in multiple failure scenarios.
>> We would like to use heartbeat 3 with the newer CRM, but our front end
>> has not been adapted yet...
>>
>>>> Is it OK that the loop in check_drbd_peer first checks each node's
>>>> status and, if a node is dead, returns FALSE right away, even if
>>>> that node is only a ping node? The next part of the code does check
>>>> whether the node is a 'normal' node, but by then it is too late.
>>> Then I guess we have to fix that.
>> Maybe the fix should look like this:
>>
>> --- ./heartbeat/contrib/drbd-outdate-peer/dopd.c  2008-08-18 14:32:19.000000000 +0200
>> +++ ./heartbeat-dopdfix/contrib/drbd-outdate-peer/dopd.c  2012-04-18 20:10:41.000000000 +0200
>> @@ -226,7 +226,7 @@ check_drbd_peer(const char *drbd_peer)
>>          }
>>          while((node = dopd_cluster_conn->llc_ops->nextnode(dopd_cluster_conn)) != NULL) {
>>                  const char *status = dopd_cluster_conn->llc_ops->node_status(dopd_cluster_conn, node);
>> -               if (!strcmp(status, "dead")) {
>> +               if (!strcmp(status, "dead") && !strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))) {
>>                          cl_log(LOG_WARNING, "Cluster node: %s: status: %s",
>>                                 node, status);
>>                          return FALSE;
> I'd say, it should rather look like (against heartbeat 3 source, so it
> may or may not directly apply on your tree; probably best to just copy
> over all of contrib/drbd-outdate-peer from 3):
>
> diff --git a/contrib/drbd-outdate-peer/dopd.c b/contrib/drbd-outdate-peer/dopd.c
> --- a/contrib/drbd-outdate-peer/dopd.c
> +++ b/contrib/drbd-outdate-peer/dopd.c
> @@ -226,19 +226,26 @@ check_drbd_peer(const char *drbd_peer)
>   	}
>   	while((node = dopd_cluster_conn->llc_ops->nextnode(dopd_cluster_conn)) != NULL) {
>   		const char *status = dopd_cluster_conn->llc_ops->node_status(dopd_cluster_conn, node);
> +
> +		/* Look for the peer */
> +		if (strcasecmp(node, drbd_peer))
> +			continue;
> +
> +		if (strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))) {
> +			cl_log(LOG_WARNING, "Cluster node: %s: status: %s is not a normal node",
> +			       node, status);
> +			break;
> +		}
> +
>   		if (!strcmp(status, "dead")) {
>   			cl_log(LOG_WARNING, "Cluster node: %s: status: %s",
>   			       node, status);
> -			return FALSE;
> +			break;
>   		}
>
> -		/* Look for the peer */
> -		if (!strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))
> -			&&  !strcasecmp(node, drbd_peer)) {
> -			cl_log(LOG_DEBUG, "node %s found\n", node);
> -			found = TRUE;
> -			break;
> -		}
> +		cl_log(LOG_DEBUG, "node %s found with status %s\n", node, status);
> +		found = TRUE;
> +		break;
>   	}
>   	if (dopd_cluster_conn->llc_ops->end_nodewalk(dopd_cluster_conn) != HA_OK) {
>   		cl_log(LOG_INFO, "Cannot end node walk");
>
>
> Not even compile tested, but I think this is what it should look like.
>
After a quick test, the fix appears to work. Thanks for the help.
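
For anyone following along, here is a minimal standalone sketch (not
heartbeat code) that models the three variants of the loop from this
thread on a mocked node list. struct mock_node, the peer_ok_* functions,
check_scenarios and the node names are made up for illustration; only
the "normal" type, the "dead" status and the comparisons come from the
patches above. It also assumes the function ends up returning whether
the peer was found.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Hypothetical stand-in for one entry of the heartbeat node walk. */
struct mock_node {
    const char *name;
    const char *type;   /* "normal" or "ping" */
    const char *status; /* "dead"; any other value is treated as alive here */
};

/* Original loop: the first dead node of any type ends the search. */
static bool peer_ok_original(const struct mock_node *nodes, size_t n, const char *peer)
{
    for (size_t i = 0; i < n; i++) {
        if (!strcmp(nodes[i].status, "dead"))
            return false;
        if (!strcmp(nodes[i].type, "normal") && !strcasecmp(nodes[i].name, peer))
            return true;
    }
    return false;
}

/* First proposal: only a dead "normal" node ends the search. */
static bool peer_ok_proposal(const struct mock_node *nodes, size_t n, const char *peer)
{
    for (size_t i = 0; i < n; i++) {
        if (!strcmp(nodes[i].status, "dead") && !strcmp(nodes[i].type, "normal"))
            return false;
        if (!strcmp(nodes[i].type, "normal") && !strcasecmp(nodes[i].name, peer))
            return true;
    }
    return false;
}

/* Reworked loop: skip every node except the peer, then judge only the peer. */
static bool peer_ok_reworked(const struct mock_node *nodes, size_t n, const char *peer)
{
    for (size_t i = 0; i < n; i++) {
        if (strcasecmp(nodes[i].name, peer))
            continue;
        if (strcmp(nodes[i].type, "normal"))
            return false; /* the peer is not a normal node */
        return strcmp(nodes[i].status, "dead") != 0;
    }
    return false; /* peer not in the node list */
}

/* Scenario from the thread: ping node dead, peer alive. */
static const struct mock_node ping_down[] = {
    { "gateway", "ping",   "dead"   },
    { "nodeb",   "normal", "active" },
};

/* Three-node case: an unrelated normal node is dead, the peer is alive. */
static const struct mock_node other_down[] = {
    { "nodec", "normal", "dead"   },
    { "nodeb", "normal", "active" },
};

static void check_scenarios(void)
{
    assert(!peer_ok_original(ping_down, 2, "nodeb"));
    assert( peer_ok_proposal(ping_down, 2, "nodeb"));
    assert( peer_ok_reworked(ping_down, 2, "nodeb"));

    assert(!peer_ok_proposal(other_down, 2, "nodeb"));
    assert( peer_ok_reworked(other_down, 2, "nodeb"));
}
```

Calling check_scenarios() from a main() passes: with a dead ping node,
only the original loop fails to find the peer; with an unrelated dead
normal node listed first, the first proposal fails as well, while the
reworked loop still finds the peer.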

>>>> Suppose you have:
>>>> - a configured ping node,
>>>> - timeouts: ping-int 10, deadping 10, deadtime 30
>>>>
>>>> When the replication link and the ping node both go down, dopd starts
>>>> working. check_drbd_peer sees a node with status dead (the ping node
>>>> is dead, while the remote/normal node is fine), returns FALSE, and so
>>>> does not mark the remote volumes as outdated over the other,
>>>> auxiliary path. Unfortunately, we hit exactly this problem in testing.
>>>>
>>>> We know that DRBD timeouts have to be lower than the heartbeat
>>>> timeouts, but when dopd has to mark a lot of remote resources, it
>>>> cannot do so in time. It is easy to hit a race.
>>> -- 
>>> : Lars Ellenberg
>>> : LINBIT | Your Way to High Availability
>>> : DRBD/HA support and consulting http://www.linbit.com
>>>
>>> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>>> __
>>> please don't Cc me, but send to list   --   I'm subscribed
>>> _______________________________________________
>>> drbd-user mailing list
>>> drbd-user at lists.linbit.com
>>> http://lists.linbit.com/mailman/listinfo/drbd-user