Thank you for the answer Lars, but yesterday I solved this without the
outdate-peer handler, by adding the "after-sb-1pri discard-secondary;"
directive to my drbd.conf.
This is my drbd.conf:
resource ovHA {
    protocol C;
    startup { wfc-timeout 60; degr-wfc-timeout 120; }
    disk {
        on-io-error detach;
    }
    net {
        timeout 80;       # unit: 0.1 seconds
        connect-int 10;   # unit: seconds
        ping-int 10;      # unit: seconds
        ko-count 4;
        max-buffers 4096;
        max-epoch-size 2048;
        after-sb-0pri discard-older-primary;
        after-sb-1pri discard-secondary;   # <- the directive I added
    }
    syncer {
        rate 100M;
    }
    on OV-HA1 {
        device /dev/drbd0;
        disk /dev/hda2;
        address 192.168.0.58:8000;
        meta-disk internal;
    }
    on OV-HA2 {
        device /dev/drbd0;
        disk /dev/hda2;
        address 192.168.0.59:8000;
        meta-disk internal;
    }
}
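For reference, the after-sb policy above automates what would otherwise be a manual split-brain recovery with drbdadm (the documented drbd 8 procedure: discard the victim's changes, then reconnect). A minimal sketch; the DRBDADM variable defaulting to "echo drbdadm" is my addition so the sketch only prints, and the resource name ovHA comes from the config above:

```shell
#!/bin/sh
# Manual drbd 8 split-brain recovery sketch for resource "ovHA".
# DRBDADM defaults to "echo drbdadm" so this only prints the commands;
# set DRBDADM=drbdadm to run them for real on the appropriate node.
DRBDADM="${DRBDADM:-echo drbdadm}"

# On the split-brain victim (the node whose local changes are thrown away):
$DRBDADM secondary ovHA
$DRBDADM -- --discard-my-data connect ovHA

# On the surviving node, if it has also dropped to StandAlone, reconnect:
$DRBDADM connect ovHA
```

With discard-secondary configured, drbd performs the equivalent resolution automatically when the nodes reconnect after a 1-primary split brain.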
This scenario is for testing purposes; in production I will obviously have
2 ethernet links :)
Cheers,
Matteo.
Lars Ellenberg wrote:
> On Mon, Oct 15, 2007 at 04:00:17PM +0200, Matteo Campana wrote:
>
>> Hi all,
>>
>> following the example in Florian's post (http://fghaas.wordpress.com/2007
>> /10/01/an-underrated-cluster-admins-companion-dopd/), I'm testing the
>> outdate-peer plugin.
>>
>> My scenario: two Debian machines (OV-HA1 primary, OV-HA2 secondary),
>> heartbeat+drbd, 1 ethernet + 1 serial cable (the ethernet is used both for drbd
>> replication and to expose services).
>> I also know that a dedicated ethernet connection between the two nodes is
>> recommended for drbd data synchronization, but for testing this is the
>> scenario :).
>> Heartbeat is configured with ipfail, so when the ethernet connection goes
>> down, heartbeat migrates the services to the other node.
>>
>> Obviously in this configuration the trouble appears when I unplug the OV-HA1
>> (primary) link: I'm testing the outdate-peer daemon as I read in your post,
>> because without this plugin the secondary becomes primary (and this is OK),
>> but when I reconnect the ethernet the 2 nodes are "standalone" and do not
>> re-synchronize their drbd partitions (this is the case of "drbd split brain").
>> Now with your post's configuration:
>>
>> . in OV-HA2's ha-log I see this warning WARN: check_drbd_peer: drbd peer
>> OV-HA1 was not found;
>> . however the plugin seems to work, because my OV-HA2 is now outdated;
>> . after the log message above, I see in OV-HA2's ha-log:
>> ResourceManager[6217]: 2007/10/15_14:54:47 ERROR: Return code 20 from /etc
>> /ha.d/resource.d/drbddisk
>> ResourceManager[6217]: 2007/10/15_14:54:47 CRIT: Giving up resources due
>> to failure of drbddisk::ovHA
>> . investigating the syslog I see that OV-HA2 fails to become primary
>>
>> Oct 15 14:54:47 localhost
>> kernel: drbd0: State change failed: Refusing to be Primary without at least
>> one UpToDate disk
>> Oct 15 14:54:47 localhost kernel: drbd0: state = { cs:WFConnection
>> st:Secondary/Unknown ds:Outdated/DUnknown r--- }
>> Oct 15 14:54:47 localhost kernel: drbd0: wanted = { cs:WFConnection
>> st:Primary/Unknown ds:Outdated/DUnknown r--- }
>> Oct 15 14:54:47 localhost kernel: ttyS0: 1 input overrun(s)
>> Oct 15 14:54:47 localhost ResourceManager[6217]: debug: /etc/ha.d/
>> resource.d/drbddisk ovHA start done. RC=20
>> Oct 15 14:54:47 localhost ResourceManager[6217]: ERROR: Return code 20 from
>> /etc/ha.d/resource.d/drbddisk
>> Oct 15 14:54:47 localhost ResourceManager[6217]: CRIT: Giving up resources
>> due to failure of drbddisk::ovHA
>>
>> Is it correct that now in my scenario:
>>
>> . the plugin outdates the secondary when the ethernet fails;
>> . the secondary fails to become primary because it is now marked as
>> "outdated" :)
>>
>>
>> Is there a solution?
>>
>
>
> very specific for exactly your scenario as I understand it:
> it is called "suicide".
> implementations of that can be found in e.g. OCFS2.
> when you lose outside connectivity, your setup implies you lost
> data-replication as well.
> so you can safely commit suicide.
>
> in the drbd outdate peer handler,
> instead of trying to outdate the peer,
> shoot yourself in the head.
>
> you could also try to let heartbeat do the suicide for you,
> it already has a few scenarios where it does it (e.g. repeated failed stops).
>
> something like
> "echo 1 > /proc/sys/kernel/sysrq; echo o > /proc/sysrq-trigger;"
> should do the trick.
>
>
> but I really recommend fixing the deployment instead.
>
> :)
>
>
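The suicide handler Lars describes could be wired in as the outdate-peer handler in drbd.conf. A minimal sketch, assuming a Linux kernel with Magic SysRq support; the run() wrapper and the DRY_RUN switch are my illustration (not part of drbd), and the script defaults to dry-run so it is safe to try:

```shell
#!/bin/sh
# "Suicide" handler sketch: instead of outdating the peer, the node powers
# itself off via Magic SysRq. DRY_RUN defaults to 1, so by default the
# script only prints what it would do; set DRY_RUN=0 on a real node.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $1"
    else
        eval "$1"
    fi
}

# Magic SysRq must be enabled before 'o' (immediate power-off) is honoured.
run "echo 1 > /proc/sys/kernel/sysrq"
run "echo o > /proc/sysrq-trigger"
```

Since the node never returns from a real power-off, the outdate step on the peer becomes irrelevant, which is the point of the approach.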