Thank you for the answer Lars, but yesterday I solved this without the
outdate-peer handler, by adding the "after-sb-1pri discard-secondary;"
directive to my drbd.conf.
This is my drbd.conf:
resource ovHA {
    protocol C;
    startup { wfc-timeout 60; degr-wfc-timeout 120; }
    disk {
        on-io-error detach;
    }
    net {
        timeout 80;       # unit: 0.1 seconds
        connect-int 10;   # unit: seconds
        ping-int 10;      # unit: seconds
        ko-count 4;
        max-buffers 4096;
        max-epoch-size 2048;
        after-sb-0pri discard-older-primary;
        after-sb-1pri discard-secondary;   # <- the directive I added
    }
    syncer {
        rate 100M;
    }
    on OV-HA1 {
        device /dev/drbd0;
        disk /dev/hda2;
        address 192.168.0.58:8000;
        meta-disk internal;
    }
    on OV-HA2 {
        device /dev/drbd0;
        disk /dev/hda2;
        address 192.168.0.59:8000;
        meta-disk internal;
    }
}
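For reference, the after-sb policy above automates what would otherwise be a manual split-brain recovery with drbdadm (the documented drbd 8 procedure: discard the victim's changes, then reconnect). A minimal sketch; the DRBDADM variable defaulting to "echo drbdadm" is my addition so the sketch only prints, and the resource name ovHA comes from the config above:

```shell
#!/bin/sh
# Manual drbd 8 split-brain recovery sketch for resource "ovHA".
# DRBDADM defaults to "echo drbdadm" so this only prints the commands;
# set DRBDADM=drbdadm to run them for real on the appropriate node.
DRBDADM="${DRBDADM:-echo drbdadm}"

# On the split-brain victim (the node whose local changes are thrown away):
$DRBDADM secondary ovHA
$DRBDADM -- --discard-my-data connect ovHA

# On the surviving node, if it has also dropped to StandAlone, reconnect:
$DRBDADM connect ovHA
```

With discard-secondary configured, drbd performs the equivalent resolution automatically when the nodes reconnect after a 1-primary split brain.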
This scenario is for testing purposes; in production I will obviously have
2 ethernet links :)
Cheers,
Matteo.
Lars Ellenberg wrote:
> On Mon, Oct 15, 2007 at 04:00:17PM +0200, Matteo Campana wrote:
>
>> Hi all,
>>
>> following the example in Florian's post (http://fghaas.wordpress.com/2007
>> /10/01/an-underrated-cluster-admins-companion-dopd/), I'm testing the
>> outdate-peer plugin.
>>
>> My scenario: two Debian machines (OV-HA1 primary, OV-HA2 secondary),
>> heartbeat+drbd, 1 ethernet + 1 serial cable (the ethernet is used both for drbd
>> replication and to expose services).
>> I also know that a dedicated ethernet connection between the two nodes is
>> recommended for drbd data synchronization, but for testing this is the
>> scenario :).
>> Heartbeat is configured with ipfail, so when the ethernet connection goes
>> down, heartbeat migrates the services to the other node.
>>
>> Obviously in this configuration the trouble appears when I unplug the OV-HA1
>> (primary) link: I'm testing the outdate-peer daemon as I read in your post,
>> because without this plugin the secondary becomes primary (and this is OK),
>> but when I reconnect the ethernet the 2 nodes are "standalone" and do not
>> re-synchronize their drbd partitions (this is the case of "drbd split brain").
>> Now with your post's configuration:
>>
>> . in OV-HA2's ha-log I see this warning WARN: check_drbd_peer: drbd peer
>> OV-HA1 was not found;
>> . however the plugin seems to work, because my OV-HA2 is now outdated;
>> . after the log message above, I see in OV-HA2's ha-log:
>> ResourceManager[6217]: 2007/10/15_14:54:47 ERROR: Return code 20 from /etc
>> /ha.d/resource.d/drbddisk
>> ResourceManager[6217]: 2007/10/15_14:54:47 CRIT: Giving up resources due
>> to failure of drbddisk::ovHA
>> . investigating the syslog I see that OV-HA2 fails to become primary
>>
>> Oct 15 14:54:47 localhost
>> kernel: drbd0: State change failed: Refusing to be Primary without at least
>> one UpToDate disk
>> Oct 15 14:54:47 localhost kernel: drbd0: state = { cs:WFConnection
>> st:Secondary/Unknown ds:Outdated/DUnknown r--- }
>> Oct 15 14:54:47 localhost kernel: drbd0: wanted = { cs:WFConnection
>> st:Primary/Unknown ds:Outdated/DUnknown r--- }
>> Oct 15 14:54:47 localhost kernel: ttyS0: 1 input overrun(s)
>> Oct 15 14:54:47 localhost ResourceManager[6217]: debug: /etc/ha.d/
>> resource.d/drbddisk ovHA start done. RC=20
>> Oct 15 14:54:47 localhost ResourceManager[6217]: ERROR: Return code 20 from
>> /etc/ha.d/resource.d/drbddisk
>> Oct 15 14:54:47 localhost ResourceManager[6217]: CRIT: Giving up resources
>> due to failure of drbddisk::ovHA
>>
>> Is it correct that now in my scenario:
>>
>> . the plugin outdates the secondary when the ethernet fails;
>> . the secondary fails to become primary because it is now marked as
>> "outdated" :)
>>
>>
>> Is there a solution?
>>
>
>
> very specific for exactly your scenario as I understand it:
> it is called "suicide".
> implementations of that can be found in e.g. OCFS2.
> when you lose outside connectivity, your setup implies you lost
> data-replication as well.
> so you can safely commit suicide.
>
> in the drbd outdate peer handler,
> instead of trying to outdate the peer,
> shoot yourself in the head.
>
> you could also try to let heartbeat do the suicide for you,
> it already has a few scenarios where it does it (e.g. repeated failed stops).
>
> something like
> "echo 1 > /proc/sys/kernel/sysrq; echo o > /proc/sysrq-trigger;"
> should do the trick.
>
>
> but I really recommend fixing the deployment instead.
>
> :)
>
>
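The suicide handler Lars describes could be wired in as the outdate-peer handler in drbd.conf. A minimal sketch, assuming a Linux kernel with Magic SysRq support; the run() wrapper and the DRY_RUN switch are my illustration (not part of drbd), and the script defaults to dry-run so it is safe to try:

```shell
#!/bin/sh
# "Suicide" handler sketch: instead of outdating the peer, the node powers
# itself off via Magic SysRq. DRY_RUN defaults to 1, so by default the
# script only prints what it would do; set DRY_RUN=0 on a real node.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $1"
    else
        eval "$1"
    fi
}

# Magic SysRq must be enabled before 'o' (immediate power-off) is honoured.
run "echo 1 > /proc/sys/kernel/sysrq"
run "echo o > /proc/sysrq-trigger"
```

Since the node never returns from a real power-off, the outdate step on the peer becomes irrelevant, which is the point of the approach.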