Thank you for the answer, Lars, but yesterday I solved it without the outdate-peer handler, by adding the "after-sb-1pri discard-secondary;" directive to my drbd.conf. This is my drbd.conf:

    resource ovHA {
      protocol C;
      startup {
        wfc-timeout      60;
        degr-wfc-timeout 120;
      }
      disk {
        on-io-error detach;
      }
      net {
        timeout        80;   # unit: 0.1 seconds
        connect-int    10;   # unit: seconds
        ping-int       10;   # unit: seconds
        ko-count       4;
        max-buffers    4096;
        max-epoch-size 2048;
        after-sb-0pri  discard-older-primary;
        after-sb-1pri  discard-secondary;    # <- the directive that did it
      }
      syncer {
        rate 100M;
      }
      on OV-HA1 {
        device    /dev/drbd0;
        disk      /dev/hda2;
        address   192.168.0.58:8000;
        meta-disk internal;
      }
      on OV-HA2 {
        device    /dev/drbd0;
        disk      /dev/hda2;
        address   192.168.0.59:8000;
        meta-disk internal;
      }
    }

This scenario is for test purposes; in production I will obviously have 2 ethernet links :)

Cheers,
Matteo.

Lars Ellenberg wrote:
> On Mon, Oct 15, 2007 at 04:00:17PM +0200, Matteo Campana wrote:
>
>> Hi all,
>>
>> following the example in Florian's post
>> (http://fghaas.wordpress.com/2007/10/01/an-underrated-cluster-admins-companion-dopd/)
>> I'm testing the outdate-peer plugin.
>>
>> My scenario: two Debian machines (OV-HA1 primary, OV-HA2 secondary),
>> heartbeat+drbd, 1 ethernet + 1 serial cable (the ethernet is used both
>> for drbd replication and to expose services).
>> I also know that a dedicated ethernet connection between the two nodes
>> is recommended for drbd data synchronization, but for testing this is
>> the scenario :).
>> Heartbeat is configured with ipfail, so when the ethernet connection
>> goes down, heartbeat migrates the services to the other node.
>>
>> Obviously in this configuration the trouble appears when I unplug the
>> OV-HA1 (primary) link: I'm testing the outdate-peer daemon as I read in
>> your post, because without this plugin the secondary becomes primary
>> (and this is OK), but when I reconnect the ethernet the 2 nodes are
>> "standalone" and do not re-synchronize their drbd partitions (this is
>> the "drbd split brain" case).
>> Now with your post's configuration:
>>
>> . in OV-HA2's ha-log I see this warning: WARN: check_drbd_peer: drbd
>>   peer OV-HA1 was not found;
>> . however, the plugin seems to work, because my OV-HA2 is now outdated;
>> . after the log message above, I see in OV-HA2's ha-log:
>>
>>   ResourceManager[6217]: 2007/10/15_14:54:47 ERROR: Return code 20 from
>>   /etc/ha.d/resource.d/drbddisk
>>   ResourceManager[6217]: 2007/10/15_14:54:47 CRIT: Giving up resources
>>   due to failure of drbddisk::ovHA
>>
>> . investigating the syslog I see that OV-HA2 fails to become primary:
>>
>>   Oct 15 14:54:47 localhost kernel: drbd0: State change failed: Refusing
>>   to be Primary without at least one UpToDate disk
>>   Oct 15 14:54:47 localhost kernel: drbd0: state = { cs:WFConnection
>>   st:Secondary/Unknown ds:Outdated/DUnknown r--- }
>>   Oct 15 14:54:47 localhost kernel: drbd0: wanted = { cs:WFConnection
>>   st:Primary/Unknown ds:Outdated/DUnknown r--- }
>>   Oct 15 14:54:47 localhost kernel: ttyS0: 1 input overrun(s)
>>   Oct 15 14:54:47 localhost ResourceManager[6217]: debug:
>>   /etc/ha.d/resource.d/drbddisk ovHA start done. RC=20
>>   Oct 15 14:54:47 localhost ResourceManager[6217]: ERROR: Return code 20
>>   from /etc/ha.d/resource.d/drbddisk
>>   Oct 15 14:54:47 localhost ResourceManager[6217]: CRIT: Giving up
>>   resources due to failure of drbddisk::ovHA
>>
>> Is it correct that now in my scenario:
>>
>> . the plugin outdates the secondary when the ethernet fails;
>> . the secondary fails to become primary because it is now marked as
>>   "Outdated" :)
>>
>> Is there a solution?
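[The kernel lines quoted above carry the whole explanation in their cs:/st:/ds: fields. As an illustration only — the state line is copied from the log above, but the field extraction and the promotion check below are an editor's sketch, not drbd code — this is the rule the kernel is enforcing:]

```shell
# Pick the connection state (cs), role (st) and local disk state (ds) out
# of the kernel log line quoted above, then apply the rule drbd states:
# "Refusing to be Primary without at least one UpToDate disk".
line='drbd0: state = { cs:WFConnection st:Secondary/Unknown ds:Outdated/DUnknown r--- }'
cs=$(echo "$line"   | sed -n 's/.*cs:\([A-Za-z]*\).*/\1/p')
role=$(echo "$line" | sed -n 's/.*st:\([A-Za-z]*\)\/.*/\1/p')
disk=$(echo "$line" | sed -n 's/.*ds:\([A-Za-z]*\)\/.*/\1/p')
if [ "$disk" = "UpToDate" ]; then
    verdict="promotion possible"
else
    verdict="refused: local disk is $disk, not UpToDate"
fi
echo "cs=$cs role=$role disk=$disk -> $verdict"
```

[This is exactly the bind described in the question: dopd did its job and marked OV-HA2's disk Outdated, and an Outdated disk is precisely what blocks the promotion that heartbeat then attempts.]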
>>
>
> very specific for exactly your scenario as I understand it:
> it is called "suicide".
> Implementations of that can be found in e.g. OCFS2.
> When you lose outside connectivity, your setup implies you lost
> data replication as well,
> so you can safely commit suicide.
>
> In the drbd outdate-peer handler,
> instead of trying to outdate the peer,
> shoot yourself in the head.
>
> You could also try to let heartbeat do the suicide for you;
> it already has a few scenarios where it does that (e.g. repeated
> failed stops).
>
> Something like
>   "echo 1 > /proc/sys/kernel/sysrq; echo o > /proc/sysrq-trigger;"
> should do the trick.
>
> But I really recommend fixing the deployment instead.
>
> :)
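[Lars's one-liner can be wrapped into a handler script. A minimal sketch under loud assumptions: the script shape and the DRY_RUN guard are an editor's additions, not part of drbd or dopd, and on a real node such a script would be wired in as the outdate-peer handler in place of the outdating logic. As written it defaults to a dry run, so it only prints what it would do:]

```shell
#!/bin/sh
# Sketch of the "suicide" handler Lars describes: instead of outdating the
# peer, power off the local node via magic sysrq. DRY_RUN defaults to 1 so
# this sketch is safe to execute; a real deployment would set DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

if [ "$DRY_RUN" = "1" ]; then
    msg="dry-run: would enable sysrq and power off this node"
else
    echo 1 > /proc/sys/kernel/sysrq   # enable the magic sysrq interface
    echo o > /proc/sysrq-trigger      # 'o' = immediate power-off
    msg="power-off triggered"
fi
echo "$msg"
```

[The design point is the one Lars makes: in this single-link setup, losing outside connectivity implies replication is lost too, so killing the disconnected node is safe — but fixing the deployment (a dedicated replication link) remains the better answer.]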