[DRBD-user] Automatic recovery with dopd after split brain

Pedro Sousa pgsousa at gmail.com
Thu May 28 19:49:12 CEST 2009

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Here are the logs from ha2 covering the unplug and replug. Can you check
them, please?

May 27 19:38:35 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast [-1] packet(len=217): No such device
May 27 19:38:35 ha2 heartbeat: [2426]: ERROR: write_child: write failure on bcast eth1.: No such device
May 27 19:38:37 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast [-1] packet(len=217): No such device
May 27 19:38:37 ha2 heartbeat: [2426]: ERROR: write_child: write failure on bcast eth1.: No such device
May 27 19:38:38 ha2 kernel: drbd0: PingAck did not arrive in time.
May 27 19:38:38 ha2 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
May 27 19:38:38 ha2 kernel: drbd0: asender terminated
May 27 19:38:38 ha2 kernel: drbd0: Terminating asender thread
May 27 19:38:38 ha2 kernel: drbd0: short read expecting header on sock: r=-512
May 27 19:38:38 ha2 kernel: drbd0: Writing meta data super block now.
May 27 19:38:38 ha2 kernel: drbd0: tl_clear()
May 27 19:38:38 ha2 kernel: drbd0: Connection closed
May 27 19:38:38 ha2 kernel: drbd0: conn( NetworkFailure -> Unconnected )
May 27 19:38:38 ha2 kernel: drbd0: receiver terminated
May 27 19:38:38 ha2 kernel: drbd0: receiver (re)started
May 27 19:38:38 ha2 kernel: drbd0: conn( Unconnected -> WFConnection )
May 27 19:38:38 ha2 kernel: drbd0: Unable to bind source sock (-99)
May 27 19:38:38 ha2 last message repeated 2 times
May 27 19:38:38 ha2 kernel: drbd0: Unable to bind sock2 (-99)
May 27 19:38:38 ha2 kernel: drbd0: conn( WFConnection -> Disconnecting )
May 27 19:38:38 ha2 kernel: drbd0: Discarding network configuration.
May 27 19:38:38 ha2 kernel: drbd0: tl_clear()
May 27 19:38:38 ha2 kernel: drbd0: Connection closed
May 27 19:38:38 ha2 kernel: drbd0: conn( Disconnecting -> StandAlone )
May 27 19:38:38 ha2 kernel: drbd0: receiver terminated
May 27 19:38:38 ha2 kernel: drbd0: Terminating receiver thread
May 27 19:38:39 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast [-1] packet(len=217): No such device
May 27 19:38:39 ha2 heartbeat: [2426]: ERROR: write_child: write failure on bcast eth1.: No such device
May 27 19:38:40 ha2 kernel: drbd0: disk( UpToDate -> Outdated )
May 27 19:38:40 ha2 kernel: drbd0: Writing meta data super block now.
May 27 19:38:40 ha2 /usr/lib/heartbeat/dopd: [2513]: info: sending return code: 4, ha2.teste.local -> ha1.teste.local
May 27 19:38:41 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast [-1] packet(len=310): No such device
May 27 19:38:41 ha2 heartbeat: [2426]: ERROR: write_child: write failure on bcast eth1.: No such device
May 27 19:38:41 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast [-1] packet(len=217): No such device
May 27 19:38:41 ha2 heartbeat: [2426]: ERROR: write_child: write failure on bcast eth1.: No such device
May 27 19:38:43 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast [-1] packet(len=217): No such device
May 27 19:38:43 ha2 heartbeat: [2426]: ERROR: write_child: write failure on bcast eth1.: No such device
May 27 19:38:45 ha2 heartbeat: [2408]: info: Link ha1.teste.local:eth1 dead.
May 27 19:38:45 ha2 ipfail: [2514]: info: Link Status update: Link ha1.teste.local/eth1 now has status dead
May 27 19:38:45 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast [-1] packet(len=217): No such device
May 27 19:38:45 ha2 heartbeat: [2426]: ERROR: write_child: write failure on bcast eth1.: No such device
May 27 19:38:46 ha2 ipfail: [2514]: info: Asking other side for ping node count.
May 27 19:38:46 ha2 ipfail: [2514]: info: Checking remote count of ping nodes.
May 27 19:38:46 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast [-1] packet(len=223): No such device
May 27 19:38:46 ha2 heartbeat: [2426]: ERROR: write_child: write failure on bcast eth1.: No such device
May 27 19:38:46 ha2 heartbeat: [2426]: WARN: Temporarily Suppressing write error messages
May 27 19:38:46 ha2 heartbeat: [2426]: WARN: Is a cable unplugged on bcast eth1?
May 27 19:38:47 ha2 ipfail: [2514]: info: Ping node count is balanced.
May 27 19:38:48 ha2 ipfail: [2514]: info: No giveup timer to abort.
May 27 19:39:06 ha2 kernel: eth1: link up
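
For what it's worth, my own reading of the above: the "Unable to bind
source sock (-99)" lines are -EADDRNOTAVAIL (errno 99), i.e. the local IP
on eth1 was unavailable while the link was down, so drbd0 gave up
connecting and went WFConnection -> Disconnecting -> StandAlone; and the
dopd "return code: 4" is the fence-peer convention for "disk successfully
outdated". Once eth1 is back up, this is the manual recovery I run on the
StandAlone node (a sketch; I only have the one resource, so "all" is
enough here):

    # on ha2, after "eth1: link up"
    cat /proc/drbd        # confirm cs:StandAlone, ds:Outdated
    drbdadm adjust all    # reapply the net config from drbd.conf and reconnect
    cat /proc/drbd        # should reach cs:Connected, resyncing if needed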

Regards,
Pedro Sousa



On Thu, May 28, 2009 at 4:51 PM, Lars Ellenberg
<lars.ellenberg at linbit.com> wrote:

> On Thu, May 28, 2009 at 01:46:43PM +0100, Pedro Sousa wrote:
> > Hi,
> >
> > I'm testing split-brain in a master/slave scenario with dopd, and I
> > have some questions about the automatic recovery procedure. The steps
> > I took were:
> >
> > 1. Unplug the crossover cable.
> >
> > Master:
> >
> > Primary/Unknown ds:UpToDate/Outdated
> >
> > Slave:
> >
> > StandAlone ro:Secondary/Unknown ds:Consistent/DUnknown
> >
> > 2. Plug the cable back in:
> >
> > Both nodes remain in the same state: UpToDate/Outdated and
> > Consistent/DUnknown
> >
> > My question is: shouldn't the slave rejoin/resync with the master
> > automatically after I plug the cable back in?
> >
> > I have to run "drbdadm adjust all" manually to recover it.
>
> Once a node reaches "StandAlone",
> you have to tell it to try and reconnect, yes.
>
> So this is how it is supposed to be.
>
> Why it goes to "StandAlone" should be in the logs.
>
> > My config (CentOS 5.3; drbd 8.3.1; heartbeat 2.99):
> >
> > /etc/drbd.conf
>
> </snip>
>
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> __
> please don't Cc me, but send to list   --   I'm subscribed
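
In case it helps anyone reading along: the usual dopd wiring for drbd 8.3
with heartbeat looks like the excerpt below. This is a generic
illustration (the resource name r0 is a placeholder), not my snipped
config:

    # /etc/drbd.conf (excerpt)
    resource r0 {
      disk {
        fencing resource-only;   # let the fence-peer handler outdate the peer
      }
      handlers {
        # dopd's outdater, shipped with heartbeat
        fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
      }
    }

    # /etc/ha.d/ha.cf (excerpt)
    respawn hacluster /usr/lib/heartbeat/dopd
    apiauth dopd gid=haclient uid=hacluster

With that in place, the reconnect Lars describes is a single command on
the StandAlone node: "drbdadm connect all" (or the "drbdadm adjust all" I
used, which also reapplies the rest of the configuration).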