[DRBD-user] Automatic recover dopd after split brain recover

Stefan Seifert nine at detonation.org
Wed Jun 3 12:59:17 CEST 2009

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Wednesday 03 June 2009 12:40:23 Pedro Sousa wrote:
> Hi,
>
> can you help me with this? I can't figure it out why it goes "StandAlone".

If you read the log messages, you'll see the reason is quite obvious:
ERROR: write_child: write failure on bcast eth1.: No such device

You expect drbd to communicate over a device that does not (yet) exist. It
can't, so it outdates itself.

Fix your init scripts!
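
For example (a sketch only, not from the original thread; eth1 is the bcast
device from your logs, the 30 second timeout and the wrapper approach are my
own assumptions), check that the network init script is ordered before
heartbeat, and/or wait for the bcast link to exist before heartbeat is started:

  # check init ordering on CentOS 5: network has to come up before heartbeat
  chkconfig --list | grep -E 'network|heartbeat'

  # wait up to 30 seconds for the bcast link (eth1 here), e.g. from a small
  # wrapper around the heartbeat init script
  for i in $(seq 1 30); do
      ip link show eth1 >/dev/null 2>&1 && break
      sleep 1
  done
  /etc/init.d/heartbeat start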

http://www.drbd.org/users-guide/s-resolve-split-brain.html
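
If it really were a split brain (both nodes modified data while disconnected),
the manual recovery described on that page boils down to discarding the
changes on one node. Roughly (a sketch only; "r0" stands in for your resource
name, and you have to be sure which node is the victim):

  # on the split brain victim, whose local modifications will be thrown away
  drbdadm secondary r0
  drbdadm -- --discard-my-data connect r0

  # on the surviving node, if it has also dropped to StandAlone
  drbdadm connect r0

In your case the secondary is merely Outdated and StandAlone, so once eth1 is
actually up, a plain "drbdadm connect all" (or the "drbdadm adjust all" you
already use) should be enough to make it reconnect and resync.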

> Regards,
> Pedro Sousa
>
> On Thu, May 28, 2009 at 6:49 PM, Pedro Sousa <pgsousa at gmail.com> wrote:
> > Can you check it please?
> >
> > May 27 19:38:35 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
> > [-1] packet(len=217): No such device
> > May 27 19:38:35 ha2 heartbeat: [2426]: ERROR: write_child: write failure
> > on bcast eth1.: No such device
> > May 27 19:38:37 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
> > [-1] packet(len=217): No such device
> > May 27 19:38:37 ha2 heartbeat: [2426]: ERROR: write_child: write failure
> > on bcast eth1.: No such device
> > May 27 19:38:38 ha2 kernel: drbd0: PingAck did not arrive in time.
> > May 27 19:38:38 ha2 kernel: drbd0: peer( Primary -> Unknown ) conn(
> > Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> > May 27 19:38:38 ha2 kernel: drbd0: asender terminated
> > May 27 19:38:38 ha2 kernel: drbd0: Terminating asender thread
> > May 27 19:38:38 ha2 kernel: drbd0: short read expecting header on sock:
> > r=-512
> > May 27 19:38:38 ha2 kernel: drbd0: Writing meta data super block now.
> > May 27 19:38:38 ha2 kernel: drbd0: tl_clear()
> > May 27 19:38:38 ha2 kernel: drbd0: Connection closed
> > May 27 19:38:38 ha2 kernel: drbd0: conn( NetworkFailure -> Unconnected )
> > May 27 19:38:38 ha2 kernel: drbd0: receiver terminated
> > May 27 19:38:38 ha2 kernel: drbd0: receiver (re)started
> > May 27 19:38:38 ha2 kernel: drbd0: conn( Unconnected -> WFConnection )
> > May 27 19:38:38 ha2 kernel: drbd0: Unable to bind source sock (-99)
> > May 27 19:38:38 ha2 last message repeated 2 times
> > May 27 19:38:38 ha2 kernel: drbd0: Unable to bind sock2 (-99)
> > May 27 19:38:38 ha2 kernel: drbd0: conn( WFConnection -> Disconnecting )
> > May 27 19:38:38 ha2 kernel: drbd0: Discarding network configuration.
> > May 27 19:38:38 ha2 kernel: drbd0: tl_clear()
> > May 27 19:38:38 ha2 kernel: drbd0: Connection closed
> > May 27 19:38:38 ha2 kernel: drbd0: conn( Disconnecting -> StandAlone )
> > May 27 19:38:38 ha2 kernel: drbd0: receiver terminated
> > May 27 19:38:38 ha2 kernel: drbd0: Terminating receiver thread
> > May 27 19:38:39 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
> > [-1] packet(len=217): No such device
> > May 27 19:38:39 ha2 heartbeat: [2426]: ERROR: write_child: write failure
> > on bcast eth1.: No such device
> > May 27 19:38:40 ha2 kernel: drbd0: disk( UpToDate -> Outdated )
> > May 27 19:38:40 ha2 kernel: drbd0: Writing meta data super block now.
> > May 27 19:38:40 ha2 /usr/lib/heartbeat/dopd: [2513]: info: sending return
> > code: 4, ha2.teste.local -> ha1.teste.local
> > May 27 19:38:41 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
> > [-1] packet(len=310): No such device
> > May 27 19:38:41 ha2 heartbeat: [2426]: ERROR: write_child: write failure
> > on bcast eth1.: No such device
> > May 27 19:38:41 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
> > [-1] packet(len=217): No such device
> > May 27 19:38:41 ha2 heartbeat: [2426]: ERROR: write_child: write failure
> > on bcast eth1.: No such device
> > May 27 19:38:43 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
> > [-1] packet(len=217): No such device
> > May 27 19:38:43 ha2 heartbeat: [2426]: ERROR: write_child: write failure
> > on bcast eth1.: No such device
> > May 27 19:38:45 ha2 heartbeat: [2408]: info: Link ha1.teste.local:eth1
> > dead.
> > May 27 19:38:45 ha2 ipfail: [2514]: info: Link Status update: Link
> > ha1.teste.local/eth1 now has status dead
> > May 27 19:38:45 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
> > [-1] packet(len=217): No such device
> > May 27 19:38:45 ha2 heartbeat: [2426]: ERROR: write_child: write failure
> > on bcast eth1.: No such device
> > May 27 19:38:46 ha2 ipfail: [2514]: info: Asking other side for ping node
> > count.
> > May 27 19:38:46 ha2 ipfail: [2514]: info: Checking remote count of ping
> > nodes.
> > May 27 19:38:46 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
> > [-1] packet(len=223): No such device
> > May 27 19:38:46 ha2 heartbeat: [2426]: ERROR: write_child: write failure
> > on bcast eth1.: No such device
> > May 27 19:38:46 ha2 heartbeat: [2426]: WARN: Temporarily Suppressing
> > write error messages
> > May 27 19:38:46 ha2 heartbeat: [2426]: WARN: Is a cable unplugged on
> > bcast eth1?
> > May 27 19:38:47 ha2 ipfail: [2514]: info: Ping node count is balanced.
> > May 27 19:38:48 ha2 ipfail: [2514]: info: No giveup timer to abort.
> > May 27 19:39:06 ha2 kernel: eth1: link up
> >
> > Regards,
> > Pedro Sousa
> >
> >
> >
> >
> > On Thu, May 28, 2009 at 4:51 PM, Lars Ellenberg
> > <lars.ellenberg at linbit.com> wrote:
> >>
> >> On Thu, May 28, 2009 at 01:46:43PM +0100, Pedro Sousa wrote:
> >> > Hi,
> >> >
> >> > I'm testing split-brain in a master/slave scenario with dopd and have
> >> > some doubts about the automatic recovery procedure. The steps I took were:
> >> >
> >> > 1. Unplug the crossover cable
> >> >
> >> > Master:
> >> >
> >> > Primary/Unknown ds:UpToDate/Outdated
> >> >
> >> > Slave:
> >> >
> >> > StandAlone ro:Secondary/Unknown ds:Consistent/DUnknown
> >> >
> >> > 2. Plug the cable back in:
> >> >
> >> > Both nodes remain in the same state: UpToDate/Outdated and
> >> > Consistent/DUnknown
> >> >
> >> > My question is: shouldn't the slave rejoin/resync to the master
> >> > automatically after I plug the cable?
> >> >
> >> > I have to manually run "drbdadm adjust all" to recover it.
> >>
> >> once a node reaches "StandAlone",
> >> you have to tell it to try and reconnect, yes.
> >>
> >> so this is how it is supposed to be.
> >>
> >> why it goes to "StandAlone" should be in the logs.
> >>
> >> > My conf (centos 5.3; drbd 8.3.1; heartbeat 2.99)
> >> >
> >> > /etc/drbd.conf
> >>
> >> </snip>
> >>
> >>
> >> --
> >>
> >> : Lars Ellenberg
> >> : LINBIT | Your Way to High Availability
> >> : DRBD/HA support and consulting http://www.linbit.com
> >>
> >> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> >> __
> >> please don't Cc me, but send to list   --   I'm subscribed
> >> _______________________________________________
> >> drbd-user mailing list
> >> drbd-user at lists.linbit.com
> >> http://lists.linbit.com/mailman/listinfo/drbd-user

