[DRBD-user] Automatic recover dopd after split brain recover

Maros Timko timkom at gmail.com
Wed Jun 3 22:26:29 CEST 2009

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi Pedro,

sorry, I am not following the thread from the beginning, but from the log excerpt
we can see that your eth1 NIC disappeared for about half a minute (from the first
errors at 19:38:35 until "eth1: link up" at 19:39:06). You should check the whole
log file for what was happening at that time. Could it be a network service
restart? The "Unable to bind source sock (-99)" lines (errno 99, EADDRNOTAVAIL)
suggest the local address on eth1 went away, at which point DRBD discarded its
network configuration and dropped to StandAlone.
As Lars already answered, once a DRBD resource has gone StandAlone you have to
reconnect it manually. There are properties in drbd.conf that define how often
DRBD retries the connection and how long it waits before declaring the peer dead.
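For example, these knobs live in the resource's net { } section. This is only a
rough sketch for a DRBD 8.3-style drbd.conf with illustrative values, so check
drbd.conf(5) on your version before copying anything:

  net {
    connect-int   10;   # seconds between connect retries while in WFConnection
    ping-int      10;   # send a keep-alive packet after this many idle seconds
    ping-timeout   5;   # tenths of a second to wait for the PingAck
    timeout       60;   # tenths of a second before the peer is considered dead
    ko-count       0;   # if > 0, a peer stalling a write for ko-count * timeout
                        # is dropped and the primary goes StandAlone
  }

Note that none of these make DRBD reconnect a resource that is already
StandAlone; for that you still need drbdadm, as Lars says below.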
Tino
2009/6/3 Pedro Sousa <pgsousa at gmail.com>

> Hi,
>
> can you help me with this? I can't figure it out why it goes "StandAlone".
>
> Regards,
> Pedro Sousa
>
> On Thu, May 28, 2009 at 6:49 PM, Pedro Sousa <pgsousa at gmail.com> wrote:
>
>> Can you check it please?
>>
>> May 27 19:38:35 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
>> [-1] packet(len=217): No such device
>> May 27 19:38:35 ha2 heartbeat: [2426]: ERROR: write_child: write failure
>> on bcast eth1.: No such device
>> May 27 19:38:37 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
>> [-1] packet(len=217): No such device
>> May 27 19:38:37 ha2 heartbeat: [2426]: ERROR: write_child: write failure
>> on bcast eth1.: No such device
>> May 27 19:38:38 ha2 kernel: drbd0: PingAck did not arrive in time.
>> May 27 19:38:38 ha2 kernel: drbd0: peer( Primary -> Unknown ) conn(
>> Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>> May 27 19:38:38 ha2 kernel: drbd0: asender terminated
>> May 27 19:38:38 ha2 kernel: drbd0: Terminating asender thread
>> May 27 19:38:38 ha2 kernel: drbd0: short read expecting header on sock:
>> r=-512
>> May 27 19:38:38 ha2 kernel: drbd0: Writing meta data super block now.
>> May 27 19:38:38 ha2 kernel: drbd0: tl_clear()
>> May 27 19:38:38 ha2 kernel: drbd0: Connection closed
>> May 27 19:38:38 ha2 kernel: drbd0: conn( NetworkFailure -> Unconnected )
>> May 27 19:38:38 ha2 kernel: drbd0: receiver terminated
>> May 27 19:38:38 ha2 kernel: drbd0: receiver (re)started
>> May 27 19:38:38 ha2 kernel: drbd0: conn( Unconnected -> WFConnection )
>> May 27 19:38:38 ha2 kernel: drbd0: Unable to bind source sock (-99)
>> May 27 19:38:38 ha2 last message repeated 2 times
>> May 27 19:38:38 ha2 kernel: drbd0: Unable to bind sock2 (-99)
>> May 27 19:38:38 ha2 kernel: drbd0: conn( WFConnection -> Disconnecting )
>> May 27 19:38:38 ha2 kernel: drbd0: Discarding network configuration.
>> May 27 19:38:38 ha2 kernel: drbd0: tl_clear()
>> May 27 19:38:38 ha2 kernel: drbd0: Connection closed
>> May 27 19:38:38 ha2 kernel: drbd0: conn( Disconnecting -> StandAlone )
>> May 27 19:38:38 ha2 kernel: drbd0: receiver terminated
>> May 27 19:38:38 ha2 kernel: drbd0: Terminating receiver thread
>> May 27 19:38:39 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
>> [-1] packet(len=217): No such device
>> May 27 19:38:39 ha2 heartbeat: [2426]: ERROR: write_child: write failure
>> on bcast eth1.: No such device
>> May 27 19:38:40 ha2 kernel: drbd0: disk( UpToDate -> Outdated )
>> May 27 19:38:40 ha2 kernel: drbd0: Writing meta data super block now.
>> May 27 19:38:40 ha2 /usr/lib/heartbeat/dopd: [2513]: info: sending return
>> code: 4, ha2.teste.local -> ha1.teste.local
>> May 27 19:38:41 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
>> [-1] packet(len=310): No such device
>> May 27 19:38:41 ha2 heartbeat: [2426]: ERROR: write_child: write failure
>> on bcast eth1.: No such device
>> May 27 19:38:41 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
>> [-1] packet(len=217): No such device
>> May 27 19:38:41 ha2 heartbeat: [2426]: ERROR: write_child: write failure
>> on bcast eth1.: No such device
>> May 27 19:38:43 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
>> [-1] packet(len=217): No such device
>> May 27 19:38:43 ha2 heartbeat: [2426]: ERROR: write_child: write failure
>> on bcast eth1.: No such device
>> May 27 19:38:45 ha2 heartbeat: [2408]: info: Link ha1.teste.local:eth1
>> dead.
>> May 27 19:38:45 ha2 ipfail: [2514]: info: Link Status update: Link
>> ha1.teste.local/eth1 now has status dead
>> May 27 19:38:45 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
>> [-1] packet(len=217): No such device
>> May 27 19:38:45 ha2 heartbeat: [2426]: ERROR: write_child: write failure
>> on bcast eth1.: No such device
>> May 27 19:38:46 ha2 ipfail: [2514]: info: Asking other side for ping node
>> count.
>> May 27 19:38:46 ha2 ipfail: [2514]: info: Checking remote count of ping
>> nodes.
>> May 27 19:38:46 ha2 heartbeat: [2426]: ERROR: glib: Unable to send bcast
>> [-1] packet(len=223): No such device
>> May 27 19:38:46 ha2 heartbeat: [2426]: ERROR: write_child: write failure
>> on bcast eth1.: No such device
>> May 27 19:38:46 ha2 heartbeat: [2426]: WARN: Temporarily Suppressing write
>> error messages
>> May 27 19:38:46 ha2 heartbeat: [2426]: WARN: Is a cable unplugged on bcast
>> eth1?
>> May 27 19:38:47 ha2 ipfail: [2514]: info: Ping node count is balanced.
>> May 27 19:38:48 ha2 ipfail: [2514]: info: No giveup timer to abort.
>> May 27 19:39:06 ha2 kernel: eth1: link up
>>
>> Regards,
>> Pedro Sousa
>>
>>
>>
>>
>> On Thu, May 28, 2009 at 4:51 PM, Lars Ellenberg <
>> lars.ellenberg at linbit.com> wrote:
>>
>>> On Thu, May 28, 2009 at 01:46:43PM +0100, Pedro Sousa wrote:
>>> > Hi,
>>> >
>>> > I'm testing split-brain in a master/slave scenario with dopd and have
>>> some
>>> > doubts about the automatic recovery procedure. The steps I took were:
>>> >
>>> > 1º Unplug the crossover cable
>>> >
>>> > Master:
>>> >
>>> > Primary/Unknown ds:UpToDate/Outdated
>>> >
>>> > Slave:
>>> >
>>> > StandAlone ro:Secondary/Unknown ds:Consistent/DUnknown
>>> >
>>> > 2º Plug the cable back on:
>>> >
>>> > Both nodes remain with the same state: Update/Outdated and
>>> > Consistent/Unknown
>>> >
>>> > My question is: shouldn't the slave rejoin/resync to the master
>>> > automatically after I plug the cable?
>>> >
>>> > I have to manually  run: "drbdadm adjust all" to recover it.
>>>
>>> once a node reaches "StandAlone",
>>> you have to tell it to try and reconnect, yes.
>>>
>>> so this is how it is supposed to be.
>>>
>>> why it goes to "StandAlone" should be in the logs.
>>>
>>> > My conf (centos 5.3; drbd 8.3.1; heartbeat 2.99)
>>> >
>>> > /etc/drbd.conf
>>>
>>> </snip>
>>>
>>>
>>> --
>>> : Lars Ellenberg
>>> : LINBIT | Your Way to High Availability
>>> : DRBD/HA support and consulting http://www.linbit.com
>>>
>>> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>>> __
>>> please don't Cc me, but send to list   --   I'm subscribed
>>> _______________________________________________
>>> drbd-user mailing list
>>> drbd-user at lists.linbit.com
>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>>
>>
>>
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>
>