[DRBD-user] Loss of Connection

George H george.dma at gmail.com
Tue Dec 5 07:28:23 CET 2006

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


I think I ran into the problem you mentioned, with DRBD destroying the
more recent side. It was really annoying me. I too had DRBD set to go
standalone so I could resync it manually, but I couldn't get a good
setup because my nodes kept split-braining. My setup used a crossover
cable for DRBD, with heartbeat running only over the LAN (no heartbeat
via the crossover). So when the LAN died, heartbeat lost contact even
though the DRBD crossover link was still up, and my nodes split-brained.

The solution I found (which may or may not work for you) was to STONITH
the node that shows any level of failure. This works by setting DRBD to
reconnect on disconnect and setting Heartbeat's auto-failback to false,
roughly as sketched below.
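
The relevant knobs, roughly. This is only a sketch of the DRBD 0.7 /
Heartbeat 1.x options involved; the resource name, node names and the
STONITH device below are made-up placeholders, not my actual files:

  # /etc/drbd.conf (DRBD 0.7) - net section of the resource
  resource r0 {
    net {
      on-disconnect reconnect;  # keep retrying instead of going StandAlone
      ko-count 4;               # give up on the peer after this many send
                                # timeouts (the "ko = ..." countdown in
                                # your log is this mechanism)
      timeout 60;               # in units of 0.1s, i.e. 6 seconds
    }
  }

  # /etc/ha.d/ha.cf (Heartbeat 1.x)
  auto_failback off                # a fenced node comes back as slave, stays slave
  stonith_host * ssh node1 node2   # placeholder; use a real power switch

The point of the combination is that a flaky link can never leave both
sides writable at the same time.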

So basically, if anything goes wrong, the slave node kills the master;
the DRBD resource becomes available, Heartbeat properly fails over, and
the slave becomes the new master. The killed node stays powered off
until an admin gets there. Once it is turned back on, DRBD connects and
simply resyncs. During my tests there was no data loss doing it this
way, and once the killed node has fully synced, it stays a slave
(because auto-failback is off).

BTW, if only one of the 8 loses its connection sporadically, and it's
always the same one, you could be looking at a hardware problem. You
may want to look into that; perhaps replace the disk and see if the
problem persists. If it does, you can rule the disk out; if it stops,
you have found your culprit.
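
A couple of quick checks before swapping parts (the device names here
are just examples, adjust to your system):

  smartctl -a /dev/sda                     # disk health, from smartmontools
  ethtool -S eth1                          # per-NIC error counters (your tg3 port)
  grep drbd0 /var/log/messages | tail -50  # do the drops line up with anything else?

Since the symptom is a dropped connection, the NIC counters and cabling
are worth a look as well as the disk.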

On 12/5/06, Holgilein <lists at loomsday.co.nz> wrote:
> > Just wondering, is drbd set up to reconnect upon a connection failure
> > or to go stand alone?
>
> The devices are configured to go standalone in case of a disconnect.
>
> When I initially set up DRBD on those machines, I wanted to avoid
> resyncing with swapped Primary <=> Secondary roles, i.e. destroying
> the most recent side of the DRBD device pair in such a connection-loss/
> resync scenario. So I decided to let the devices go standalone on a
> disconnect and to resync them manually.
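>
> For reference, the knob in question is DRBD 0.7's on-disconnect
> handler, roughly:
>
>   net {
>     on-disconnect stand_alone;   # drop to StandAlone instead of retrying
>   }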
>
> In hindsight, I am not too sure any more that that was the best decision -
> if DRBD made such a mistake during an automatic resync, it would most
> certainly do exactly the same thing when I reconnect the devices by
> hand.
>
> > Perhaps a network failure is occurring and your
> > nodes are getting split-brained, thus no failover is being done (just a
> > thought here)
>
> Currently, I have 8 DRBD device pairs on this server. Only one loses
> its connection sporadically, and Heartbeat does not kick in (which
> is GOOD, to say the least).
>
> Nevertheless, I had a combination of errors once:
>
> 1) Loss of connection, leading to "StandAlone" vs "Unknown" state.
> 2) A server problem that caused a reboot.
> 3) After the reboot, the failover node came up on the "Unknown"
>     (i.e. old) side of the device pair.
> 4) Manually resyncing then finished off the good side, as the bad
>     side was by now considered the most recent one and became
>     SyncSource (see the sketch after this list for how to pick the
>     SyncSource by hand).
> 5) Having daily disk backups based on rsync is wise.
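>
> For the record, the safe way to do that manual resync is to pick the
> SyncSource yourself; a sketch, with a hypothetical resource name r0:
>
>   # on the node known to hold the stale data (the "bad" side):
>   drbdadm invalidate r0   # mark local data outdated; it becomes SyncTarget
>   # then on both nodes:
>   drbdadm connect r0      # reconnect; the good side becomes SyncSource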
>
> I was lucky enough to detect this right after the reboot, otherwise
> I would have been in real trouble: everything written since the
> devices lost their connection would have been gone.
>
> Anyway, nowadays I am using SEC (http://simple-evcorr.sourceforge.net)
> to monitor the syslog and email me whenever DRBD-related errors pop
> up. It's been a life-saver...
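>
> A minimal rule for that looks something like this (the pattern and
> the mail command are my own, adjust to taste):
>
>   type=Single
>   ptype=RegExp
>   pattern=drbd\d+: .*(NetworkFailure|BrokenPipe|StandAlone)
>   desc=DRBD state change: $0
>   action=pipe '$0' /usr/bin/mail -s 'DRBD alert' root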
>
> To recap, the question remains: Shall I simply use the reconnect
> option for these devices and relax, or shall I be concerned about
> the periodic loss of connection?
>
> > On 12/4/06, Holgilein <lists at loomsday.co.nz> wrote:
> >> Hi there,
> >>
> >> I am running drbd-0.7.22 on kernel 2.6.17.11 (including Vserver patch
> >> set v2.0.2-rc31), and I am using eth1 (tg3 driver on Broadcom BCM5780
> >> Gigabit adapter) on a 2-node-cluster to synchronise the DRBD devices.
> >>
> >> This network interface is also used to pull backups across the
> >> two nodes, and when the link is under heavy load I observed that
> >> sometimes (every 5-6 days) DRBD loses its inter-node connection:
> >>
> >> Dec  5 05:40:45 wgr-host1 kernel: drbd0: [reiserfs/3/1830] sock_sendmsg time expired, ko = 3
> >> Dec  5 05:40:48 wgr-host1 kernel: drbd0: [reiserfs/3/1830] sock_sendmsg time expired, ko = 2
> >> Dec  5 05:40:51 wgr-host1 kernel: drbd0: [reiserfs/3/1830] sock_sendmsg time expired, ko = 1
> >> Dec  5 05:40:54 wgr-host1 kernel: drbd0: reiserfs/3 [1830]: cstate Connected --> NetworkFailure
> >> Dec  5 05:40:54 wgr-host1 kernel: drbd0: drbd0_receiver [10068]: cstate NetworkFailure --> BrokenPipe
> >> Dec  5 05:40:54 wgr-host1 kernel: drbd0: short read expecting header on sock: r=-512
> >> Dec  5 05:40:54 wgr-host1 kernel: drbd0: asender terminated
> >> Dec  5 05:40:54 wgr-host1 kernel: drbd0: worker terminated
> >> Dec  5 05:40:54 wgr-host1 kernel: drbd0: drbd0_receiver [10068]: cstate BrokenPipe --> Unconnected
> >> Dec  5 05:40:54 wgr-host1 kernel: drbd0: Connection lost.
> >> Dec  5 05:40:54 wgr-host1 kernel: drbd0: drbd0_receiver [10068]: cstate Unconnected --> StandAlone
> >> Dec  5 05:40:54 wgr-host1 kernel: drbd0: receiver terminated
> >>
> >> Is this a known behaviour, and is there anything I can do to remedy it?
> >>
> >> Many thanks,
> >>
> >> Holger


-- 
"The probability of anything happening is in inverse ratio to its desirability"
"If I were a roman statue, I'd be made alabastard"
--
George H
george.dma at gmail.com


