[DRBD-user] Problems with applications on top of DRBD locking up, becoming defunct

Lars Ellenberg lars.ellenberg at linbit.com
Tue Sep 4 17:04:38 CEST 2007


On Tue, Sep 04, 2007 at 01:21:56PM +0200, Lavender, Ben wrote:
> This seems to be solved,

I'm happy to hear that.

> so just to put this on the mailing list, since
> I'm sure other people have encountered these issues:
> 
> A lot of new Dell PowerEdge servers come with a TCP Offload Engine
> associated with the onboard Broadcom Gigabit NICs.  Though it would seem
> that simply not installing a driver for this would mean it does nothing
> and there would be no problems, this is not at all the case.
> 
> Turning off the TOE requires one to open the case and remove a
> dongle/widget/plugged-in thing with an RJ-11 style connector.  There's
> no software switch on the boxes I have.
> 
> The TOE problems manifested themselves in other applications, such as
> tcpdump, but nowhere were they as clear as DRBD.  Removing the physical
> TOE bit seems to have ended my DRBD issues.  I was locking up every few
> hours, and have been running now for a week without issues.  I even
> recovered nicely from a split-brain after the switch went down during
> some power testing, with no rebooting or module issues required.
> 
> (PS--allowing DRBD to send messages over a null modem cable along with
> heartbeat to avoid a split brain on network device failure would be
> swell.  Dedicated Gigabit NICs are expensive and I don't have ports to
> spare on all of my machines).

It does not need to be a dedicated NIC. It is just better for
performance to keep application data and replication data off the same
link.

Also, there is the "outdate-peer" mechanism, which maybe should be
generalized and given a different name.  Basically, it is a
userland helper hook, which can be configured to be called when we want
to stay Primary after connection loss, or when we want to become Primary
but cannot talk to our peer.  One implementation involves the "dopd"
plugin to heartbeat, enabling it to use the heartbeat communication
framework for some emergency communication.
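
A minimal sketch of how such a hook might be wired up, assuming
DRBD 8.0-era drbd.conf syntax. The resource name "r0" and the helper
path are assumptions for illustration and vary by distribution and
version; check the drbd.conf man page for your release before using any
of this.

```
# Illustrative drbd.conf fragment; "r0" is a hypothetical resource name.
resource r0 {
  disk {
    # Ask DRBD to call the outdate-peer handler on connection loss
    # instead of carrying on without consulting anyone.
    fencing resource-only;
  }
  handlers {
    # Userland helper hook. The path shown is an assumption; this helper
    # is the one that talks to dopd over the heartbeat links.
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
  }
}
```

For the dopd variant, heartbeat itself must also be configured to start
dopd; that part lives in heartbeat's own configuration and is not shown
here.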

What you definitely do not want is "drbd over heartbeat comm".

And, btw, something apparently switched one server to Primary while the
other was still Primary, which was a wrong decision.
If that was heartbeat, you either misconfigured it,
or it could not communicate with its peer either.
If the latter, out-of-band comm via null modem apparently did not help.

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :