[DRBD-user] restart of both servers after network failure ??? (large)

Sat May 9 21:54:41 CEST 2009

Yes, this sounds like the cascading failure I had. (I reconfigured
the network card on a secondary and the primary crashed; that crashed
primary had been acting as secondary for another machine, which
promptly crashed too!)

[my cisco switch has a habit of "isolating" interfaces when they change 
configs, and may have been a contributing factor, making a 1/4 second 
reconfig into a 30 second outage.]

AFAIK, it is only a problem when you export /dev/drbd* devices to xen
guests. The simplest fix was to switch to protocol A. The technical details
were a bit complicated and I've forgotten some of them, but basically, the
data being sent was being freed by something related to the xen blkback
driver (?). Changing to protocol A means that drbd forgets about the data
sooner and doesn't try to access it after the free(). Lars suggested (or
expects) that the 8.3.2 release will have "nosendpage" support, which is
effectively what is needed to avoid the problem.
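
For reference, a minimal sketch of that protocol switch in drbd.conf,
assuming a resource named "r0" (a hypothetical name; the rest of the
existing resource definition stays unchanged). With protocol A a write is
reported complete once it has reached the local disk and the local TCP
send buffer, so the most recent writes can be lost if the primary dies
before they reach the peer:

```
resource r0 {
  # was: protocol C;  (synchronous replication)
  protocol A;
  # the existing per-host sections and disk/net options stay as they are
}
```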

Since finding the problem, I've just avoided it by not exporting
drbd devices directly to xen.

If you really want protocol C, the code change is pretty simple: there
already is a "_drbd_no_send_page()" function, so you just need to change
the logic so that it always goes down that path (currently it's a
special-case exception).

That said, AFAIK, 0.7 should be fine. I have only encountered the
issue with the 8.x versions...

-Tom

On Sat, 9 May 2009, Lars Ellenberg wrote:

> On Fri, May 08, 2009 at 05:38:06PM -0400, Victor Hugo dos Santos wrote:
>> On Fri, Feb 20, 2009 at 2:37 PM, Victor Hugo dos Santos
>> <listas.vhs at gmail.com> wrote:
>>> Hello,
>>>
>>> I have a problem with drbd-0.7.25 and drbd-8.2.6. My situation is:
>>>
>>> two Supermicro servers in company A, connected with a crossover cable,
>>> running CentOS 5.2 (all updates installed)
>>> two PowerEdge servers in company B, connected over network fiber
>>> between separate sites, running Citrix XenServer 4.
>>>
>>> The problem is that from time to time both servers restart for no
>>> apparent reason. The logs show only messages about a network failure,
>>> and after that the servers restart.
>>> In company A this problem occurred 2 or 3 times, and the last
>>> incident was 4 months ago.
>>> I had forgotten about it, because I thought it could have been caused
>>> by the electrical power line in that company.
>>> But now, in company B, I have the same problem for the first time
>>> (after several months of working fine), and these servers are
>>> connected to a UPS line.
>>>
>>> Both server groups run a virtualization server, but from
>>> different vendors and with different configurations.
>>> Memory, disks and network work fine on all four servers, and the DRBD
>>> resources contain only data from the VMs, no files/data from the host
>>> server itself.
>>>
>>> I don't understand why the servers restart when they receive an error
>>> from the network!
>>> In case of a problem, I would expect the VMs to restart, but
>>> not the complete server.
>>>
>>> Here are the logs and config files of the two servers in company B...