[DRBD-user] DRBD initial sync crashes primary

Wed Mar 24 13:33:45 CET 2004

/ 2004-03-23 12:19:55 -0800
\ Doug Utzig:
> 
> I'm evaluating DRBD and having issues with the initial sync.
> Below is the drbd.conf file and the info reported to the console.
> 
> Thank you in advance.
> 
> I start up drbd manually using "/etc/init.d/drbd start" on both
> systems.  I make the first system primary.
> 
> - Before each attempt, I remove /var/lib/drbd/drbd0.
> - During the initial resync, the underlying disks are not being
>   used for any other purpose (i.e. the primary is otherwise idle).
> - Primary crashes regardless of which system is primary.
> - I've tried the config using both the 100Mbit and Gbit cards, but
>   resync using either fails.  The primary crashes more quickly
>   when using Gbit.
> - I did have resync succeed once, but I'm not sure which steps
>   led up to it.  Once the resync was complete, both sides were
>   Secondary (per /proc/drbd).  In order to use it, one side had to
>   be promoted to Primary.
> 
> Linux: RedHat AS2.1 2.4.9-e.25enterprise
> CPU:   4x2.80GHz Xeon
> RAM:   6GB
> NIC:   10/100 Intel PRO/1000 using e1000 driver
>        1Gbit Broadcom BCM5701 using bcm5700 driver
> 
> / 2004-03-23 15:18:22 -0800
> \ Doug Utzig:
> > I had the problem with 0.6.10 originally.
> > I upgraded to 0.6.12 and still the same problem.

> # cat /etc/drbd.conf
Not exactly related, but some comments anyway:
> resource drbd0 {
> 
>   protocol = C
>   fsckcmd  = /bin/true
> 
>   disk {
> #    disk-size = 139668736k
>   }
> 
>   net {
>     sndbuf-size = 8M 

This is a LOT.  Reduce it, or explain why you need it.
An 8M send buffer does not make sense (to me).

The rest seems OK.

> CPU:    2
> EIP:    0010:[<c01249f1>]   Not tainted
> EFLAGS: 00010086
> EIP is at mod_timer [kernel] 0x131

Note that on an SMP box, the first oops "listed" need not be
the oops that *occurred* first.
Did you happen to see any other oopses/panics
at (almost) the same time?
Was one of them a
	"NMI Watchdog detected lockup on CPU #" ?

The listed bug is nonsense, but it reminds me of a very strange
case we had with a support customer last ... was it October?

It was RedHat AS2.1 2.4.9-e.25enterprise and a dual Xeon iirc,
using a qla storage device. It was not DRBD's fault; DRBD was only
the trigger.

2.4.9-e.25 is not exactly the most recent kernel. 
There are more recent ones from the RH 2.4.9 series,
and there are the RH{E,A?}S 3 kernel series, which are based
on 2.4.21.

We never could *prove* what exactly the problem was, but all the
pointers were that 2.4.9-whatever either did not properly
support SMP on Xeon (regarding memory ordering...[1]), and/or some
of the drivers for the storage device or NIC had SMP problems
(maybe due to [1]) and caused a CPU lockup,
which then triggers the NMI Watchdog panic...

Upgrading the kernel helped back then.
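
If you want to check a captured console log for the other oopses
and for an NMI watchdog lockup, something like the following
Python sketch may help. The message patterns and the helper name
find_crash_lines are my assumptions based on typical 2.4 kernel
output, not anything DRBD provides. Note also that an oops during
a hard lockup often never reaches the disk, so a serial console
capture is usually a better input than the syslog file.

```python
import re
import sys

# Assumed message fragments from typical 2.4 kernel output;
# adjust them to whatever your console actually prints.
PATTERNS = [
    re.compile(r"NMI Watchdog detected lockup", re.IGNORECASE),
    re.compile(r"\bOops\b", re.IGNORECASE),
    re.compile(r"EIP is at"),
    re.compile(r"Kernel panic", re.IGNORECASE),
]


def find_crash_lines(lines):
    """Return (line_number, text) for every line matching a crash pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


# Only runs when a log file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as log:
        for number, text in find_crash_lines(log):
            print(f"{number}: {text}")
```

Run it with the path to the captured log as the only argument.
The hits come out in file order, which (see above) is not
necessarily the order in which the CPUs actually oopsed.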

	Lars Ellenberg