/ 2004-03-23 12:19:55 -0800
\ Doug Utzig:
>
> I'm evaluating DRBD and having issues with the initial sync.
> Below is the drbd.conf file and the info reported to the console.
>
> Thank you in advance.
>
> I start up drbd manually using "/etc/init.d/drbd start" on both
> systems. I make the first system primary.
>
> - Before each attempt, I remove /var/lib/drbd/drbd0.
> - During the initial resync, the underlying disks are not being
> used for any other purpose (i.e. the primary is otherwise idle).
> - Primary crashes regardless of which system is primary.
> - I've tried the config using both the 100Mbit and Gbit cards, but
> resync using either fails. The primary crashes more quickly
> when using Gbit.
> - I did have resync succeed once, but I'm not sure the steps that
> led up to it. Once the resync was complete, both sides were
> Secondary (per /proc/drbd). In order to use it, one side had to
> be promoted to Primary.
>
> Linux: RedHat AS2.1 2.4.9-e.25enterprise
> CPU: 4x2.80GHz Xeon
> RAM: 6GB
> NIC: 10/100 Intel PRO/1000 using e1000 driver
> 1Gbit Broadcom BCM5701 using bcm5700 driver
>
> / 2004-03-23 15:18:22 -0800
> \ Doug Utzig:
> > I had the problem with 0.6.10 originally.
> > I upgraded to 0.6.12 and still the same problem.
> # cat /etc/drbd.conf
Not exactly related, but some comments anyway:
> resource drbd0 {
>
> protocol = C
> fsckcmd = /bin/true
>
> disk {
> # disk-size = 139668736k
> }
>
> net {
> sndbuf-size = 8M
This is a lot. Reduce it, or explain why you need it.
This does not make sense (to me)
The rest seems OK.
> CPU: 2
> EIP: 0010:[<c01249f1>] Not tainted
> EFLAGS: 00010086
> EIP is at mod_timer [kernel] 0x131
Note that on an SMP box, the first "listed" oops need not be
the oops that *occurred* first.
Did you happen to see any other oopses/panics
at (almost) the same time?
Was one of them an
"NMI Watchdog detected lockup on CPU #"?
The listed bug is nonsense, but it reminds me of a very strange
case we had with a support customer last ... was it October?
It was RedHat AS2.1 2.4.9-e.25enterprise and a dual Xeon iirc,
using a qla storage device. It was not DRBD's fault; DRBD was only
the trigger.
2.4.9-e.25 is not exactly the most recent kernel.
There are more recent ones from the RH 2.4.9 series,
and there are the RH{E,A?}S 3 kernel series, which are based
on 2.4.18 (?) or newer.
We never could *prove* what exactly the problem was, but all the
pointers were that 2.4.9-whatever either did not properly
support SMP on Xeon (regarding memory ordering...[1]), and/or some
of the drivers for the storage device or NIC had SMP problems
(maybe due to [1]), and caused a CPU lockup,
which then triggers the NMI Watchdog panic...
Upgrading the kernel helped back then.
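To check a saved console or serial log for the pattern asked about above
(several oopses close together, one of them an NMI Watchdog lockup),
something like the following Python sketch may help. The message patterns
and the names find_crash_events and has_nmi_lockup are assumptions made
for illustration, not part of DRBD or the kernel; adjust the patterns to
the exact wording your kernel prints.

```python
import re

# Assumed message patterns; kernels differ in exact wording and case,
# so the lockup/oops/panic patterns are matched case-insensitively.
PATTERNS = {
    "nmi_lockup": re.compile(r"NMI Watchdog detected lockup", re.IGNORECASE),
    "oops": re.compile(r"\bOops\b", re.IGNORECASE),
    "panic": re.compile(r"Kernel panic", re.IGNORECASE),
    "eip": re.compile(r"EIP is at"),
}


def find_crash_events(log_text):
    """Return (line_number, kind, line) for every line that looks crash related.

    Events are returned in the order they appear in the log. On an SMP box
    that is not necessarily the order in which they happened.
    """
    events = []
    for line_no, line in enumerate(log_text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((line_no, kind, line.strip()))
                break  # report each log line at most once
    return events


def has_nmi_lockup(events):
    """True if any of the collected events is an NMI Watchdog lockup."""
    return any(kind == "nmi_lockup" for _, kind, _ in events)


if __name__ == "__main__":
    import sys

    # Usage: python scan_oops.py /path/to/saved-console.log
    with open(sys.argv[1], errors="replace") as handle:
        found = find_crash_events(handle.read())
    for line_no, kind, text in found:
        print(f"{line_no}: [{kind}] {text}")
    print("NMI Watchdog lockup seen:", has_nmi_lockup(found))
```

If an NMI Watchdog lockup shows up next to the mod_timer oops, that would
point towards the kernel/driver SMP problem rather than towards DRBD.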
Lars Ellenberg