/ 2004-03-23 12:19:55 -0800
\ Doug Utzig:
>
> I'm evaluating DRBD and having issues with the initial sync.
> Below is the drbd.conf file and the info reported to the console.
>
> Thank you in advance.
>
> I start up drbd manually using "/etc/init.d/drbd start" on both
> systems. I make the first system primary.
>
> - Before each attempt, I remove /var/lib/drbd/drbd0.
> - During the initial resync, the underlying disks are not being
>   used for any other purpose (i.e. the primary is otherwise idle).
> - Primary crashes regardless of which system is primary.
> - I've tried the config using both the 100Mbit and Gbit cards, but
>   resync using either fails. The primary crashes more quickly
>   when using Gbit.
> - I did have resync succeed once, but I'm not sure the steps that
>   led up to it. Once the resync was complete, both sides were
>   Secondary (per /proc/drbd). In order to use it, one side had to
>   be promoted to Primary.
>
> Linux: RedHat AS2.1 2.4.9-e.25enterprise
> CPU: 4x2.80GHz Xeon
> RAM: 6GB
> NIC: 10/100 Intel PRO/1000 using e1000 driver
>      1Gbit Broadcom BCM5701 using bcm5700 driver
>
> / 2004-03-23 15:18:22 -0800
> \ Doug Utzig:
>
> I had the problem with 0.6.10 originally.
>
> I upgraded to 0.6.12 and still the same problem.

> # cat /etc/drbd.conf

Not exactly related, but some comments anyway:

> resource drbd0 {
>
>   protocol = C
>   fsckcmd = /bin/true
>
>   disk {
>     # disk-size = 139668736k
>   }
>
>   net {
>     sndbuf-size = 8M

This is a LOT. Reduce it, or explain.
It does not make sense (to me).

The rest seems ok.

> CPU: 2
> EIP: 0010:[<c01249f1>] Not tainted
> EFLAGS: 00010086
> EIP is at mod_timer [kernel] 0x131

Note that on an SMP box, the first "listed" oops does not need to be
the oops that *occurred* first. Did you happen to see some other
oopses/panics at (almost) the same time? Did one of them happen to be
a "NMI Watchdog detected lockup on CPU #"?

The listed bug is nonsense, but it reminds me of a very strange case
we had with a support customer last ... was it October?
It was RedHat AS2.1 2.4.9-e.25enterprise and a Dual Xeon iirc,
using a qla storage device.
It was not DRBD's fault; DRBD was only the trigger.

2.4.9-e.25 is not exactly the most recent kernel.
There are more recent ones from the RH 2.4.9 series, and there are
the RH{E,A?}S 3 kernel series, which are based on 2.4.18 (?) or newer.

We never could *prove* what the problem was exactly, but all the
pointers were that 2.4.9-whatever either did not properly support SMP
on Xeon (regarding memory ordering...[1]), and/or some of the drivers
for the storage device or NIC had SMP problems (maybe due to [1]) and
caused a CPU lockup, which then triggers the NMI Watchdog panic...

Upgrading the kernel helped back then.

	Lars Ellenberg