I'm not sure if it is related, but when I was running version 3.16 of
the kernel on our Gentoo servers here I had nothing but trouble with
DRBD+OCFS2. I was using whatever DRBD the kernel provided (which I
believe was 8.4.3), and the trouble went away when I moved back down to
a lower kernel version.

It should be noted that 3.16 was only supported for about 3 months
before they moved on to 3.17 and above:

http://en.wikipedia.org/wiki/Linux_kernel#3.x.y_releases

I haven't tried the current stable branch of 3.18, but it might be worth
a try if you can.

On Thu, Apr 9, 2015 at 1:07 PM, Alan Evetts <alan at wrfinance.com> wrote:

> Hi there,
>
> I am reaching out because we have been trying to find stability in our
> move to DRBD. It is amazing in concept, but we have struggled with it
> for 6 months. I am going to lay out everything we are doing, as the
> problem starts and stops when we introduce/remove DRBD from the
> picture. Obviously, these setups get complicated, so hopefully this
> isn’t too much information.
>
> What we are trying to do is have a pair of Dell R610 machines, each
> running DRBD and Xen with about 8 DRBD partitions, each master running
> half of the Xen virtual machines.
>
> It seems that every 1 to 20 days we get a kernel panic on one machine,
> which will often drag down the second machine. Details of the most
> recent panic are below.
>
> In order to rule out problems we have:
>   - Replaced both Dell R610s (we have 4 now in total, all with the
>     same problem)
>   - Upgraded to Debian Jessie from Debian Wheezy
>   - Moved to xen-hypervisor-4.4-amd64, drbd Debian version
>     8.9.2~rc1-2, kernel 3.16.0-4
>   - Switched from the on-board Broadcom NICs to an Intel E1G44HTBLK
>     4 port PCI-e NIC
>   - Upgraded to igb kernel module 5.2.17 and rebuilt it into the
>     initrd as well
>
> The 2 servers both have lots of resources (64 gigs of RAM, quad Xeon
> 2.4, 6 * 1 TB drives in a RAID 10). There is a crossover cable on ETH3
> for DRBD, and each DRBD instance runs on its own port on ETH3. The Xen
> config runs on a bridge.
>
> The problem has more or less been the same as we’ve moved through all
> of the hardware and software versions over the past 6 months. It
> rotates between the servers.
>
> I am hoping someone can spot a problem in our config, or guide us on
> what to try from here. All 4 Dell machines have been patched and had
> the diagnostics run on them without issue.
>
> The problem: one of the machines will have a transmit queue time-out
> on an interface (oddly, not necessarily the DRBD interface, but
> usually). From there, a panic, and the NIC will start going up and
> down. This then starts to drive the load up, and the machine soon
> becomes unresponsive over shell. Connected over the dRAC remote access
> port, sooner or later we see errors about the drives not responding. I
> think this is from the load, but I do not know for sure. From this
> point the machine will sometimes drag down its paired DRBD machine,
> and sometimes not. The one with the crash needs a hard reboot at this
> point.
>
> We love DRBD, its simplicity and functionality, but it introduces
> these frequent crashes, which are not worth it. Hoping someone can
> spot an error we are making here, or has ideas on what to try.
>
> Thanks in advance for any help. FYI, this crash used to happen in the
> Broadcom queue, now it's the Intel queue, and only when we have DRBD
> enabled.
>
> Apr 9 03:39:17 v2 kernel: [141714.850432] ------------[ cut here ]------------
> Apr 9 03:39:17 v2 kernel: [141714.850521] WARNING: CPU: 0 PID: 0 at /build/linux-y7bjb0/linux-3.16.7-ckt4/net/sched/sch_generic.c:264 dev_watchdog+0x236/0x240()
> Apr 9 03:39:17 v2 kernel: [141714.850527] NETDEV WATCHDOG: eth1 (igb): transmit queue 0 timed out
> Apr 9 03:39:17 v2 kernel: [141714.850531] Modules linked in: xt_tcpudp xt_physdev iptable_filter ip_tables x_tables xen_netback xen_blkback nfnetlink_queue nfnetlink_log nfnetlink bluetooth 6lowpan_iphc rfkill xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc bridge stp llc ttm drm_kms_helper joydev drm i2c_algo_bit i2c_core pcspkr wmi iTCO_wdt iTCO_vendor_support psmouse dcdbas serio_raw evdev tpm_tis tpm lpc_ich mfd_core acpi_power_meter button coretemp i7core_edac edac_core shpchp processor thermal_sys loop ipmi_watchdog ipmi_si ipmi_poweroff ipmi_devintf ipmi_msghandler drbd lru_cache libcrc32c autofs4 ext4 crc16 mbcache jbd2 dm_mod sg sd_mod crc_t10dif crct10dif_generic sr_mod cdrom ses crct10dif_common enclosure ata_generic hid_generic usbhid hid crc32c_intel ata_piix ehci_pci uhci_hcd libata igb(O) megaraid_sas ehci_hcd scsi_mod usbcore dca ptp usb_common pps_core
> Apr 9 03:39:17 v2 kernel: [141714.850609] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 3.16.0-4-amd64 #1 Debian 3.16.7-ckt4-3
> Apr 9 03:39:17 v2 kernel: [141714.850613] Hardware name: Dell Inc. PowerEdge R610/0XDN97, BIOS 6.4.0 07/23/2013
> Apr 9 03:39:17 v2 kernel: [141714.850617] 0000000000000009 ffffffff815096a7 ffff880079e03e28 ffffffff810676f7
> Apr 9 03:39:17 v2 kernel: [141714.850622] 0000000000000000 ffff880079e03e78 0000000000000010 0000000000000000
> Apr 9 03:39:17 v2 kernel: [141714.850626] ffff8800445c8000 ffffffff8106775c ffffffff81777270 ffffffff00000030
> Apr 9 03:39:17 v2 kernel: [141714.850631] Call Trace:
> Apr 9 03:39:17 v2 kernel: [141714.850635] <IRQ> [<ffffffff815096a7>] ? dump_stack+0x41/0x51
> Apr 9 03:39:17 v2 kernel: [141714.850652] [<ffffffff810676f7>] ? warn_slowpath_common+0x77/0x90
> Apr 9 03:39:17 v2 kernel: [141714.850660] [<ffffffff8106775c>] ? warn_slowpath_fmt+0x4c/0x50
> Apr 9 03:39:17 v2 kernel: [141714.850669] [<ffffffff81074647>] ? mod_timer+0x127/0x1e0
> Apr 9 03:39:17 v2 kernel: [141714.850676] [<ffffffff8143ce76>] ? dev_watchdog+0x236/0x240
> Apr 9 03:39:17 v2 kernel: [141714.850681] [<ffffffff8143cc40>] ? dev_graft_qdisc+0x70/0x70
> Apr 9 03:39:17 v2 kernel: [141714.850686] [<ffffffff810729b1>] ? call_timer_fn+0x31/0x100
> Apr 9 03:39:17 v2 kernel: [141714.850691] [<ffffffff8143cc40>] ? dev_graft_qdisc+0x70/0x70
> Apr 9 03:39:17 v2 kernel: [141714.850698] [<ffffffff81073fe9>] ? run_timer_softirq+0x209/0x2f0
> Apr 9 03:39:17 v2 kernel: [141714.850704] [<ffffffff8106c591>] ? __do_softirq+0xf1/0x290
> Apr 9 03:39:17 v2 kernel: [141714.850709] [<ffffffff8106c965>] ? irq_exit+0x95/0xa0
> Apr 9 03:39:17 v2 kernel: [141714.850718] [<ffffffff813579c5>] ? xen_evtchn_do_upcall+0x35/0x50
> Apr 9 03:39:17 v2 kernel: [141714.850725] [<ffffffff8151141e>] ? xen_do_hypervisor_callback+0x1e/0x30
> Apr 9 03:39:17 v2 kernel: [141714.850728] <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> Apr 9 03:39:17 v2 kernel: [141714.850737] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> Apr 9 03:39:17 v2 kernel: [141714.850746] [<ffffffff81009e0c>] ? xen_safe_halt+0xc/0x20
> Apr 9 03:39:17 v2 kernel: [141714.850756] [<ffffffff8101c959>] ? default_idle+0x19/0xb0
> Apr 9 03:39:17 v2 kernel: [141714.850764] [<ffffffff810a7dc0>] ? cpu_startup_entry+0x340/0x400
> Apr 9 03:39:17 v2 kernel: [141714.850770] [<ffffffff81902071>] ? start_kernel+0x492/0x49d
> Apr 9 03:39:17 v2 kernel: [141714.850775] [<ffffffff81901a04>] ? set_init_arg+0x4e/0x4e
> Apr 9 03:39:17 v2 kernel: [141714.850781] [<ffffffff81903f64>] ? xen_start_kernel+0x569/0x573
> Apr 9 03:39:17 v2 kernel: [141714.850785] ---[ end trace ee11063cf033829a ]---
> Apr 9 03:39:17 v2 kernel: [141714.871945] br1: port 1(eth1) entered disabled state
> Apr 9 03:39:20 v2 kernel: [141718.210743] igb 0000:05:00.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> Apr 9 03:39:20 v2 kernel: [141718.210913] br1: port 1(eth1) entered forwarding state
> Apr 9 03:39:20 v2 kernel: [141718.210923] br1: port 1(eth1) entered forwarding state
> Apr 9 03:39:26 v2 kernel: [141723.863194] br1: port 1(eth1) entered disabled state
> Apr 9 03:39:30 v2 kernel: [141727.650897] igb 0000:05:00.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> Apr 9 03:39:30 v2 kernel: [141727.651040] br1: port 1(eth1) entered forwarding state
> Apr 9 03:39:30 v2 kernel: [141727.651053] br1: port 1(eth1) entered forwarding state
> Apr 9 03:39:31 v2 kernel: [141728.890509] ata1: lost interrupt (Status 0x50)
> Apr 9 03:39:31 v2 kernel: [141728.890560] sr 1:0:0:0: CDB:
> Apr 9 03:39:31 v2 kernel: [141728.890563] Get event status notification: 4a 01 00 00 10 00 00 00 08 00
> Apr 9 03:39:31 v2 kernel: [141728.890630] ata1: hard resetting link
> Apr 9 03:39:31 v2 kernel: [141729.366592] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> Apr 9 03:39:32 v2 kernel: [141729.406749] ata1.00: configured for UDMA/100
> Apr 9 03:39:32 v2 kernel: [141729.408192] ata1: EH complete
> Apr 9 03:39:35 v2 kernel: [141732.711653] br1: port 1(eth1) entered disabled state
> Apr 9 03:39:37 v2 kernel: [141734.678485] drbd s3: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr 9 03:39:37 v2 kernel: [141734.678846] drbd s3: asender terminated
> Apr 9 03:39:37 v2 kernel: [141734.678852] drbd s3: Terminating drbd_a_s3
> Apr 9 03:39:37 v2 kernel: [141734.678956] drbd s3: Connection closed
> Apr 9 03:39:37 v2 kernel: [141734.678972] drbd s3: conn( NetworkFailure -> Unconnected )
> Apr 9 03:39:37 v2 kernel: [141734.678974] drbd s3: receiver terminated
> Apr 9 03:39:37 v2 kernel: [141734.678976] drbd s3: Restarting receiver thread
> Apr 9 03:39:37 v2 kernel: [141734.678977] drbd s3: receiver (re)started
> Apr 9 03:39:37 v2 kernel: [141734.678987] drbd s3: conn( Unconnected -> WFConnection )
> Apr 9 03:39:38 v2 kernel: [141735.718898] igb 0000:05:00.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> Apr 9 03:39:38 v2 kernel: [141735.719086] br1: port 1(eth1) entered forwarding state
> Apr 9 03:39:38 v2 kernel: [141735.719095] br1: port 1(eth1) entered forwarding state
> Apr 9 03:39:39 v2 kernel: [141737.154575] drbd s4: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr 9 03:39:39 v2 kernel: [141737.154671] block drbd1: new current UUID 461FF401E0489AAB:9279A3BA4A3A710B:0E977CC4BB5727A9:0E967CC4BB5727A9
> Apr 9 03:39:39 v2 kernel: [141737.154921] drbd s4: asender terminated
> Apr 9 03:39:39 v2 kernel: [141737.154928] drbd s4: Terminating drbd_a_s4
> Apr 9 03:39:39 v2 kernel: [141737.155289] drbd s4: Connection closed
> Apr 9 03:39:39 v2 kernel: [141737.155579] drbd s4: conn( NetworkFailure -> Unconnected )
> Apr 9 03:39:39 v2 kernel: [141737.155583] drbd s4: receiver terminated
> Apr 9 03:39:39 v2 kernel: [141737.155585] drbd s4: Restarting receiver thread
> Apr 9 03:39:39 v2 kernel: [141737.155586] drbd s4: receiver (re)started
> Apr 9 03:39:39 v2 kernel: [141737.155601] drbd s4: conn( Unconnected -> WFConnection )
> Apr 9 03:39:41 v2 kernel: [141738.458578] drbd n5: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr 9 03:39:41 v2 kernel: [141738.458671] block drbd8: new current UUID 808265F24E5A3F21:B63FFF468380B383:240D9C7D536ACB97:240C9C7D536ACB97
> Apr 9 03:39:41 v2 kernel: [141738.458885] drbd n5: asender terminated
> Apr 9 03:39:41 v2 kernel: [141738.458893] drbd n5: Terminating drbd_a_n5
> Apr 9 03:39:41 v2 kernel: [141738.459160] drbd n5: Connection closed
> Apr 9 03:39:41 v2 kernel: [141738.459316] drbd n5: conn( NetworkFailure -> Unconnected )
> Apr 9 03:39:41 v2 kernel: [141738.459319] drbd n5: receiver terminated
> Apr 9 03:39:41 v2 kernel: [141738.459321] drbd n5: Restarting receiver thread
> Apr 9 03:39:41 v2 kernel: [141738.459322] drbd n5: receiver (re)started
> Apr 9 03:39:41 v2 kernel: [141738.459336] drbd n5: conn( Unconnected -> WFConnection )
> Apr 9 03:39:44 v2 kernel: [141742.202552] drbd r1: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr 9 03:39:44 v2 kernel: [141742.202913] drbd r1: asender terminated
> Apr 9 03:39:44 v2 kernel: [141742.202920] drbd r1: Terminating drbd_a_r1
> Apr 9 03:39:44 v2 kernel: [141742.203023] drbd r1: Connection closed
> Apr 9 03:39:44 v2 kernel: [141742.203039] drbd r1: conn( NetworkFailure -> Unconnected )
> Apr 9 03:39:44 v2 kernel: [141742.203041] drbd r1: receiver terminated
> Apr 9 03:39:44 v2 kernel: [141742.203043] drbd r1: Restarting receiver thread
> Apr 9 03:39:44 v2 kernel: [141742.203044] drbd r1: receiver (re)started
> Apr 9 03:39:44 v2 kernel: [141742.203054] drbd r1: conn( Unconnected -> WFConnection )
>
> Etc.
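
As a rough aid for reading a trace like the one quoted above, the sketch below pulls out three kinds of events (the NETDEV WATCHDOG timeout, NIC link-up messages, and DRBD connection-state changes) and prints each with its offset from the first event found. It is only an illustration and not something from the original thread: the regular expressions are assumptions based on the message formats visible in this log, the names `classify` and `events` are hypothetical, and it expects one log entry per line, as in an unwrapped syslog.

```python
import re

# Patterns are assumptions derived from the messages visible in the quoted log.
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]")  # kernel uptime stamp, e.g. [141714.850527]
WATCHDOG = re.compile(r"NETDEV WATCHDOG: (\S+) \((\S+)\): transmit queue (\d+) timed out")
LINK_UP = re.compile(r"(\S+) NIC Link is Up")
DRBD_CONN = re.compile(r"drbd (\S+): .*conn\( (\w+) -> (\w+) \)")

# Three entries copied from the quoted log, one per line.
SAMPLE = """\
Apr 9 03:39:17 v2 kernel: [141714.850527] NETDEV WATCHDOG: eth1 (igb): transmit queue 0 timed out
Apr 9 03:39:20 v2 kernel: [141718.210743] igb 0000:05:00.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Apr 9 03:39:37 v2 kernel: [141734.678485] drbd s3: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
"""


def classify(line):
    """Return (uptime_seconds, kind, detail) for a line of interest, else None."""
    stamp = STAMP.search(line)
    if stamp is None:
        return None
    t = float(stamp.group(1))
    m = WATCHDOG.search(line)
    if m:
        return (t, "watchdog", f"{m.group(1)} queue {m.group(3)}")
    m = LINK_UP.search(line)
    if m:
        return (t, "link_up", m.group(1))
    m = DRBD_CONN.search(line)
    if m:
        return (t, "drbd_conn", f"{m.group(1)}: {m.group(2)} -> {m.group(3)}")
    return None


def events(lines):
    """Collect the recognised events, in the order they appear."""
    return [e for e in map(classify, lines) if e is not None]


if __name__ == "__main__":
    found = events(SAMPLE.splitlines())
    start = found[0][0]
    for t, kind, detail in found:
        # Offset in seconds from the first recognised event.
        print(f"+{t - start:8.3f}s  {kind:9s}  {detail}")
```

To run it over a real file, replace `SAMPLE.splitlines()` with an open file handle; lines that match none of the patterns are simply skipped.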

-- 
Adam Randall
http://www.xaren.net
AIM: blitz574
Twitter: @randalla0622

"To err is human... to really foul up requires the root password."