[DRBD-user] Many Kernel Crashes since switch to DRBD

Adam Randall randalla at gmail.com
Fri Apr 10 20:12:36 CEST 2015



I'm not sure if it is related, but when I was running version 3.16 of the
kernel on our Gentoo servers here I had nothing but trouble with
DRBD+OCFS2. I was using whatever DRBD version the kernel provided (which I
believe was 8.4.3), and the trouble went away when I moved back down to a
lower kernel version. It should be noted that 3.16 was only supported for
about 3 months before they moved on to 3.17 and above:
http://en.wikipedia.org/wiki/Linux_kernel#3.x.y_releases

I haven't tried the current stable branch of 3.18, but it might be worth a
try if you can.
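
If you want to check quickly which machines are still on the 3.16 series,
something like the sketch below would do. The function names are my own
invention for illustration, and treating 3.16 as the "suspect" series is
only my guess from what I saw here, not a confirmed cause.

```python
import re


def kernel_series(release):
    """Return (major, minor) from a release string such as '3.16.0-4-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_suspect_series(release, suspect=(3, 16)):
    """True if the release belongs to the series assumed to be troublesome."""
    return kernel_series(release) == suspect


# The release string reported in the trace quoted below.
print(is_suspect_series("3.16.0-4-amd64"))
```

On a live box you could pass in platform.release() instead of the fixed
string.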

On Thu, Apr 9, 2015 at 1:07 PM, Alan Evetts <alan at wrfinance.com> wrote:

> Hi there,
>
> I am reaching out because we have been trying to find stability in our
> move to DRBD.  It is amazing in concept, but we have struggled with it
> for 6 months.  I am going to lay out everything we are doing, as the
> problem starts and stops when we introduce/remove DRBD from the picture.
> Obviously, these setups get complicated, so hopefully this isn’t too much
> information.
>
> What we are trying to do is have a pair of Dell R610 machines, each
> running DRBD and Xen with about 8 DRBD partitions, and each master
> running half of the Xen virtual machines.
>
> It seems that every 1 to 20 days we receive a kernel panic on one
> machine, which will often drag down the second machine.  Details of the
> most recent panic are below.
>
> In order to rule out problems we have:
>         - Replaced both Dell R610s (have 4 now in total, all with the
> same problem)
>         - Upgraded to Debian Jessie from Debian Wheezy
>         - Now running xen-hypervisor-4.4-amd64, DRBD Debian version
> 8.9.2~rc1-2, kernel 3.16.0-4
>         - Switched from the on-board Broadcom NICs to an Intel E1G44HTBLK
> 4-port PCI-e NIC
>         - Upgraded to igb kernel module 5.2.17 and rebuilt it into the
> initrd as well
>
>
> The 2 servers both have lots of resources (64 GB of RAM, quad Xeon 2.4,
> 6 x 1 TB drives in RAID 10).  There is a crossover cable on ETH3 for
> DRBD, and each DRBD instance runs on its own port on ETH3.  The Xen
> config runs on a bridge.
>
> The problem has more or less been the same as we’ve moved through all of
> the hardware and software versions over the past 6 months.  It rotates
> between the servers.
>
> I am hoping someone can spot a problem in our config, or guide us on what
> to try from here.  All 4 Dell machines have been patched and have had the
> diagnostics run on them without issue.
>
> The problem: one of the machines will have a transmit queue time-out on
> an interface (oddly, not necessarily the DRBD interface, but usually).
> From there we get a panic, and the NIC will start going up and down.
> This then starts to drive the load up, and the machines soon become
> unresponsive over shell.  Connected over the DRAC remote access port,
> sooner or later we see errors about the drives not responding.  I think
> this is from the load, but I do not know for sure.  From this point the
> machine will sometimes drag down its paired DRBD machine, and sometimes
> not.  The one with the crash needs a hard reboot at this point.
>
> We love DRBD, its simplicity and functionality, but it introduces these
> frequent crashes, which make it not worth it.  Hoping someone can spot an
> error we are making here, or has ideas on what to try.
>
> Thanks in advance for any help.  FYI, this crash used to happen in
> the Broadcom queue, now it's the Intel queue, and only when we have DRBD
> enabled.
>
>
>
> Apr  9 03:39:17 v2 kernel: [141714.850432] ------------[ cut here
> ]------------
> Apr  9 03:39:17 v2 kernel: [141714.850521] WARNING: CPU: 0 PID: 0 at
> /build/linux-y7bjb0/linux-3.16.7-ckt4/net/sched/sch_generic.c:264
> dev_watchdog+0x236/0x240()
> Apr  9 03:39:17 v2 kernel: [141714.850527] NETDEV WATCHDOG: eth1 (igb):
> transmit queue 0 timed out
> Apr  9 03:39:17 v2 kernel: [141714.850531] Modules linked in: xt_tcpudp
> xt_physdev iptable_filter ip_tables x_tables xen_netback xen_blkback
> nfnetlink_queue nfnetlink_log nfnetlink bluetooth 6lowpan_iphc rfkill
> xen_gntdev xen_evtchn
> xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl nfs lockd
> fscache sunrpc bridge stp llc ttm drm_kms_helper joydev drm i2c_algo_bit
> i2c_core pcspkr wmi iTCO_wdt iTCO_vendor_support psmouse dcdbas serio_raw
> evdev tpm_tis
> tpm lpc_ich mfd_core acpi_power_meter button coretemp i7core_edac
> edac_core shpchp processor thermal_sys loop ipmi_watchdog ipmi_si
> ipmi_poweroff ipmi_devintf ipmi_msghandler drbd lru_cache libcrc32c autofs4
> ext4 crc16 mbcache
> jbd2 dm_mod sg sd_mod crc_t10dif crct10dif_generic sr_mod cdrom ses
> crct10dif_common enclosure ata_generic hid_generic usbhid hid crc32c_intel
> ata_piix ehci_pci uhci_hcd libata igb(O) megaraid_sas ehci_hcd scsi_mod
> usbcore dca ptp
> usb_common pps_core
> Apr  9 03:39:17 v2 kernel: [141714.850609] CPU: 0 PID: 0 Comm: swapper/0
> Tainted: G           O  3.16.0-4-amd64 #1 Debian 3.16.7-ckt4-3
> Apr  9 03:39:17 v2 kernel: [141714.850613] Hardware name: Dell Inc.
> PowerEdge R610/0XDN97, BIOS 6.4.0 07/23/2013
> Apr  9 03:39:17 v2 kernel: [141714.850617]  0000000000000009
> ffffffff815096a7 ffff880079e03e28 ffffffff810676f7
> Apr  9 03:39:17 v2 kernel: [141714.850622]  0000000000000000
> ffff880079e03e78 0000000000000010 0000000000000000
> Apr  9 03:39:17 v2 kernel: [141714.850626]  ffff8800445c8000
> ffffffff8106775c ffffffff81777270 ffffffff00000030
> Apr  9 03:39:17 v2 kernel: [141714.850631] Call Trace:
> Apr  9 03:39:17 v2 kernel: [141714.850635]  <IRQ>  [<ffffffff815096a7>] ?
> dump_stack+0x41/0x51
> Apr  9 03:39:17 v2 kernel: [141714.850652]  [<ffffffff810676f7>] ?
> warn_slowpath_common+0x77/0x90
> Apr  9 03:39:17 v2 kernel: [141714.850660]  [<ffffffff8106775c>] ?
> warn_slowpath_fmt+0x4c/0x50
> Apr  9 03:39:17 v2 kernel: [141714.850669]  [<ffffffff81074647>] ?
> mod_timer+0x127/0x1e0
> Apr  9 03:39:17 v2 kernel: [141714.850676]  [<ffffffff8143ce76>] ?
> dev_watchdog+0x236/0x240
> Apr  9 03:39:17 v2 kernel: [141714.850681]  [<ffffffff8143cc40>] ?
> dev_graft_qdisc+0x70/0x70
> Apr  9 03:39:17 v2 kernel: [141714.850686]  [<ffffffff810729b1>] ?
> call_timer_fn+0x31/0x100
> Apr  9 03:39:17 v2 kernel: [141714.850691]  [<ffffffff8143cc40>] ?
> dev_graft_qdisc+0x70/0x70
> Apr  9 03:39:17 v2 kernel: [141714.850698]  [<ffffffff81073fe9>] ?
> run_timer_softirq+0x209/0x2f0
> Apr  9 03:39:17 v2 kernel: [141714.850704]  [<ffffffff8106c591>] ?
> __do_softirq+0xf1/0x290
> Apr  9 03:39:17 v2 kernel: [141714.850709]  [<ffffffff8106c965>] ?
> irq_exit+0x95/0xa0
> Apr  9 03:39:17 v2 kernel: [141714.850718]  [<ffffffff813579c5>] ?
> xen_evtchn_do_upcall+0x35/0x50
> Apr  9 03:39:17 v2 kernel: [141714.850725]  [<ffffffff8151141e>] ?
> xen_do_hypervisor_callback+0x1e/0x30
> Apr  9 03:39:17 v2 kernel: [141714.850728]  <EOI>  [<ffffffff810013aa>] ?
> xen_hypercall_sched_op+0xa/0x20
> Apr  9 03:39:17 v2 kernel: [141714.850737]  [<ffffffff810013aa>] ?
> xen_hypercall_sched_op+0xa/0x20
> Apr  9 03:39:17 v2 kernel: [141714.850746]  [<ffffffff81009e0c>] ?
> xen_safe_halt+0xc/0x20
> Apr  9 03:39:17 v2 kernel: [141714.850756]  [<ffffffff8101c959>] ?
> default_idle+0x19/0xb0
> Apr  9 03:39:17 v2 kernel: [141714.850764]  [<ffffffff810a7dc0>] ?
> cpu_startup_entry+0x340/0x400
> Apr  9 03:39:17 v2 kernel: [141714.850770]  [<ffffffff81902071>] ?
> start_kernel+0x492/0x49d
> Apr  9 03:39:17 v2 kernel: [141714.850775]  [<ffffffff81901a04>] ?
> set_init_arg+0x4e/0x4e
> Apr  9 03:39:17 v2 kernel: [141714.850781]  [<ffffffff81903f64>] ?
> xen_start_kernel+0x569/0x573
> Apr  9 03:39:17 v2 kernel: [141714.850785] ---[ end trace ee11063cf033829a
> ]---
> Apr  9 03:39:17 v2 kernel: [141714.871945] br1: port 1(eth1) entered
> disabled state
> Apr  9 03:39:20 v2 kernel: [141718.210743] igb 0000:05:00.1 eth1: igb:
> eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> Apr  9 03:39:20 v2 kernel: [141718.210913] br1: port 1(eth1) entered
> forwarding state
> Apr  9 03:39:20 v2 kernel: [141718.210923] br1: port 1(eth1) entered
> forwarding state
> Apr  9 03:39:26 v2 kernel: [141723.863194] br1: port 1(eth1) entered
> disabled state
> Apr  9 03:39:30 v2 kernel: [141727.650897] igb 0000:05:00.1 eth1: igb:
> eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> Apr  9 03:39:30 v2 kernel: [141727.651040] br1: port 1(eth1) entered
> forwarding state
> Apr  9 03:39:30 v2 kernel: [141727.651053] br1: port 1(eth1) entered
> forwarding state
> Apr  9 03:39:31 v2 kernel: [141728.890509] ata1: lost interrupt (Status
> 0x50)
> Apr  9 03:39:31 v2 kernel: [141728.890560] sr 1:0:0:0: CDB:
> Apr  9 03:39:31 v2 kernel: [141728.890563] Get event status notification:
> 4a 01 00 00 10 00 00 00 08 00
> Apr  9 03:39:31 v2 kernel: [141728.890630] ata1: hard resetting link
> Apr  9 03:39:31 v2 kernel: [141729.366592] ata1: SATA link up 1.5 Gbps
> (SStatus 113 SControl 300)
> Apr  9 03:39:32 v2 kernel: [141729.406749] ata1.00: configured for UDMA/100
> Apr  9 03:39:32 v2 kernel: [141729.408192] ata1: EH complete
> Apr  9 03:39:35 v2 kernel: [141732.711653] br1: port 1(eth1) entered
> disabled state
> Apr  9 03:39:37 v2 kernel: [141734.678485] drbd s3: peer( Primary ->
> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr  9 03:39:37 v2 kernel: [141734.678846] drbd s3: asender terminated
> Apr  9 03:39:37 v2 kernel: [141734.678852] drbd s3: Terminating drbd_a_s3
> Apr  9 03:39:37 v2 kernel: [141734.678956] drbd s3: Connection closed
> Apr  9 03:39:37 v2 kernel: [141734.678972] drbd s3: conn( NetworkFailure
> -> Unconnected )
> Apr  9 03:39:37 v2 kernel: [141734.678974] drbd s3: receiver terminated
> Apr  9 03:39:37 v2 kernel: [141734.678976] drbd s3: Restarting receiver
> thread
> Apr  9 03:39:37 v2 kernel: [141734.678977] drbd s3: receiver (re)started
> Apr  9 03:39:37 v2 kernel: [141734.678987] drbd s3: conn( Unconnected ->
> WFConnection )
> Apr  9 03:39:38 v2 kernel: [141735.718898] igb 0000:05:00.1 eth1: igb:
> eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> Apr  9 03:39:38 v2 kernel: [141735.719086] br1: port 1(eth1) entered
> forwarding state
> Apr  9 03:39:38 v2 kernel: [141735.719095] br1: port 1(eth1) entered
> forwarding state
> Apr  9 03:39:39 v2 kernel: [141737.154575] drbd s4: peer( Secondary ->
> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr  9 03:39:39 v2 kernel: [141737.154671] block drbd1: new current UUID
> 461FF401E0489AAB:9279A3BA4A3A710B:0E977CC4BB5727A9:0E967CC4BB5727A9
> Apr  9 03:39:39 v2 kernel: [141737.154921] drbd s4: asender terminated
> Apr  9 03:39:39 v2 kernel: [141737.154928] drbd s4: Terminating drbd_a_s4
> Apr  9 03:39:39 v2 kernel: [141737.155289] drbd s4: Connection closed
> Apr  9 03:39:39 v2 kernel: [141737.155579] drbd s4: conn( NetworkFailure
> -> Unconnected )
> Apr  9 03:39:39 v2 kernel: [141737.155583] drbd s4: receiver terminated
> Apr  9 03:39:39 v2 kernel: [141737.155585] drbd s4: Restarting receiver
> thread
> Apr  9 03:39:39 v2 kernel: [141737.155586] drbd s4: receiver (re)started
> Apr  9 03:39:39 v2 kernel: [141737.155601] drbd s4: conn( Unconnected ->
> WFConnection )
> Apr  9 03:39:41 v2 kernel: [141738.458578] drbd n5: peer( Secondary ->
> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr  9 03:39:41 v2 kernel: [141738.458671] block drbd8: new current UUID
> 808265F24E5A3F21:B63FFF468380B383:240D9C7D536ACB97:240C9C7D536ACB97
> Apr  9 03:39:41 v2 kernel: [141738.458885] drbd n5: asender terminated
> Apr  9 03:39:41 v2 kernel: [141738.458893] drbd n5: Terminating drbd_a_n5
> Apr  9 03:39:41 v2 kernel: [141738.459160] drbd n5: Connection closed
> Apr  9 03:39:41 v2 kernel: [141738.459316] drbd n5: conn( NetworkFailure
> -> Unconnected )
> Apr  9 03:39:41 v2 kernel: [141738.459319] drbd n5: receiver terminated
> Apr  9 03:39:41 v2 kernel: [141738.459321] drbd n5: Restarting receiver
> thread
> Apr  9 03:39:41 v2 kernel: [141738.459322] drbd n5: receiver (re)started
> Apr  9 03:39:41 v2 kernel: [141738.459336] drbd n5: conn( Unconnected ->
> WFConnection )
> Apr  9 03:39:44 v2 kernel: [141742.202552] drbd r1: peer( Primary ->
> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr  9 03:39:44 v2 kernel: [141742.202913] drbd r1: asender terminated
> Apr  9 03:39:44 v2 kernel: [141742.202920] drbd r1: Terminating drbd_a_r1
> Apr  9 03:39:44 v2 kernel: [141742.203023] drbd r1: Connection closed
> Apr  9 03:39:44 v2 kernel: [141742.203039] drbd r1: conn( NetworkFailure
> -> Unconnected )
> Apr  9 03:39:44 v2 kernel: [141742.203041] drbd r1: receiver terminated
> Apr  9 03:39:44 v2 kernel: [141742.203043] drbd r1: Restarting receiver
> thread
> Apr  9 03:39:44 v2 kernel: [141742.203044] drbd r1: receiver (re)started
> Apr  9 03:39:44 v2 kernel: [141742.203054] drbd r1: conn( Unconnected ->
> WFConnection )
>
>
>
> Etc.
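
To see whether the DRBD disconnects always follow a NIC watchdog timeout
(as they do in the trace above, roughly 20 seconds later), a rough sketch
like the following could be run over the kernel log. The regular
expressions are modelled only on the lines quoted in this thread, the
function names are made up for illustration, and the 60 second window is
an arbitrary assumption. It expects the original unwrapped log lines, not
the mail-wrapped copy above.

```python
import re

# Kernel uptime stamp in square brackets, e.g. "[141714.850527]".
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]\s*(.*)")
# Patterns copied from the message formats quoted in this thread.
WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (\S+) \((\S+)\): transmit queue (\d+) timed out"
)
DRBD_LOSS = re.compile(r"drbd (\S+): .*conn\( Connected -> NetworkFailure \)")


def scan(lines):
    """Return (watchdogs, losses): lists of (uptime_seconds, name) tuples."""
    watchdogs, losses = [], []
    for line in lines:
        stamped = STAMP.search(line)
        if not stamped:
            continue
        stamp, message = float(stamped.group(1)), stamped.group(2)
        watchdog = WATCHDOG.search(message)
        if watchdog:
            watchdogs.append((stamp, watchdog.group(1)))  # interface name
            continue
        loss = DRBD_LOSS.search(message)
        if loss:
            losses.append((stamp, loss.group(1)))  # DRBD resource name
    return watchdogs, losses


def losses_after(watchdogs, losses, window=60.0):
    """Map each watchdog event to the DRBD resources lost within `window` s."""
    return {
        (stamp, iface): [res for t, res in losses if 0 <= t - stamp <= window]
        for stamp, iface in watchdogs
    }
```

For example, assuming the log lives in /var/log/kern.log (adjust for your
syslog setup), calling scan() on the open file and passing the two lists to
losses_after() gives one entry per watchdog timeout. If disconnects show up
with no preceding timeout, the NIC is probably not the whole story.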



-- 
Adam Randall
http://www.xaren.net
AIM: blitz574
Twitter: @randalla0622

"To err is human... to really foul up requires the root password."