[DRBD-user] Kernel Panic occurring when drbd is up & (re)syncing

Mon Oct 4 18:12:14 CEST 2010

On Monday 04 October 2010 18:10:13 Bart Coninckx wrote:
> On Thursday 12 November 2009 18:26:14 Jean-Francois Chevrette wrote:
> > It appears that there is currently a problem with the latest
> > CentOS/Redhat kernel. We have noticed the same problem when using LVM
> > snapshots and a backup technology called R1Soft CDP.
> > 
> > Some related info:
> > http://bugs.centos.org/view.php?id=3869
> > forum.r1soft.com/showthread.php?t=1158
> > 
> > No sign of a bug at bugzilla.redhat.com
> > 
> > For now we have reverted to kernel-2.6.18-128.7.1 on which we did not
> > have any issues for the past 4 hours. Previously, a few seconds after
> > starting a 'drbdadm verify' the kernel panic would occur.
> > 
> > DRBD devs might want to check it out.
> > 
> > Regards,
> > 
> > > Hello,
> > > 
> > > here we have a two-node setup running CentOS 5.4, Xen 3.0
> > > (CentOS RPMs) and DRBD 8.3.2 (again CentOS RPM). Both servers are Dell
> > > PowerEdge 1950 servers with two Quad-Core Xeon processors and 32GB of
> > > memory. The network card used by DRBD is an Intel 82571EB Gigabit
> > > Ethernet card (e1000 driver). Both are connected directly with a
> > > crossover cable.
> > > 
> > > DRBD is configured so that I have one resource (drbd0) on which I have
> > > configured an LVM VolumeGroup which is then sliced into two LVs. Both LVs
> > > are mapped to my Xen VM (PV) as sda and sdb disks.
> > > 
> > > Recently, we've had issues where the node that is in Primary state and
> > > hence running the VM locks up and throws a kernel panic. The situation
> > > seems to indicate that this might be a problem related to DRBD and/or
> > > the network stack because if we disconnect the DRBD resource, this
> > > problem will not occur.
> > > 
> > > Even worse, the problem occurs very quickly after we connect the DRBD
> > > resource, either during resynchronization after being out-of-sync for a
> > > while or during normal syncing operations. No errors show up on the
> > > network interface (ifconfig, ethtool).
> > > 
> > > One thing to note is that the kernel panic seems to complain about
> > > checksum functions so that might be related (see below)
> > > 
> > > Here is the relevant information:
> > > 
> > > # rpm -qa | grep -e xen -e drbd
> > > drbd83-8.3.2-6.el5_3
> > > kmod-drbd83-xen-8.3.2-6.el5_3
> > > xen-3.0.3-94.el5
> > > kernel-xen-2.6.18-164.el5
> > > xen-libs-3.0.3-94.el5
> > > 
> > > # cat /etc/drbd.conf
> > > global {
> > >     usage-count no;
> > > }
> > >
> > > common {
> > >     protocol C;
> > >
> > >     syncer {
> > >         rate 33M;
> > >         verify-alg crc32c;
> > >         al-extents 1801;
> > >     }
> > >     net {
> > >         cram-hmac-alg sha1;
> > >         max-epoch-size 8192;
> > >         max-buffers 8192;
> > >     }
> > >
> > >     disk {
> > >         on-io-error detach;
> > >         no-disk-flushes;
> > >         no-disk-barrier;
> > >         no-md-flushes;
> > >     }
> > > }
> > >
> > > resource drbd0 {
> > >     device /dev/drbd0;
> > >     disk /dev/sda6;
> > >     flexible-meta-disk internal;
> > >     on node1 {
> > >         address 10.11.1.1:7788;
> > >     }
> > >     on node2 {
> > >         address 10.11.1.2:7788;
> > >     }
> > > }
> > > 
> > > ### Kernel Panic ###
> > > Unable to handle kernel paging request at ffff880011e3cc64 RIP:
> > > [<ffffffff80212bad>] csum_partial+0x56/0x4bc
> > > PGD ed8067 PUD ed9067 PMD f69067 PTE 0
> > > Oops: 0000 [1] SMP
> > > last sysfs file: /class/scsi_host/host0/proc_name
> > > CPU 0
> > > Modules linked in: xt_physdev netconsole drbd(U) netloop netbk blktap
> > > blkbk ipt_MASQUERADE iptable_nat ip_nat bridge ipv6 xfrm_nalgo
> > > crypto_api xt_tcpudp xt_state ip_conntrack_irc xt_conntrack
> > > ip_conntrack_ftp xt_mac xt_length xt_limit xt_multiport ipt_ULOG
> > > ipt_TCPMSS ipt_TOS ipt_ttl ipt_owner ipt_REJECT ipt_ecn ipt_LOG
> > > ipt_recent ip_conntrack iptable_mangle iptable_filter ip_tables
> > > nfnetlink x_tables autofs4 dm_mirror dm_multipath scsi_dh video hwmon
> > > backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp
> > > parport joydev ide_cd e1000e cdrom serial_core i5000_edac edac_mc bnx2
> > > serio_raw pcspkr sg dm_raid45 dm_message dm_region_hash dm_log dm_mod
> > > dm_mem_cache ata_piix libata shpchp megaraid_sas sd_mod scsi_mod ext3
> > > jbd uhci_hcd ohci_hcd ehci_hcd
> > > Pid: 12887, comm: drbd0_receiver Tainted: G 2.6.18-128.1.16.el5xen #1
> > > RIP: e030:[<ffffffff80212bad>] [<ffffffff80212bad>] csum_partial+0x56/0x4bc
> > > RSP: e02b:ffff88000c347718 EFLAGS: 00010202
> > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880010ced500
> > > RDX: 00000000000000e7 RSI: 000000000000039c RDI: ffff880011e3cc64
> > > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> > > R10: 0000000025b85e7c R11: 0000000000000002 R12: 0000000000000028
> > > R13: 0000000000000028 R14: ffff88001c56f7b0 R15: 0000000025b85e7c
> > > FS: 00002b391e123f60(0000) GS:ffffffff805ba000(0000)
> > > knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000
> > > Process drbd0_receiver (pid: 12887, threadinfo ffff88000c346000, task
> > > ffff88001c207820)
> > > Stack: 000000000000039c 00000000000005b4 ffffffff8023d496 ffff88001e7e48d8
> > > 0000001400000000 ffff8800000003c4 ffff88001c56f7b0 ffff88001e7e48d8
> > > ffff88001e7e48ec ffff88000c3478e8
> > > 
> > > Call Trace:
> > > [<ffffffff8023d496>] skb_checksum+0x11b/0x260
> > > [<ffffffff80411472>] skb_checksum_help+0x71/0xd0
> > > [<ffffffff8853f33e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
> > > [<ffffffff8853f6cf>] :iptable_nat:ip_nat_local_fn+0x32/0xb7
> > > [<ffffffff8023550c>] nf_iterate+0x41/0x7d
> > > [<ffffffff8042f004>] dst_output+0x0/0xe
> > > [<ffffffff80258b28>] nf_hook_slow+0x58/0xbc
> > > [<ffffffff8042f004>] dst_output+0x0/0xe
> > > [<ffffffff802359ab>] ip_queue_xmit+0x41c/0x48c
> > > [<ffffffff8022c1cb>] local_bh_enable+0x9/0xa5
> > > [<ffffffff8020b6b7>] kmem_cache_alloc+0x62/0x6d
> > > [<ffffffff8023668d>] alloc_skb_from_cache+0x74/0x13c
> > > [<ffffffff80222a0b>] tcp_transmit_skb+0x62f/0x667
> > > [<ffffffff8043903a>] tcp_retransmit_skb+0x53d/0x638
> > > [<ffffffff80439353>] tcp_xmit_retransmit_queue+0x21e/0x2bb
> > > [<ffffffff80225cff>] tcp_ack+0x1705/0x1879
> > > [<ffffffff8021c6b1>] tcp_rcv_established+0x804/0x925
> > > [<ffffffff80263710>] schedule_timeout+0x1e/0xad
> > > [<ffffffff8023cef3>] tcp_v4_do_rcv+0x2a/0x2fa
> > > [<ffffffff8040bbfe>] sk_wait_data+0xac/0xbf
> > > [<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
> > > [<ffffffff80434f71>] tcp_prequeue_process+0x65/0x78
> > > [<ffffffff8021dd39>] tcp_recvmsg+0x492/0xb1f
> > > [<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
> > > [<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
> > > [<ffffffff80231c18>] sock_recvmsg+0x101/0x120
> > > [<ffffffff80231c18>] sock_recvmsg+0x101/0x120
> > > [<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
> > > [<ffffffff80343366>] swiotlb_map_sg+0xf7/0x205
> > > [<ffffffff880b563c>] :megaraid_sas:megasas_make_sgl64+0x78/0xa9
> > > [<ffffffff880b61bc>] :megaraid_sas:megasas_queue_command+0x343/0x3ed
> > > [<ffffffff884e119f>] :drbd:drbd_recv+0x7b/0x109
> > > [<ffffffff884e53b2>] :drbd:receive_DataRequest+0x3b/0x655
> > > [<ffffffff884e1c4b>] :drbd:drbdd+0x77/0x152
> > > [<ffffffff884e4870>] :drbd:drbdd_init+0xea/0x1dc
> > > [<ffffffff884f432a>] :drbd:drbd_thread_setup+0xa2/0x18b
> > > [<ffffffff80260b2c>] child_rip+0xa/0x12
> > > [<ffffffff884f4288>] :drbd:drbd_thread_setup+0x0/0x18b
> > > [<ffffffff80260b22>] child_rip+0x0/0x12
> > > 
> > > 
> > > Code: 44 8b 0f ff ca 83 ee 04 48 83 c7 04 4d 01 c8 41 89 d2 41 89
> > > RIP [<ffffffff80212bad>] csum_partial+0x56/0x4bc
> > > RSP <ffff88000c347718>
> > > CR2: ffff880011e3cc64
> > > 
> > > Kernel panic - not syncing: Fatal exception
> > > #######
> > > 
> > > 
> > > Any ideas on how to diagnose this properly and eventually find the
> > > culprit?
> > > 
> > > 
> > > Regards,
> > 
> > _______________________________________________
> > drbd-user mailing list
> > drbd-user at lists.linbit.com
> > http://lists.linbit.com/mailman/listinfo/drbd-user
> 
> Jean-Francois,
> 
> thank you for this very elaborate and technically rich reply. I will
> certainly look into your suggestions about using Broadcom cards. I have
> one dual-port Broadcom card in this server, but I was using one of its
> ports combined with one port on an Intel e1000 dual-port NIC in balance-rr
> to provide a backup in the event a NIC goes down. Dual-port NICs usually
> share one chip for both ports, so in case of a problem with the chip,
> DRBD would be out completely. Reality shows this might be a bad idea
> though: a bonnie++ test against the backend storage (RAID5 on 15K rpm
> disks) gives me 255 MB/sec write performance, while the same test on the
> DRBD device drops to 77 MB/sec, even with the MTU set to 9000. It would be
> nice to get as close as possible to the theoretical maximum, so a lot
> needs to be done to get there.
> Step 1 would be changing everything to the Broadcom NIC. Any other
> suggestions?
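
For reference, a rough sketch of where that theoretical maximum sits. It assumes both ports are 1 Gbit/s links (the link speed is not stated above) and plain IPv4/TCP headers without options; the function name tcp_ceiling_mb_s is purely illustrative.

```python
# Rough upper bound on TCP payload throughput over Ethernet.
# Assumptions (not from the thread): 1 Gbit/s links, IPv4 + TCP headers
# without options, 1 MB = 10**6 bytes.

ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble+SFD, header, FCS, inter-frame gap (bytes)
IP_TCP_HEADERS = 20 + 20         # IPv4 header + TCP header, no options (bytes)


def tcp_ceiling_mb_s(link_gbit: float, mtu: int, links: int = 1) -> float:
    """Return the best-case TCP payload rate in MB/s for `links` equal links."""
    payload = mtu - IP_TCP_HEADERS         # bytes of user data per frame
    on_wire = mtu + ETH_OVERHEAD           # bytes the frame occupies on the wire
    raw_mb_s = link_gbit * 1000 / 8 * links
    return raw_mb_s * payload / on_wire


if __name__ == "__main__":
    for links in (1, 2):
        ceiling = tcp_ceiling_mb_s(1.0, 9000, links)
        print(f"{links} link(s), MTU 9000: {ceiling:.1f} MB/s")
    # Figures quoted above: 77 MB/sec on DRBD vs 255 MB/sec on the backend.
    print(f"DRBD vs backend: {77 / 255:.0%}")
```

Under those assumptions a single link tops out at roughly 124 MB/s of payload, so 77 MB/sec is well below even one link. The two-link figure is only an upper bound; balance-rr often falls short of it because of packet reordering.
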
> 
> Thanks a lot,
> 
> Bart
> 
> 
> 

Oops, completely wrong thread. Please disregard ...

B.