[DRBD-user] Kernel Panic occurring when drbd is up & (re)syncing

Thu Nov 12 18:26:14 CET 2009

It appears that there is currently a problem with the latest 
CentOS/Redhat kernel. We have noticed the same problem when using LVM 
snapshots and a backup technology called R1Soft CDP.

Some related info:
http://bugs.centos.org/view.php?id=3869
forum.r1soft.com/showthread.php?t=1158

There is no sign of a matching bug report at bugzilla.redhat.com.

For now we have reverted to kernel-2.6.18-128.7.1, on which we have not 
had any issues for the past 4 hours. Previously, the kernel panic would 
occur a few seconds after starting a 'drbdadm verify'.

DRBD devs might want to check it out.
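
For anyone triaging the oops quoted below, here is a minimal Python sketch
(not part of any DRBD or kernel tooling) that pulls two things out of a
2.6.18-style oops: the loadable modules that appear in the call trace, and
the general-purpose registers whose value equals the faulting address in
CR2. The function names, the regular expressions and the shortened SAMPLE
excerpt are my own assumptions for illustration. The frame format is the one
seen in this particular dump and may differ on other kernels.

```python
import re

# Frame format seen in this oops: "[<addr>] :module:symbol+0xoff/0xsize".
# The ":module:" part is absent for symbols built into the kernel image.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+(?::(?P<module>\w+):)?(?P<symbol>[\w.]+)\+0x"
)
# Plain "NAME: 16-hex-digit value" pairs; RIP/RSP lines carry a segment
# prefix (e.g. "e030:") and are intentionally not matched.
REG_RE = re.compile(r"\b(R[A-Z0-9]{1,2}|CR2):\s*([0-9a-f]{16})\b")

# Shortened excerpt of the oops quoted in this thread (illustration only).
SAMPLE = """\
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880010ced500
RDX: 00000000000000e7 RSI: 000000000000039c RDI: ffff880011e3cc64
Call Trace:
 [<ffffffff8023d496>] skb_checksum+0x11b/0x260
 [<ffffffff8853f33e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
 [<ffffffff880b563c>] :megaraid_sas:megasas_make_sgl64+0x78/0xa9
 [<ffffffff884e119f>] :drbd:drbd_recv+0x7b/0x109
CR2: ffff880011e3cc64
"""


def modules_in_trace(oops_text):
    """Return module names in the order they first appear in the frames."""
    seen = []
    for match in FRAME_RE.finditer(oops_text):
        module = match.group("module")
        if module and module not in seen:
            seen.append(module)
    return seen


def registers_matching_fault(oops_text):
    """Return the registers whose value equals the faulting address (CR2)."""
    regs = dict(REG_RE.findall(oops_text))
    fault = regs.pop("CR2", None)
    return sorted(name for name, value in regs.items() if value == fault)


if __name__ == "__main__":
    print("modules in trace:", modules_in_trace(SAMPLE))
    print("registers equal to CR2:", registers_matching_fault(SAMPLE))
```

On the full dump this reports iptable_nat, megaraid_sas and drbd as the
modules in the trace, and RDI as the only register equal to CR2. That would
be consistent with csum_partial reading from an unmapped buffer address
(PTE 0 in the dump) while the packet passes through the NAT hook, but treat
it as a hint only: traces from this kernel generation can contain stale
stack entries.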

Regards,
-- 
Jean-François Chevrette [iWeb]

On 09-11-09 10:20 AM, Jean-Francois Chevrette wrote:
> Hello,
>
> we have a two-node setup running CentOS 5.4, Xen 3.0 (CentOS RPMs)
> and DRBD 8.3.2 (again a CentOS RPM). Both servers are Dell PowerEdge
> 1950s with two quad-core Xeon processors and 32GB of memory. The
> network card used by DRBD is an Intel 82571EB Gigabit Ethernet card
> (e1000 driver). The two nodes are connected directly with a
> crossover cable.
>
> DRBD is configured with one resource (drbd0), on top of which I have
> created an LVM volume group that is split into two LVs. Both LVs
> are mapped to my Xen VM (PV) as the sda and sdb disks.
>
> Recently, we've had issues where the node that is in Primary state and
> hence running the VM locks up and throws a kernel panic. The situation
> seems to indicate that this might be a problem related to DRBD and/or
> the network stack because if we disconnect the DRBD resource, this
> problem will not occur.
>
> Even worse, the problem occurs very quickly after we connect the DRBD
> resource, either during resynchronization after being out-of-sync for a
> while or during normal syncing operations. No errors show up on the
> network interface (ifconfig, ethtool).
>
> One thing to note is that the kernel panic seems to complain about
> checksum functions, so that might be related (see below).
>
> Here is the relevant information:
>
> # rpm -qa | grep -e xen -e drbd
> drbd83-8.3.2-6.el5_3
> kmod-drbd83-xen-8.3.2-6.el5_3
> xen-3.0.3-94.el5
> kernel-xen-2.6.18-164.el5
> xen-libs-3.0.3-94.el5
>
> # cat /etc/drbd.conf
> global {
>     usage-count no;
> }
>
> common {
>     protocol C;
>
>     syncer {
>         rate 33M;
>         verify-alg crc32c;
>         al-extents 1801;
>     }
>     net {
>         cram-hmac-alg sha1;
>         max-epoch-size 8192;
>         max-buffers 8192;
>     }
>
>     disk {
>         on-io-error detach;
>         no-disk-flushes;
>         no-disk-barrier;
>         no-md-flushes;
>     }
> }
>
> resource drbd0 {
>     device /dev/drbd0;
>     disk /dev/sda6;
>     flexible-meta-disk internal;
>     on node1 {
>         address 10.11.1.1:7788;
>     }
>     on node2 {
>         address 10.11.1.2:7788;
>     }
> }
>
> ### Kernel Panic ###
> Unable to handle kernel paging request at ffff880011e3cc64 RIP:
>  [<ffffffff80212bad>] csum_partial+0x56/0x4bc
> PGD ed8067 PUD ed9067 PMD f69067 PTE 0
> Oops: 0000 [1] SMP
> last sysfs file: /class/scsi_host/host0/proc_name
> CPU 0
> Modules linked in: xt_physdev netconsole drbd(U) netloop netbk blktap
>  blkbk ipt_MASQUERADE iptable_nat ip_nat bridge ipv6 xfrm_nalgo
>  crypto_api xt_tcpudp xt_state ip_conntrack_irc xt_conntrack
>  ip_conntrack_ftp xt_mac xt_length xt_limit xt_multiport ipt_ULOG
>  ipt_TCPMSS ipt_TOS ipt_ttl ipt_owner ipt_REJECT ipt_ecn ipt_LOG
>  ipt_recent ip_conntrack iptable_mangle iptable_filter ip_tables
>  nfnetlink x_tables autofs4 dm_mirror dm_multipath scsi_dh video hwmon
>  backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc
>  lp parport joydev ide_cd e1000e cdrom serial_core i5000_edac edac_mc
>  bnx2 serio_raw pcspkr sg dm_raid45 dm_message dm_region_hash dm_log
>  dm_mod dm_mem_cache ata_piix libata shpchp megaraid_sas sd_mod
>  scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
>
> Pid: 12887, comm: drbd0_receiver Tainted: G 2.6.18-128.1.16.el5xen #1
> RIP: e030:[<ffffffff80212bad>]  [<ffffffff80212bad>] csum_partial+0x56/0x4bc
> RSP: e02b:ffff88000c347718 EFLAGS: 00010202
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880010ced500
> RDX: 00000000000000e7 RSI: 000000000000039c RDI: ffff880011e3cc64
> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000025b85e7c R11: 0000000000000002 R12: 0000000000000028
> R13: 0000000000000028 R14: ffff88001c56f7b0 R15: 0000000025b85e7c
> FS: 00002b391e123f60(0000) GS:ffffffff805ba000(0000) knlGS:0000000000000000
> CS: e033 DS: 0000 ES: 0000
> Process drbd0_receiver (pid: 12887, threadinfo ffff88000c346000, task ffff88001c207820)
> Stack:  000000000000039c 00000000000005b4 ffffffff8023d496 ffff88001e7e48d8
>  0000001400000000 ffff8800000003c4 ffff88001c56f7b0 ffff88001e7e48d8
>  ffff88001e7e48ec ffff88000c3478e8
>
> Call Trace:
> [<ffffffff8023d496>] skb_checksum+0x11b/0x260
> [<ffffffff80411472>] skb_checksum_help+0x71/0xd0
> [<ffffffff8853f33e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
> [<ffffffff8853f6cf>] :iptable_nat:ip_nat_local_fn+0x32/0xb7
> [<ffffffff8023550c>] nf_iterate+0x41/0x7d
> [<ffffffff8042f004>] dst_output+0x0/0xe
> [<ffffffff80258b28>] nf_hook_slow+0x58/0xbc
> [<ffffffff8042f004>] dst_output+0x0/0xe
> [<ffffffff802359ab>] ip_queue_xmit+0x41c/0x48c
> [<ffffffff8022c1cb>] local_bh_enable+0x9/0xa5
> [<ffffffff8020b6b7>] kmem_cache_alloc+0x62/0x6d
> [<ffffffff8023668d>] alloc_skb_from_cache+0x74/0x13c
> [<ffffffff80222a0b>] tcp_transmit_skb+0x62f/0x667
> [<ffffffff8043903a>] tcp_retransmit_skb+0x53d/0x638
> [<ffffffff80439353>] tcp_xmit_retransmit_queue+0x21e/0x2bb
> [<ffffffff80225cff>] tcp_ack+0x1705/0x1879
> [<ffffffff8021c6b1>] tcp_rcv_established+0x804/0x925
> [<ffffffff80263710>] schedule_timeout+0x1e/0xad
> [<ffffffff8023cef3>] tcp_v4_do_rcv+0x2a/0x2fa
> [<ffffffff8040bbfe>] sk_wait_data+0xac/0xbf
> [<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
> [<ffffffff80434f71>] tcp_prequeue_process+0x65/0x78
> [<ffffffff8021dd39>] tcp_recvmsg+0x492/0xb1f
> [<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
> [<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
> [<ffffffff80231c18>] sock_recvmsg+0x101/0x120
> [<ffffffff80231c18>] sock_recvmsg+0x101/0x120
> [<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
> [<ffffffff80343366>] swiotlb_map_sg+0xf7/0x205
> [<ffffffff880b563c>] :megaraid_sas:megasas_make_sgl64+0x78/0xa9
> [<ffffffff880b61bc>] :megaraid_sas:megasas_queue_command+0x343/0x3ed
> [<ffffffff884e119f>] :drbd:drbd_recv+0x7b/0x109
> [<ffffffff884e53b2>] :drbd:receive_DataRequest+0x3b/0x655
> [<ffffffff884e1c4b>] :drbd:drbdd+0x77/0x152
> [<ffffffff884e4870>] :drbd:drbdd_init+0xea/0x1dc
> [<ffffffff884f432a>] :drbd:drbd_thread_setup+0xa2/0x18b
> [<ffffffff80260b2c>] child_rip+0xa/0x12
> [<ffffffff884f4288>] :drbd:drbd_thread_setup+0x0/0x18b
> [<ffffffff80260b22>] child_rip+0x0/0x12
>
>
> Code: 44 8b 0f ff ca 83 ee 04 48 83 c7 04 4d 01 c8 41 89 d2 41 89
> RIP  [<ffffffff80212bad>] csum_partial+0x56/0x4bc
>  RSP <ffff88000c347718>
> CR2: ffff880011e3cc64
>
> Kernel panic - not syncing: Fatal exception
> #######
>
>
> Any ideas on how to diagnose this properly and eventually find the culprit?
>
>
> Regards,