Hello,

We have a two-node setup running CentOS 5.4, Xen 3.0 (CentOS RPMs) and DRBD 8.3.2 (also a CentOS RPM). Both servers are Dell PowerEdge 1950s with two quad-core Xeon processors and 32GB of memory. The network card used by DRBD is an Intel 82571EB Gigabit Ethernet card (e1000 driver); the two nodes are connected directly with a crossover cable.
DRBD is configured with a single resource (drbd0), on which I have created an LVM volume group that is then split into two LVs. Both LVs are mapped into my Xen VM (PV) as its sda and sdb disks.
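In case the exact layering matters, the stack was built roughly like this; the VG/LV names and sizes below are placeholders, not our actual values:

```shell
# LVM sits on top of the replicated device (names/sizes are placeholders)
pvcreate /dev/drbd0
vgcreate vg_drbd /dev/drbd0
lvcreate -L 100G -n lv_vm_sda vg_drbd
lvcreate -L 100G -n lv_vm_sdb vg_drbd
# The two LVs are then exported to the domU in its config file, e.g.:
#   disk = [ 'phy:/dev/vg_drbd/lv_vm_sda,sda,w',
#            'phy:/dev/vg_drbd/lv_vm_sdb,sdb,w' ]
```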
Recently, we've had issues where the node in the Primary role (and hence running the VM) locks up and throws a kernel panic. The situation seems to point at DRBD and/or the network stack, because the problem does not occur if we disconnect the DRBD resource.

Even worse, the problem occurs very quickly after we connect the DRBD resource, either during resynchronization after being out of sync for a while or during normal replication. No errors show up on the network interface (ifconfig, ethtool).
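The checks we ran look roughly like the following; the interface name eth1 is an assumption here (it stands for whichever interface carries the DRBD link):

```shell
# Per-driver error counters for the DRBD link (eth1 is a placeholder name)
ethtool -S eth1 | egrep -i 'err|drop|crc'
# Current offload settings (tx/rx checksumming, TSO)
ethtool -k eth1
# Since the panic is in a checksum path, disabling offload is one experiment
# we are considering -- a hypothesis, not something we have confirmed:
ethtool -K eth1 tx off rx off tso off
```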
One thing to note is that the kernel panic complains about checksum functions, so that might be related (see below).

Here is the relevant information:
# rpm -qa | grep -e xen -e drbd
drbd83-8.3.2-6.el5_3
kmod-drbd83-xen-8.3.2-6.el5_3
xen-3.0.3-94.el5
kernel-xen-2.6.18-164.el5
xen-libs-3.0.3-94.el5
# cat /etc/drbd.conf
global {
    usage-count no;
}
common {
    protocol C;
    syncer {
        rate 33M;
        verify-alg crc32c;
        al-extents 1801;
    }
    net {
        cram-hmac-alg sha1;
        max-epoch-size 8192;
        max-buffers 8192;
    }
    disk {
        on-io-error detach;
        no-disk-flushes;
        no-disk-barrier;
        no-md-flushes;
    }
}
resource drbd0 {
    device /dev/drbd0;
    disk /dev/sda6;
    flexible-meta-disk internal;
    on node1 {
        address 10.11.1.1:7788;
    }
    on node2 {
        address 10.11.1.2:7788;
    }
}
### Kernel Panic ###
Unable to handle kernel paging request at ffff880011e3cc64
RIP: [<ffffffff80212bad>] csum_partial+0x56/0x4bc
PGD ed8067 PUD ed9067 PMD f69067 PTE 0
Oops: 0000 [1] SMP
last sysfs file: /class/scsi_host/host0/proc_name
CPU 0
Modules linked in: xt_physdev netconsole drbd(U) netloop netbk blktap blkbk
 ipt_MASQUERADE iptable_nat ip_nat bridge ipv6 xfrm_nalgo crypto_api xt_tcpudp
 xt_state ip_conntrack_irc xt_conntrack ip_conntrack_ftp xt_mac xt_length
 xt_limit xt_multiport ipt_ULOG ipt_TCPMSS ipt_TOS ipt_ttl ipt_owner ipt_REJECT
 ipt_ecn ipt_LOG ipt_recent ip_conntrack iptable_mangle iptable_filter
 ip_tables nfnetlink x_tables autofs4 dm_mirror dm_multipath scsi_dh video
 hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp
 parport joydev ide_cd e1000e cdrom serial_core i5000_edac edac_mc bnx2
 serio_raw pcspkr sg dm_raid45 dm_message dm_region_hash dm_log dm_mod
 dm_mem_cache ata_piix libata shpchp megaraid_sas sd_mod scsi_mod ext3 jbd
 uhci_hcd ohci_hcd ehci_hcd
Pid: 12887, comm: drbd0_receiver Tainted: G 2.6.18-128.1.16.el5xen #1
RIP: e030:[<ffffffff80212bad>]
[<ffffffff80212bad>] csum_partial+0x56/0x4bc
RSP: e02b:ffff88000c347718 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880010ced500
RDX: 00000000000000e7 RSI: 000000000000039c RDI: ffff880011e3cc64
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000025b85e7c R11: 0000000000000002 R12: 0000000000000028
R13: 0000000000000028 R14: ffff88001c56f7b0 R15: 0000000025b85e7c
FS: 00002b391e123f60(0000) GS:ffffffff805ba000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process drbd0_receiver (pid: 12887, threadinfo ffff88000c346000, task ffff88001c207820)
Stack: 000000000000039c 00000000000005b4 ffffffff8023d496 ffff88001e7e48d8
 0000001400000000 ffff8800000003c4 ffff88001c56f7b0 ffff88001e7e48d8
 ffff88001e7e48ec ffff88000c3478e8
Call Trace:
[<ffffffff8023d496>] skb_checksum+0x11b/0x260
[<ffffffff80411472>] skb_checksum_help+0x71/0xd0
[<ffffffff8853f33e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
[<ffffffff8853f6cf>] :iptable_nat:ip_nat_local_fn+0x32/0xb7
[<ffffffff8023550c>] nf_iterate+0x41/0x7d
[<ffffffff8042f004>] dst_output+0x0/0xe
[<ffffffff80258b28>] nf_hook_slow+0x58/0xbc
[<ffffffff8042f004>] dst_output+0x0/0xe
[<ffffffff802359ab>] ip_queue_xmit+0x41c/0x48c
[<ffffffff8022c1cb>] local_bh_enable+0x9/0xa5
[<ffffffff8020b6b7>] kmem_cache_alloc+0x62/0x6d
[<ffffffff8023668d>] alloc_skb_from_cache+0x74/0x13c
[<ffffffff80222a0b>] tcp_transmit_skb+0x62f/0x667
[<ffffffff8043903a>] tcp_retransmit_skb+0x53d/0x638
[<ffffffff80439353>] tcp_xmit_retransmit_queue+0x21e/0x2bb
[<ffffffff80225cff>] tcp_ack+0x1705/0x1879
[<ffffffff8021c6b1>] tcp_rcv_established+0x804/0x925
[<ffffffff80263710>] schedule_timeout+0x1e/0xad
[<ffffffff8023cef3>] tcp_v4_do_rcv+0x2a/0x2fa
[<ffffffff8040bbfe>] sk_wait_data+0xac/0xbf
[<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
[<ffffffff80434f71>] tcp_prequeue_process+0x65/0x78
[<ffffffff8021dd39>] tcp_recvmsg+0x492/0xb1f
[<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
[<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
[<ffffffff80231c18>] sock_recvmsg+0x101/0x120
[<ffffffff80231c18>] sock_recvmsg+0x101/0x120
[<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
[<ffffffff80343366>] swiotlb_map_sg+0xf7/0x205
[<ffffffff880b563c>] :megaraid_sas:megasas_make_sgl64+0x78/0xa9
[<ffffffff880b61bc>] :megaraid_sas:megasas_queue_command+0x343/0x3ed
[<ffffffff884e119f>] :drbd:drbd_recv+0x7b/0x109
[<ffffffff884e53b2>] :drbd:receive_DataRequest+0x3b/0x655
[<ffffffff884e1c4b>] :drbd:drbdd+0x77/0x152
[<ffffffff884e4870>] :drbd:drbdd_init+0xea/0x1dc
[<ffffffff884f432a>] :drbd:drbd_thread_setup+0xa2/0x18b
[<ffffffff80260b2c>] child_rip+0xa/0x12
[<ffffffff884f4288>] :drbd:drbd_thread_setup+0x0/0x18b
[<ffffffff80260b22>] child_rip+0x0/0x12
Code: 44 8b 0f ff ca 83 ee 04 48 83 c7 04 4d 01 c8 41 89 d2 41 89
RIP [<ffffffff80212bad>] csum_partial+0x56/0x4bc
RSP <ffff88000c347718>
CR2: ffff880011e3cc64
Kernel panic - not syncing: Fatal exception
#######
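For what it's worth, the faulting frame can be pulled out of the saved oops text mechanically, which is how we isolated csum_partial above:

```shell
# Extract "symbol+offset/length" from a saved oops line (sketch)
oops='[<ffffffff80212bad>] csum_partial+0x56/0x4bc'
sym=$(echo "$oops" | sed -n 's/.*\] \([a-z_]*+0x[0-9a-f]*\/0x[0-9a-f]*\).*/\1/p')
echo "$sym"   # -> csum_partial+0x56/0x4bc
# With the matching kernel-debuginfo package installed, the raw address
# should resolve to a source line:
#   addr2line -e /usr/lib/debug/lib/modules/$(uname -r)/vmlinux ffffffff80212bad
```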
Any ideas on how to diagnose this properly and eventually find the culprit?
Regards,
--
Jean-François Chevrette [iWeb]