Hello,

We have a two-node setup running CentOS 5.4, Xen 3.0 (CentOS RPMs) and DRBD 8.3.2 (also a CentOS RPM). Both servers are Dell PowerEdge 1950s with two quad-core Xeon processors and 32GB of memory. The network card used by DRBD is an Intel 82571EB Gigabit Ethernet card (e1000 driver); the two nodes are connected directly with a crossover cable.
DRBD is configured with a single resource (drbd0), on which I have created an LVM volume group that is then split into two LVs. Both LVs are mapped into my Xen VM (PV) as its sda and sdb disks.
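In case the exact layering matters, the stack was built roughly like this; the VG/LV names and sizes below are placeholders, not our actual values:

```shell
# LVM sits on top of the replicated device (names/sizes are placeholders)
pvcreate /dev/drbd0
vgcreate vg_drbd /dev/drbd0
lvcreate -L 100G -n lv_vm_sda vg_drbd
lvcreate -L 100G -n lv_vm_sdb vg_drbd
# The two LVs are then exported to the domU in its config file, e.g.:
#   disk = [ 'phy:/dev/vg_drbd/lv_vm_sda,sda,w',
#            'phy:/dev/vg_drbd/lv_vm_sdb,sdb,w' ]
```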
Recently, we've had issues where the node in the Primary role (and hence running the VM) locks up and throws a kernel panic. The situation seems to point at DRBD and/or the network stack, because the problem does not occur if we disconnect the DRBD resource.

Even worse, the problem occurs very quickly after we connect the DRBD resource, either during resynchronization after being out of sync for a while or during normal replication. No errors show up on the network interface (ifconfig, ethtool).
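The checks we ran look roughly like the following; the interface name eth1 is an assumption here (it stands for whichever interface carries the DRBD link):

```shell
# Per-driver error counters for the DRBD link (eth1 is a placeholder name)
ethtool -S eth1 | egrep -i 'err|drop|crc'
# Current offload settings (tx/rx checksumming, TSO)
ethtool -k eth1
# Since the panic is in a checksum path, disabling offload is one experiment
# we are considering -- a hypothesis, not something we have confirmed:
ethtool -K eth1 tx off rx off tso off
```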
One thing to note is that the kernel panic complains about checksum functions, so that might be related (see below).

Here is the relevant information:
# rpm -qa | grep -e xen -e drbd
drbd83-8.3.2-6.el5_3
kmod-drbd83-xen-8.3.2-6.el5_3
xen-3.0.3-94.el5
kernel-xen-2.6.18-164.el5
xen-libs-3.0.3-94.el5
# cat /etc/drbd.conf
global {
    usage-count no;
}
common {
    protocol C;
    syncer {
        rate 33M;
        verify-alg crc32c;
        al-extents 1801;
    }
    net {
        cram-hmac-alg sha1;
        max-epoch-size 8192;
        max-buffers 8192;
    }
    disk {
        on-io-error detach;
        no-disk-flushes;
        no-disk-barrier;
        no-md-flushes;
    }
}
resource drbd0 {
    device /dev/drbd0;
    disk /dev/sda6;
    flexible-meta-disk internal;
    on node1 {
        address 10.11.1.1:7788;
    }
    on node2 {
        address 10.11.1.2:7788;
    }
}
### Kernel Panic ###
Unable to handle kernel paging request at ffff880011e3cc64
RIP: [<ffffffff80212bad>] csum_partial+0x56/0x4bc
PGD ed8067 PUD ed9067 PMD f69067 PTE 0
Oops: 0000 [1] SMP
last sysfs file: /class/scsi_host/host0/proc_name
CPU 0
Modules linked in: xt_physdev netconsole drbd(U) netloop netbk blktap blkbk
 ipt_MASQUERADE iptable_nat ip_nat bridge ipv6 xfrm_nalgo crypto_api xt_tcpudp
 xt_state ip_conntrack_irc xt_conntrack ip_conntrack_ftp xt_mac xt_length
 xt_limit xt_multiport ipt_ULOG ipt_TCPMSS ipt_TOS ipt_ttl ipt_owner ipt_REJECT
 ipt_ecn ipt_LOG ipt_recent ip_conntrack iptable_mangle iptable_filter
 ip_tables nfnetlink x_tables autofs4 dm_mirror dm_multipath scsi_dh video
 hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp
 parport joydev ide_cd e1000e cdrom serial_core i5000_edac edac_mc bnx2
 serio_raw pcspkr sg dm_raid45 dm_message dm_region_hash dm_log dm_mod
 dm_mem_cache ata_piix libata shpchp megaraid_sas sd_mod scsi_mod ext3 jbd
 uhci_hcd ohci_hcd ehci_hcd
Pid: 12887, comm: drbd0_receiver Tainted: G 2.6.18-128.1.16.el5xen #1
RIP: e030:[<ffffffff80212bad>]
[<ffffffff80212bad>] csum_partial+0x56/0x4bc
RSP: e02b:ffff88000c347718 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880010ced500
RDX: 00000000000000e7 RSI: 000000000000039c RDI: ffff880011e3cc64
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000025b85e7c R11: 0000000000000002 R12: 0000000000000028
R13: 0000000000000028 R14: ffff88001c56f7b0 R15: 0000000025b85e7c
FS: 00002b391e123f60(0000) GS:ffffffff805ba000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process drbd0_receiver (pid: 12887, threadinfo ffff88000c346000, task ffff88001c207820)
Stack: 000000000000039c 00000000000005b4 ffffffff8023d496 ffff88001e7e48d8
 0000001400000000 ffff8800000003c4 ffff88001c56f7b0 ffff88001e7e48d8
 ffff88001e7e48ec ffff88000c3478e8
Call Trace:
[<ffffffff8023d496>] skb_checksum+0x11b/0x260
[<ffffffff80411472>] skb_checksum_help+0x71/0xd0
[<ffffffff8853f33e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
[<ffffffff8853f6cf>] :iptable_nat:ip_nat_local_fn+0x32/0xb7
[<ffffffff8023550c>] nf_iterate+0x41/0x7d
[<ffffffff8042f004>] dst_output+0x0/0xe
[<ffffffff80258b28>] nf_hook_slow+0x58/0xbc
[<ffffffff8042f004>] dst_output+0x0/0xe
[<ffffffff802359ab>] ip_queue_xmit+0x41c/0x48c
[<ffffffff8022c1cb>] local_bh_enable+0x9/0xa5
[<ffffffff8020b6b7>] kmem_cache_alloc+0x62/0x6d
[<ffffffff8023668d>] alloc_skb_from_cache+0x74/0x13c
[<ffffffff80222a0b>] tcp_transmit_skb+0x62f/0x667
[<ffffffff8043903a>] tcp_retransmit_skb+0x53d/0x638
[<ffffffff80439353>] tcp_xmit_retransmit_queue+0x21e/0x2bb
[<ffffffff80225cff>] tcp_ack+0x1705/0x1879
[<ffffffff8021c6b1>] tcp_rcv_established+0x804/0x925
[<ffffffff80263710>] schedule_timeout+0x1e/0xad
[<ffffffff8023cef3>] tcp_v4_do_rcv+0x2a/0x2fa
[<ffffffff8040bbfe>] sk_wait_data+0xac/0xbf
[<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
[<ffffffff80434f71>] tcp_prequeue_process+0x65/0x78
[<ffffffff8021dd39>] tcp_recvmsg+0x492/0xb1f
[<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
[<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
[<ffffffff80231c18>] sock_recvmsg+0x101/0x120
[<ffffffff80231c18>] sock_recvmsg+0x101/0x120
[<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
[<ffffffff80343366>] swiotlb_map_sg+0xf7/0x205
[<ffffffff880b563c>] :megaraid_sas:megasas_make_sgl64+0x78/0xa9
[<ffffffff880b61bc>] :megaraid_sas:megasas_queue_command+0x343/0x3ed
[<ffffffff884e119f>] :drbd:drbd_recv+0x7b/0x109
[<ffffffff884e53b2>] :drbd:receive_DataRequest+0x3b/0x655
[<ffffffff884e1c4b>] :drbd:drbdd+0x77/0x152
[<ffffffff884e4870>] :drbd:drbdd_init+0xea/0x1dc
[<ffffffff884f432a>] :drbd:drbd_thread_setup+0xa2/0x18b
[<ffffffff80260b2c>] child_rip+0xa/0x12
[<ffffffff884f4288>] :drbd:drbd_thread_setup+0x0/0x18b
[<ffffffff80260b22>] child_rip+0x0/0x12
Code: 44 8b 0f ff ca 83 ee 04 48 83 c7 04 4d 01 c8 41 89 d2 41 89
RIP [<ffffffff80212bad>] csum_partial+0x56/0x4bc
RSP <ffff88000c347718>
CR2: ffff880011e3cc64
Kernel panic - not syncing: Fatal exception
#######
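For what it's worth, the faulting frame can be pulled out of the saved oops text mechanically, which is how we isolated csum_partial above:

```shell
# Extract "symbol+offset/length" from a saved oops line (sketch)
oops='[<ffffffff80212bad>] csum_partial+0x56/0x4bc'
sym=$(echo "$oops" | sed -n 's/.*\] \([a-z_]*+0x[0-9a-f]*\/0x[0-9a-f]*\).*/\1/p')
echo "$sym"   # -> csum_partial+0x56/0x4bc
# With the matching kernel-debuginfo package installed, the raw address
# should resolve to a source line:
#   addr2line -e /usr/lib/debug/lib/modules/$(uname -r)/vmlinux ffffffff80212bad
```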
Any ideas on how to diagnose this properly and eventually find the culprit?
Regards,
--
Jean-François Chevrette [iWeb]