It appears that there is currently a problem with the latest CentOS/Red Hat
kernel. We have noticed the same problem when using LVM snapshots and a
backup technology called R1Soft CDP.

Some related info:
http://bugs.centos.org/view.php?id=3869
forum.r1soft.com/showthread.php?t=1158

No sign of a bug at bugzilla.redhat.com.

For now we have reverted to kernel-2.6.18-128.7.1, on which we have not had
any issues for the past 4 hours. Previously, the kernel panic would occur a
few seconds after starting a 'drbdadm verify'.

DRBD devs might want to check it out.

Regards,
--
Jean-François Chevrette [iWeb]

On 09-11-09 10:20 AM, Jean-Francois Chevrette wrote:
> Hello,
>
> here we have a two-node setup running CentOS 5.4, Xen 3.0
> (CentOS RPMs) and DRBD 8.3.2 (again CentOS RPM). Both servers are Dell
> PowerEdge 1950 servers with two Quad-Core Xeon processors and 32GB of
> memory. The network card used by DRBD is an Intel 82571EB Gigabit
> Ethernet card (e1000 driver). Both are connected directly with a
> crossover cable.
>
> DRBD is configured so that I have one resource (drbd0) on which I have
> configured an LVM volume group, which is then sliced into two LVs. Both LVs
> are mapped to my Xen VM (PV) as sda and sdb disks.
>
> Recently, we've had issues where the node that is in Primary state, and
> hence running the VM, locks up and throws a kernel panic. The situation
> seems to indicate that this might be a problem related to DRBD and/or
> the network stack, because if we disconnect the DRBD resource, this
> problem does not occur.
>
> Even worse, the problem occurs very quickly after we connect the DRBD
> resource, either during resynchronization after being out-of-sync for a
> while or during normal syncing operations.
> No errors show up on the network interface (ifconfig, ethtool).
>
> One thing to note is that the kernel panic seems to complain about
> checksum functions, so that might be related (see below).
>
> Here is the relevant information:
>
> # rpm -qa | grep -e xen -e drbd
> drbd83-8.3.2-6.el5_3
> kmod-drbd83-xen-8.3.2-6.el5_3
> xen-3.0.3-94.el5
> kernel-xen-2.6.18-164.el5
> xen-libs-3.0.3-94.el5
>
> # cat /etc/drbd.conf
> global {
>   usage-count no;
> }
>
> common {
>   protocol C;
>
>   syncer {
>     rate 33M;
>     verify-alg crc32c;
>     al-extents 1801;
>   }
>   net {
>     cram-hmac-alg sha1;
>     max-epoch-size 8192;
>     max-buffers 8192;
>   }
>
>   disk {
>     on-io-error detach;
>     no-disk-flushes;
>     no-disk-barrier;
>     no-md-flushes;
>   }
> }
>
> resource drbd0 {
>   device /dev/drbd0;
>   disk /dev/sda6;
>   flexible-meta-disk internal;
>   on node1 {
>     address 10.11.1.1:7788;
>   }
>   on node2 {
>     address 10.11.1.2:7788;
>   }
> }
>
> ### Kernel Panic ###
> Unable to handle kernel paging request at ffff880011e3cc64 RIP:
>  [<ffffffff80212bad>] csum_partial+0x56/0x4bc
> PGD ed8067 PUD ed9067 PMD f69067 PTE 0
> Oops: 0000 [1] SMP
> last sysfs file: /class/scsi_host/host0/proc_name
> CPU 0
> Modules linked in: xt_physdev netconsole drbd(U) netloop netbk blktap
> blkbk ipt_MASQUERADE iptable_nat ip_nat bridge ipv6 xfrm_nalgo crypto_api
> xt_tcpudp xt_state ip_conntrack_irc xt_conntrack ip_conntrack_ftp xt_mac
> xt_length xt_limit xt_multiport ipt_ULOG ipt_TCPMSS ipt_TOS ipt_ttl
> ipt_owner ipt_REJECT ipt_ecn ipt_LOG ipt_recent ip_conntrack
> iptable_mangle iptable_filter ip_tables nfnetlink x_tables autofs4
> dm_mirror dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core
> button battery asus_acpi ac parport_pc lp parport joydev ide_cd e1000e
> cdrom serial_core i5000_edac edac_mc bnx2 serio_raw pcspkr sg dm_raid45
> dm_message dm_region_hash dm_log dm_mod
> dm_mem_cache ata_piix libata shpchp megaraid_sas sd_mod scsi_mod ext3 jbd
> uhci_hcd ohci_hcd ehci_hcd
> Pid: 12887, comm: drbd0_receiver Tainted: G 2.6.18-128.1.16.el5xen #1
> RIP: e030:[<ffffffff80212bad>] [<ffffffff80212bad>] csum_partial+0x56/0x4bc
> RSP: e02b:ffff88000c347718 EFLAGS: 00010202
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880010ced500
> RDX: 00000000000000e7 RSI: 000000000000039c RDI: ffff880011e3cc64
> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000025b85e7c R11: 0000000000000002 R12: 0000000000000028
> R13: 0000000000000028 R14: ffff88001c56f7b0 R15: 0000000025b85e7c
> FS: 00002b391e123f60(0000) GS:ffffffff805ba000(0000) knlGS:0000000000000000
> CS: e033 DS: 0000 ES: 0000
> Process drbd0_receiver (pid: 12887, threadinfo ffff88000c346000, task ffff88001c207820)
> Stack: 000000000000039c 00000000000005b4 ffffffff8023d496 ffff88001e7e48d8
>  0000001400000000 ffff8800000003c4 ffff88001c56f7b0 ffff88001e7e48d8
>  ffff88001e7e48ec ffff88000c3478e8
>
> Call Trace:
>  [<ffffffff8023d496>] skb_checksum+0x11b/0x260
>  [<ffffffff80411472>] skb_checksum_help+0x71/0xd0
>  [<ffffffff8853f33e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
>  [<ffffffff8853f6cf>] :iptable_nat:ip_nat_local_fn+0x32/0xb7
>  [<ffffffff8023550c>] nf_iterate+0x41/0x7d
>  [<ffffffff8042f004>] dst_output+0x0/0xe
>  [<ffffffff80258b28>] nf_hook_slow+0x58/0xbc
>  [<ffffffff8042f004>] dst_output+0x0/0xe
>  [<ffffffff802359ab>] ip_queue_xmit+0x41c/0x48c
>  [<ffffffff8022c1cb>] local_bh_enable+0x9/0xa5
>  [<ffffffff8020b6b7>] kmem_cache_alloc+0x62/0x6d
>  [<ffffffff8023668d>] alloc_skb_from_cache+0x74/0x13c
>  [<ffffffff80222a0b>] tcp_transmit_skb+0x62f/0x667
>  [<ffffffff8043903a>] tcp_retransmit_skb+0x53d/0x638
>  [<ffffffff80439353>] tcp_xmit_retransmit_queue+0x21e/0x2bb
>  [<ffffffff80225cff>] tcp_ack+0x1705/0x1879
>  [<ffffffff8021c6b1>] tcp_rcv_established+0x804/0x925
>  [<ffffffff80263710>] schedule_timeout+0x1e/0xad
>  [<ffffffff8023cef3>] tcp_v4_do_rcv+0x2a/0x2fa
>  [<ffffffff8040bbfe>] sk_wait_data+0xac/0xbf
>  [<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
>  [<ffffffff80434f71>] tcp_prequeue_process+0x65/0x78
>  [<ffffffff8021dd39>] tcp_recvmsg+0x492/0xb1f
>  [<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
>  [<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
>  [<ffffffff80231c18>] sock_recvmsg+0x101/0x120
>  [<ffffffff80231c18>] sock_recvmsg+0x101/0x120
>  [<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
>  [<ffffffff80343366>] swiotlb_map_sg+0xf7/0x205
>  [<ffffffff880b563c>] :megaraid_sas:megasas_make_sgl64+0x78/0xa9
>  [<ffffffff880b61bc>] :megaraid_sas:megasas_queue_command+0x343/0x3ed
>  [<ffffffff884e119f>] :drbd:drbd_recv+0x7b/0x109
>  [<ffffffff884e53b2>] :drbd:receive_DataRequest+0x3b/0x655
>  [<ffffffff884e1c4b>] :drbd:drbdd+0x77/0x152
>  [<ffffffff884e4870>] :drbd:drbdd_init+0xea/0x1dc
>  [<ffffffff884f432a>] :drbd:drbd_thread_setup+0xa2/0x18b
>  [<ffffffff80260b2c>] child_rip+0xa/0x12
>  [<ffffffff884f4288>] :drbd:drbd_thread_setup+0x0/0x18b
>  [<ffffffff80260b22>] child_rip+0x0/0x12
>
>
> Code: 44 8b 0f ff ca 83 ee 04 48 83 c7 04 4d 01 c8 41 89 d2 41 89
> RIP [<ffffffff80212bad>] csum_partial+0x56/0x4bc
>  RSP <ffff88000c347718>
> CR2: ffff880011e3cc64
>
> Kernel panic - not syncing: Fatal exception
> #######
>
>
> Any ideas on how to diagnose this properly and eventually find the culprit?
>
>
> Regards,
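
On the closing question of how to diagnose this: one low-effort first step may be to group the call-trace frames by the module they belong to, so it is easier to see which subsystems sit between the DRBD receiver thread and the faulting csum_partial. The following is only a rough sketch and was not part of the original thread. The names parse_frames, frames_by_module and SAMPLE are made up for illustration, and the regular expression assumes the 2.6.18-style frame layout seen in the trace above (module frames written as :module:symbol, core-kernel frames without a prefix, labelled "vmlinux" here).

```python
import re
from collections import Counter

# One frame of a 2.6.18-style call trace, for example:
#   [<ffffffff8853f33e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
#   [<ffffffff8023d496>] skb_checksum+0x11b/0x260
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+"
    r"(?::(?P<module>\w+):)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_frames(text):
    """Return (module, symbol) pairs in trace order.

    Frames without a :module: prefix are core-kernel code and are
    labelled 'vmlinux' (a label chosen for this sketch).
    """
    return [
        (m.group("module") or "vmlinux", m.group("symbol"))
        for m in FRAME_RE.finditer(text)
    ]


def frames_by_module(text):
    """Count how many trace frames belong to each module."""
    return Counter(module for module, _ in parse_frames(text))


# A few frames copied from the trace above, quote markers included.
SAMPLE = """\
> [<ffffffff8023d496>] skb_checksum+0x11b/0x260
> [<ffffffff80411472>] skb_checksum_help+0x71/0xd0
> [<ffffffff8853f33e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
> [<ffffffff8853f6cf>] :iptable_nat:ip_nat_local_fn+0x32/0xb7
> [<ffffffff8023550c>] nf_iterate+0x41/0x7d
> [<ffffffff884e119f>] :drbd:drbd_recv+0x7b/0x109
"""

if __name__ == "__main__":
    for module, symbol in parse_frames(SAMPLE):
        print(f"{module:12s} {symbol}")
    print(dict(frames_by_module(SAMPLE)))
```

Run against the full trace, this would at least show at a glance that the frames nearest the fault (skb_checksum_help called from ip_nat_fn) are in the netfilter NAT path of a TCP retransmit, with the DRBD frames only further down the stack.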