Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Thursday 12 November 2009 18:26:14 Jean-Francois Chevrette wrote:
> It appears that there is currently a problem with the latest
> CentOS/Red Hat kernel. We have noticed the same problem when using LVM
> snapshots and a backup technology called R1Soft CDP.
>
> Some related info:
> http://bugs.centos.org/view.php?id=3869
> forum.r1soft.com/showthread.php?t=1158
>
> No sign of a bug at bugzilla.redhat.com
>
> For now we have reverted to kernel-2.6.18-128.7.1, on which we have not
> had any issues for the past 4 hours. Previously, the kernel panic would
> occur a few seconds after starting a 'drbdadm verify'.
>
> DRBD devs might want to check it out.
>
> Regards,
>
> > Hello,
> >
> > here we have a two-node setup running CentOS 5.4, Xen 3.0
> > (CentOS RPMs) and DRBD 8.3.2 (again a CentOS RPM). Both servers are Dell
> > PowerEdge 1950 servers with two Quad-Core Xeon processors and 32GB of
> > memory. The network card used by DRBD is an Intel 82571EB Gigabit
> > Ethernet card (e1000 driver). Both are connected directly with a
> > crossover cable.
> >
> > DRBD is configured so that I have one resource (drbd0), on which I have
> > configured an LVM VolumeGroup which is then sliced into two LVs. Both
> > LVs are mapped to my Xen VM (PV) as the sda and sdb disks.
> >
> > Recently, we've had issues where the node that is in Primary state, and
> > hence running the VM, locks up and throws a kernel panic. The situation
> > seems to indicate that this might be a problem related to DRBD and/or
> > the network stack, because if we disconnect the DRBD resource, the
> > problem does not occur.
> >
> > Even worse, the problem occurs very quickly after we connect the DRBD
> > resource, either during resynchronization after being out of sync for a
> > while or during normal syncing operations.
> > No errors show up on the network interface (ifconfig, ethtool).
> >
> > One thing to note is that the kernel panic seems to complain about
> > checksum functions, so that might be related (see below).
> >
> > Here is the relevant information:
> >
> > # rpm -qa | grep -e xen -e drbd
> > drbd83-8.3.2-6.el5_3
> > kmod-drbd83-xen-8.3.2-6.el5_3
> > xen-3.0.3-94.el5
> > kernel-xen-2.6.18-164.el5
> > xen-libs-3.0.3-94.el5
> >
> > # cat /etc/drbd.conf
> > global {
> >     usage-count no;
> > }
> >
> > common {
> >     protocol C;
> >
> >     syncer {
> >         rate 33M;
> >         verify-alg crc32c;
> >         al-extents 1801;
> >     }
> >     net {
> >         cram-hmac-alg sha1;
> >         max-epoch-size 8192;
> >         max-buffers 8192;
> >     }
> >
> >     disk {
> >         on-io-error detach;
> >         no-disk-flushes;
> >         no-disk-barrier;
> >         no-md-flushes;
> >     }
> > }
> >
> > resource drbd0 {
> >     device /dev/drbd0;
> >     disk /dev/sda6;
> >     flexible-meta-disk internal;
> >     on node1 {
> >         address 10.11.1.1:7788;
> >     }
> >     on node2 {
> >         address 10.11.1.2:7788;
> >     }
> > }
> >
> > ### Kernel Panic ###
> > Unable to handle kernel paging request at ffff880011e3cc64 RIP:
> >  [<ffffffff80212bad>] csum_partial+0x56/0x4bc
> > PGD ed8067 PUD ed9067 PMD f69067 PTE 0
> >
> > Oops: 0000 [1] SMP
> >
> > last sysfs file: /class/scsi_host/host0/proc_name
> > CPU 0
> >
> > Modules linked in: xt_physdev netconsole drbd(U) netloop netbk blktap
> > blkbk ipt_MASQUERADE iptable_nat ip_nat bridge ipv6 xfrm_nalgo
> > crypto_api xt_tcpudp xt_state ip_conntrack_irc xt_conntrack
> > ip_conntrack_ftp xt_mac xt_length xt_limit xt_multiport ipt_ULOG
> > ipt_TCPMSS ipt_TOS ipt_ttl ipt_owner ipt_REJECT ipt_ecn ipt_LOG
> > ipt_recent ip_conntrack iptable_mangle iptable_filter ip_tables
> > nfnetlink x_tables autofs4 dm_mirror dm_multipath scsi_dh video hwmon
> > backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp
> > parport joydev ide_cd e1000e cdrom serial_core i5000_edac edac_mc bnx2
> > serio_raw pcspkr sg dm_raid45 dm_message dm_region_hash dm_log dm_mod
> > dm_mem_cache ata_piix libata shpchp megaraid_sas sd_mod scsi_mod ext3
> > jbd uhci_hcd ohci_hcd ehci_hcd
> >
> > Pid: 12887, comm: drbd0_receiver Tainted: G 2.6.18-128.1.16.el5xen #1
> > RIP: e030:[<ffffffff80212bad>] [<ffffffff80212bad>] csum_partial+0x56/0x4bc
> > RSP: e02b:ffff88000c347718 EFLAGS: 00010202
> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880010ced500
> > RDX: 00000000000000e7 RSI: 000000000000039c RDI: ffff880011e3cc64
> > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000025b85e7c R11: 0000000000000002 R12: 0000000000000028
> > R13: 0000000000000028 R14: ffff88001c56f7b0 R15: 0000000025b85e7c
> > FS: 00002b391e123f60(0000) GS:ffffffff805ba000(0000) knlGS:0000000000000000
> > CS: e033 DS: 0000 ES: 0000
> > Process drbd0_receiver (pid: 12887, threadinfo ffff88000c346000, task
> > ffff88001c207820)
> > Stack:
> >  000000000000039c 00000000000005b4 ffffffff8023d496 ffff88001e7e48d8
> >  0000001400000000 ffff8800000003c4 ffff88001c56f7b0 ffff88001e7e48d8
> >  ffff88001e7e48ec ffff88000c3478e8
> >
> > Call Trace:
> >  [<ffffffff8023d496>] skb_checksum+0x11b/0x260
> >  [<ffffffff80411472>] skb_checksum_help+0x71/0xd0
[<ffffffff8853f33e>] :iptable_nat:ip_nat_fn+0x56/0x1c3 > > [<ffffffff8853f6cf>] :iptable_nat:ip_nat_local_fn+0x32/0xb7 > > [<ffffffff8023550c>] nf_iterate+0x41/0x7d > > [<ffffffff8042f004>] dst_output+0x0/0xe > > [<ffffffff80258b28>] nf_hook_slow+0x58/0xbc > > [<ffffffff8042f004>] dst_output+0x0/0xe > > [<ffffffff802359ab>] ip_queue_xmit+0x41c/0x48c > > [<ffffffff8022c1cb>] local_bh_enable+0x9/0xa5 > > [<ffffffff8020b6b7>] kmem_cache_alloc+0x62/0x6d > > [<ffffffff8023668d>] alloc_skb_from_cache+0x74/0x13c > > [<ffffffff80222a0b>] tcp_transmit_skb+0x62f/0x667 > > [<ffffffff8043903a>] tcp_retransmit_skb+0x53d/0x638 > > [<ffffffff80439353>] tcp_xmit_retransmit_queue+0x21e/0x2bb > > [<ffffffff80225cff>] tcp_ack+0x1705/0x1879 > > [<ffffffff8021c6b1>] tcp_rcv_established+0x804/0x925 > > [<ffffffff80263710>] schedule_timeout+0x1e/0xad > > [<ffffffff8023cef3>] tcp_v4_do_rcv+0x2a/0x2fa > > [<ffffffff8040bbfe>] sk_wait_data+0xac/0xbf > > [<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e > > [<ffffffff80434f71>] tcp_prequeue_process+0x65/0x78 > > [<ffffffff8021dd39>] tcp_recvmsg+0x492/0xb1f > > [<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43 > > [<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43 > > [<ffffffff80231c18>] sock_recvmsg+0x101/0x120 > > [<ffffffff80231c18>] sock_recvmsg+0x101/0x120 > > [<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e > > [<ffffffff80343366>] swiotlb_map_sg+0xf7/0x205 > > [<ffffffff880b563c>] :megaraid_sas:megasas_make_sgl64+0x78/0xa9 > > [<ffffffff880b61bc>] :megaraid_sas:megasas_queue_command+0x343/0x3ed > > [<ffffffff884e119f>] :drbd:drbd_recv+0x7b/0x109 > > [<ffffffff884e53b2>] :drbd:receive_DataRequest+0x3b/0x655 > > [<ffffffff884e1c4b>] :drbd:drbdd+0x77/0x152 > > [<ffffffff884e4870>] :drbd:drbdd_init+0xea/0x1dc > > [<ffffffff884f432a>] :drbd:drbd_thread_setup+0xa2/0x18b > > [<ffffffff80260b2c>] child_rip+0xa/0x12 > > [<ffffffff884f4288>] :drbd:drbd_thread_setup+0x0/0x18b > > [<ffffffff80260b22>] child_rip+0x0/0x12 > > > > > > Code: > > 44 > > 8b > > 0f > > ff > > ca > > 83 > > ee > > 04 > > 48 > > 83 > > c7 > > 04 > > 4d > > 01 > > c8 > > 41 > > 89 > > d2 > > 41 > > 89 > > > > RIP > > [<ffffffff80212bad>] csum_partial+0x56/0x4bc > > RSP <ffff88000c347718> > > CR2: ffff880011e3cc64 > > > > Kernel panic - not syncing: Fatal exception > > ####### > > > > > > Any ideas on how to diagnose this properly and eventually find the > > culprit? > > > > > > Regards, > > _______________________________________________ > drbd-user mailing list > drbd-user at lists.linbit.com > http://lists.linbit.com/mailman/listinfo/drbd-user Jean-Francois, thank you for this very elaborate and technically rich reply. I will certainly look into your suggestions about using Broadcom cards. I have one dual port Broadcom card in this server, but I was using one port combined with one port on an Intel e1000 dual port NIC in balanced-rr to provide for backup in the event a NIC goes down. Two port NICs usually share one chip for two ports, so in case of a problem with the chip, the complete DRBD would be out. Reality shows this might be a bad idea though: doing a bonnie++ test to the backend storage (RAID5 on 15K rpm disks) gives me a 255 MB/sec write performance, doing the same test on the DRBD device drops this to 77 MB/sec, even with the MTU set to 9000. It would be nice to get as close as possible to the theoretical maximum, so a lot needs to be done to get there. Step 1 would be changing everything to the broadcom NIC. Any other suggestions? Thanks a lot, Bart