On Monday 04 October 2010 18:10:13 Bart Coninckx wrote:
> On Thursday 12 November 2009 18:26:14 Jean-Francois Chevrette wrote:
> > It appears that there is currently a problem with the latest
> > CentOS/Redhat kernel. We have noticed the same problem when using LVM
> > snapshots and a backup technology called R1Soft CDP.
> >
> > Some related info:
> > http://bugs.centos.org/view.php?id=3869
> > forum.r1soft.com/showthread.php?t=1158
> >
> > No sign of a bug at bugzilla.redhat.com
> >
> > For now we have reverted to kernel-2.6.18-128.7.1 on which we did not
> > have any issues for the past 4 hours. Previously, a few seconds after
> > starting a 'drbdadm verify' the kernel panic would occur.
> >
> > DRBD devs might want to check it out.
> >
> > Regards,
> >
> > > Hello,
> > >
> > > here we have a two nodes setup that are running CentOS 5.4, Xen 3.0
> > > (CentOS RPMs) and DRBD 8.3.2 (again CentOS RPM). Both servers are Dell
> > > PowerEdge 1950 servers with two Quad-Core Xeon processors and 32GB of
> > > memory. The network card used by DRBD is an Intel 82571EB Gigabit
> > > Ethernet card (e1000 driver). Both are connected directly with a
> > > crossover cable.
> > >
> > > DRBD is configured so that I have one resource (drbd0) on which I have
> > > configured a LVM VolumeGroup which is then sliced in two LVs. Both LVs
> > > are mapped to my Xen VM (PV) as sda and sdb disks.
> > >
> > > Recently, we've had issues where the node that is in Primary state and
> > > hence running the VM locks up and throws a kernel panic. The situation
> > > seems to indicate that this might be a problem related to DRBD and/or
> > > the network stack because if we disconnect the DRBD resource, this
> > > problem will not occur.
> > >
> > > Even worse, the problem occur very quickly after we connect the DRBD
> > > resource, either during resynchronization after being out-of-sync for a
> > > while or during normal syncing operations.
> > > No errors show up on the
> > > network interface (ifconfig, ethtool)
> > >
> > > One thing to note is that the kernel panic seems to complain about
> > > checksum functions so that might be related (see below)
> > >
> > > Here are the relevant informations
> > >
> > > # rpm -qa | grep -e xen -e drbd
> > > drbd83-8.3.2-6.el5_3
> > > kmod-drbd83-xen-8.3.2-6.el5_3
> > > xen-3.0.3-94.el5
> > > kernel-xen-2.6.18-164.el5
> > > xen-libs-3.0.3-94.el5
> > >
> > > # cat /etc/drbd.conf
> > > global {
> > >   usage-count no;
> > > }
> > >
> > > common {
> > >   protocol C;
> > >
> > >   syncer {
> > >     rate 33M;
> > >     verify-alg crc32c;
> > >     al-extents 1801;
> > >   }
> > >   net {
> > >     cram-hmac-alg sha1;
> > >     max-epoch-size 8192;
> > >     max-buffers 8192;
> > >   }
> > >
> > >   disk {
> > >     on-io-error detach;
> > >     no-disk-flushes;
> > >     no-disk-barrier;
> > >     no-md-flushes;
> > >   }
> > > }
> > >
> > > resource drbd0 {
> > >   device /dev/drbd0;
> > >   disk /dev/sda6;
> > >   flexible-meta-disk internal;
> > >   on node1 {
> > >     address 10.11.1.1:7788;
> > >   }
> > >   on node2 {
> > >     address 10.11.1.2:7788;
> > >   }
> > > }
> > >
> > > ### Kernel Panic ###
> > > Unable to handle kernel paging request
> > > at ffff880011e3cc64 RIP:
> > > [<ffffffff80212bad>] csum_partial+0x56/0x4bc
> > > PGD ed8067
> > > PUD ed9067
> > > PMD f69067
> > > PTE 0
> > >
> > > Oops: 0000 [1]
> > > SMP
> > >
> > > last sysfs file: /class/scsi_host/host0/proc_name
> > > CPU 0
> > >
> > > Modules linked in:
> > > xt_physdev
> > > netconsole
> > > drbd(U)
> > > netloop
> > > netbk
> > > blktap
> > > blkbk
> > > ipt_MASQUERADE
> > > iptable_nat
> > > ip_nat
> > > bridge
> > > ipv6
> > > xfrm_nalgo
> > > crypto_api
> > > xt_tcpudp
> > > xt_state
> > > ip_conntrack_irc
> > > xt_conntrack
> > > ip_conntrack_ftp
> > > xt_mac
> > > xt_length
> > > xt_limit
> > > xt_multiport
> > > ipt_ULOG
> > > ipt_TCPMSS
> > > ipt_TOS
> > > ipt_ttl
> > > ipt_owner
> > > ipt_REJECT
> > > ipt_ecn
> > > ipt_LOG
> > > ipt_recent
> > > ip_conntrack
> > > iptable_mangle
> > > iptable_filter
> > > ip_tables
> > > nfnetlink
> > > x_tables
> > > autofs4
> > > dm_mirror
> > > dm_multipath
> > > scsi_dh
> > > video
> > > hwmon
> > > backlight
> > > sbs
> > > i2c_ec
> > > i2c_core
> > > button
> > > battery
> > > asus_acpi
> > > ac
> > > parport_pc
> > > lp
> > > parport
> > > joydev
> > > ide_cd
> > > e1000e
> > > cdrom
> > > serial_core
> > > i5000_edac
> > > edac_mc
> > > bnx2
> > > serio_raw
> > > pcspkr
> > > sg
> > > dm_raid45
> > > dm_message
> > > dm_region_hash
> > > dm_log
> > > dm_mod
> > > dm_mem_cache
> > > ata_piix
> > > libata
> > > shpchp
> > > megaraid_sas
> > > sd_mod
> > > scsi_mod
> > > ext3
> > > jbd
> > > uhci_hcd
> > > ohci_hcd
> > > ehci_hcd
> > >
> > > Pid: 12887, comm: drbd0_receiver Tainted: G 2.6.18-128.1.16.el5xen #1
> > > RIP: e030:[<ffffffff80212bad>]
> > > [<ffffffff80212bad>] csum_partial+0x56/0x4bc
> > > RSP: e02b:ffff88000c347718 EFLAGS: 00010202
> > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880010ced500
> > > RDX: 00000000000000e7 RSI: 000000000000039c RDI: ffff880011e3cc64
> > > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> > > R10: 0000000025b85e7c R11: 0000000000000002 R12: 0000000000000028
> > > R13: 0000000000000028 R14: ffff88001c56f7b0 R15: 0000000025b85e7c
> > > FS: 00002b391e123f60(0000) GS:ffffffff805ba000(0000)
> > > knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000
> > > Process drbd0_receiver (pid: 12887, threadinfo ffff88000c346000, task
> > > ffff88001c207820)
> > > Stack:
> > > 000000000000039c
> > > 00000000000005b4
> > > ffffffff8023d496
> > > ffff88001e7e48d8
> > >
> > > 0000001400000000
> > > ffff8800000003c4
> > > ffff88001c56f7b0
> > > ffff88001e7e48d8
> > >
> > > ffff88001e7e48ec
> > > ffff88000c3478e8
> > >
> > > Call Trace:
> > > [<ffffffff8023d496>] skb_checksum+0x11b/0x260
> > > [<ffffffff80411472>] skb_checksum_help+0x71/0xd0
> > > [<ffffffff8853f33e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
> > > [<ffffffff8853f6cf>] :iptable_nat:ip_nat_local_fn+0x32/0xb7
> > > [<ffffffff8023550c>] nf_iterate+0x41/0x7d
> > > [<ffffffff8042f004>] dst_output+0x0/0xe
> > > [<ffffffff80258b28>] nf_hook_slow+0x58/0xbc
> > > [<ffffffff8042f004>] dst_output+0x0/0xe
> > > [<ffffffff802359ab>] ip_queue_xmit+0x41c/0x48c
> > > [<ffffffff8022c1cb>] local_bh_enable+0x9/0xa5
> > > [<ffffffff8020b6b7>] kmem_cache_alloc+0x62/0x6d
> > > [<ffffffff8023668d>] alloc_skb_from_cache+0x74/0x13c
> > > [<ffffffff80222a0b>] tcp_transmit_skb+0x62f/0x667
> > > [<ffffffff8043903a>] tcp_retransmit_skb+0x53d/0x638
> > > [<ffffffff80439353>] tcp_xmit_retransmit_queue+0x21e/0x2bb
> > > [<ffffffff80225cff>] tcp_ack+0x1705/0x1879
> > > [<ffffffff8021c6b1>] tcp_rcv_established+0x804/0x925
> > > [<ffffffff80263710>] schedule_timeout+0x1e/0xad
> > > [<ffffffff8023cef3>] tcp_v4_do_rcv+0x2a/0x2fa
> > > [<ffffffff8040bbfe>] sk_wait_data+0xac/0xbf
> > > [<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
> > > [<ffffffff80434f71>] tcp_prequeue_process+0x65/0x78
> > > [<ffffffff8021dd39>] tcp_recvmsg+0x492/0xb1f
> > > [<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
> > > [<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
> > > [<ffffffff80231c18>] sock_recvmsg+0x101/0x120
> > > [<ffffffff80231c18>] sock_recvmsg+0x101/0x120
> > > [<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
> > > [<ffffffff80343366>] swiotlb_map_sg+0xf7/0x205
> > > [<ffffffff880b563c>] :megaraid_sas:megasas_make_sgl64+0x78/0xa9
> > > [<ffffffff880b61bc>] :megaraid_sas:megasas_queue_command+0x343/0x3ed
> > > [<ffffffff884e119f>] :drbd:drbd_recv+0x7b/0x109
> > > [<ffffffff884e53b2>] :drbd:receive_DataRequest+0x3b/0x655
> > > [<ffffffff884e1c4b>] :drbd:drbdd+0x77/0x152
> > > [<ffffffff884e4870>] :drbd:drbdd_init+0xea/0x1dc
> > > [<ffffffff884f432a>] :drbd:drbd_thread_setup+0xa2/0x18b
> > > [<ffffffff80260b2c>] child_rip+0xa/0x12
> > > [<ffffffff884f4288>] :drbd:drbd_thread_setup+0x0/0x18b
> > > [<ffffffff80260b22>] child_rip+0x0/0x12
> > >
> > > Code:
> > > 44
> > > 8b
> > > 0f
> > > ff
> > > ca
> > > 83
> > > ee
> > > 04
> > > 48
> > > 83
> > > c7
> > > 04
> > > 4d
> > > 01
> > > c8
> > > 41
> > > 89
> > > d2
> > > 41
> > > 89
> > >
> > > RIP
> > > [<ffffffff80212bad>] csum_partial+0x56/0x4bc
> > > RSP <ffff88000c347718>
> > > CR2: ffff880011e3cc64
> > >
> > > Kernel panic - not syncing: Fatal exception
> > > #######
> > >
> > >
> > > Any ideas on how to diagnose this properly and eventually find the
> > > culprit?
> > >
> > >
> > > Regards,
> >
> > _______________________________________________
> > drbd-user mailing list
> > drbd-user at lists.linbit.com
> > http://lists.linbit.com/mailman/listinfo/drbd-user
>
> Jean-Francois,
>
> thank you for this very elaborate and technically rich reply. I will
> certainly look into your suggestions about using Broadcom cards. I have
> one dual port Broadcom card in this server, but I was using one port
> combined with one port on an Intel e1000 dual port NIC in balanced-rr to
> provide for backup in the event a NIC goes down. Two port NICs usually
> share one chip for two ports, so in case of a problem with the chip, the
> complete DRBD would be out. Reality shows this might be a bad idea though:
> doing a bonnie++ test to the backend storage (RAID5 on 15K rpm disks)
> gives me a 255 MB/sec write performance, doing the same test on the DRBD
> device drops this to 77 MB/sec, even with the MTU set to 9000. It would be
> nice to get as close as possible to the theoretical maximum, so a lot
> needs to be done to get there.
> Step 1 would be changing everything to the broadcom NIC. Any other
> suggestions?
>
> Thanks a lot,
>
> Bart

Oops, completely wrong thread. Please disregard ...

B.