Hi,

Has anyone else run into kernel panics once a certain number of resources are added to a DRBD setup?

We have a DRBD setup in which we are trying to use 10 resources (moving to 18) under Xen Dom0, with InfiniBand as the interconnect. As soon as both DRBD systems go active and start the sync process, one or both servers kernel panic. If we put a skip {} block around any one (or more) of the resource entries, DRBD syncs the drives and everything works just fine.

We use the "fencing dont-care;" option in DRBD because of the primary/primary issue with the version of DRBD we are using.

Version: 8.3.0 (api:88)

This also happens with the vanilla Xen kernel; the one below just adds the MD RAID patch.

The closest issue to our situation seems to be Ethernet driver related (https://bugzilla.redhat.com/show_bug.cgi?id=476897), since it throws a similar error and uses a large MTU.

Kernel panic stack backtrace (I cannot get kdump to work under Xen Dom0, so this is from a serial console capture):

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
 [<ffffffff8027cc53>] xen_destroy_contiguous_region+0x83/0x3d6
PGD 5bb3f8067 PUD 5bad6b067 PMD 0
Oops: 0002 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 1
Modules linked in: drbd(U) vsd(U) xt_physdev netloop netbk blktap blkbk
 ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink
 ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4
 nfs lockd fscache nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq
 freq_table rdma_ucm(U) qlgc_vnic(U) ib_sdp(U) rdma_cm(U) iw_cm(U)
 ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6
 xfrm_nalgo crypto_api ib_uverbs(U) ib_umad(U) iw_cxgb3(U) cxgb3(U)
 ib_ipath(U) mlx4_ib(U) mlx4_core(U) dm_multipath scsi_dh video hwmon
 backlight sbs i2c_ec button battery asus_acpi ac parport_pc lp parport
 joydev sr_mod sg i5000_edac i2c_i801 e1000e edac_mc i2c_core
 ib_mthca(U) ide_cd ib_mad(U)
 ib_core(U) serial_core pcspkr serio_raw cdrom dm_raid45 dm_message
 dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod
 usb_storage qla2xxx scsi_transport_fc ata_piix libata shpchp mptsas
 mptscsih mptbase scsi_transport_sas sd_mod scsi_mod raid1 ext3 jbd
 uhci_hcd ohci_hcd ehci_hcd
Pid: 4724, comm: ib_cm/1 Tainted: G 2.6.18-128.el5.bsi.01xen #1
RIP: e030:[<ffffffff8027cc53>] [<ffffffff8027cc53>] xen_destroy_contiguous_region+0x83/0x3d6
RSP: e02b:ffff8805c69ef770 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000001000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffffff8068fa40 R09: 0000000000000000
R10: ffff8805c69ef770 R11: 0000000000000048 R12: 0000000000000001
R13: 0000000000000005 R14: 0000000000008000 R15: ffff8802fa52d400
FS: 00002b5f13ed52b0(0000) GS:ffffffff805ba080(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process ib_cm/1 (pid: 4724, threadinfo ffff8805c69ee000, task ffff8805cc46d100)
Stack: ffff8805c69ef7b8 0000000000000001 0000000000000000 0000000000007ff0
 ffffffff8068ea40 0000000000000001 0000000000000000 0000000000007ff0
 0000000000000000 ffffffff804eac80
Call Trace:
 [<ffffffff80271292>] dma_free_coherent+0x69/0x77
 [<ffffffff883b596e>] :ib_mthca:mthca_buf_free+0x73/0x9c
 [<ffffffff883b5ef8>] :ib_mthca:mthca_buf_alloc+0x273/0x297
 [<ffffffff802629d6>] mutex_lock+0xd/0x1d
 [<ffffffff883baa6d>] :ib_mthca:mthca_alloc_qp_common+0x23d/0x517
 [<ffffffff8025d6e8>] del_timer_sync+0xc/0x16
 [<ffffffff883bb08c>] :ib_mthca:mthca_alloc_qp+0xab/0x106
 [<ffffffff883bf9f7>] :ib_mthca:mthca_create_qp+0x12d/0x28e
 [<ffffffff883b2453>] :ib_mthca:mthca_cmd_wait+0x183/0x1d7
 [<ffffffff8836ef48>] :ib_core:ib_create_qp+0x17/0xb4
 [<ffffffff887009e4>] :rdma_cm:rdma_create_qp+0x2d/0x153
 [<ffffffff803a345c>] dma_pool_free+0x83/0x144
 [<ffffffff8020b7bf>] kfree+0x15/0xc5
 [<ffffffff883b879f>] :ib_mthca:mthca_init_cq+0x2f5/0x39f
 [<ffffffff883bfc50>] :ib_mthca:mthca_create_cq+0xf8/0x1c8
 [<ffffffff88716354>] :ib_sdp:sdp_completion_handler+0x0/0xc
 [<ffffffff88714904>] :ib_sdp:sdp_cq_event_handler+0x0/0x1
 [<ffffffff8836f00c>] :ib_core:ib_create_cq+0x27/0x55
 [<ffffffff88714c27>] :ib_sdp:sdp_init_qp+0x321/0x43a
 [<ffffffff88714905>] :ib_sdp:sdp_qp_event_handler+0x0/0x1
 [<ffffffff8871551d>] :ib_sdp:sdp_cma_handler+0x4d2/0x1309
 [<ffffffff886fd797>] :rdma_cm:cma_acquire_dev+0xec/0x113
 [<ffffffff8871504b>] :ib_sdp:sdp_cma_handler+0x0/0x1309
 [<ffffffff8870015d>] :rdma_cm:cma_req_handler+0x30a/0x3c3
 [<ffffffff886abc7d>] :ib_cm:cm_process_work+0x48/0x97
 [<ffffffff886ad076>] :ib_cm:cm_req_handler+0x832/0x89f
 [<ffffffff886ad0e3>] :ib_cm:cm_work_handler+0x0/0xa9f
 [<ffffffff886ad113>] :ib_cm:cm_work_handler+0x30/0xa9f
 [<ffffffff886ad0e3>] :ib_cm:cm_work_handler+0x0/0xa9f
 [<ffffffff8024ee11>] run_workqueue+0x94/0xe4
 [<ffffffff8024b71a>] worker_thread+0x0/0x122
 [<ffffffff80299db3>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8024b80a>] worker_thread+0xf0/0x122
 [<ffffffff80286daf>] default_wake_function+0x0/0xe
 [<ffffffff80299db3>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80299db3>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233476>] kthread+0xfe/0x132
 [<ffffffff8025fb2c>] child_rip+0xa/0x12
 [<ffffffff80299db3>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233378>] kthread+0x0/0x132
 [<ffffffff8025fb22>] child_rip+0x0/0x12

Code: f3 aa 48 c7 c7 80 31 53 80 e8 8f 6d fe ff 49 89 c3 48 b8 ff
RIP [<ffffffff8027cc53>] xen_destroy_contiguous_region+0x83/0x3d6
 RSP <ffff8805c69ef770>
CR2: 0000000000000000
 <0>Kernel panic - not syncing: Fatal exception
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.
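
For anyone trying to read the backtrace above, here is a minimal Python sketch that counts call-trace frames per kernel module, which makes it easier to see which drivers sit on the crashing path. The function name frames_by_module and the regular expression are illustrative assumptions, and only a handful of frames from the trace are pasted in; it is a reading aid, not a diagnosis.

```python
import re
from collections import Counter

# A few frames copied from the call trace in this post; paste the
# full trace here to summarize all of it.
TRACE = """
[<ffffffff80271292>] dma_free_coherent+0x69/0x77
[<ffffffff883b596e>] :ib_mthca:mthca_buf_free+0x73/0x9c
[<ffffffff883b5ef8>] :ib_mthca:mthca_buf_alloc+0x273/0x297
[<ffffffff883baa6d>] :ib_mthca:mthca_alloc_qp_common+0x23d/0x517
[<ffffffff8836ef48>] :ib_core:ib_create_qp+0x17/0xb4
[<ffffffff887009e4>] :rdma_cm:rdma_create_qp+0x2d/0x153
[<ffffffff88714c27>] :ib_sdp:sdp_init_qp+0x321/0x43a
"""

# Matches "[<address>] :module:symbol+0xoff/0xlen"; the ":module:" part
# is absent for symbols built into the kernel image.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{16})>\]\s+(?::(\w+):)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

def frames_by_module(trace):
    """Return a Counter mapping module name (or 'kernel') to frame count."""
    counts = Counter()
    for _addr, module, _symbol in FRAME_RE.findall(trace):
        counts[module or "kernel"] += 1
    return counts

if __name__ == "__main__":
    for module, n in frames_by_module(TRACE).most_common():
        print(f"{module:10s} {n}")
```

Read top-down, the first frames show mthca_buf_alloc calling mthca_buf_free, which reaches dma_free_coherent and then the faulting xen_destroy_contiguous_region. That suggests the oops happens on a cleanup path inside the ib_mthca driver during queue pair creation, but this is an interpretation of the trace, not a confirmed root cause.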

Our drbd.cfg:

global {
    usage-count no;
}

common {
    protocol C;

    net {
        timeout 60;
        max-epoch-size 2048;
        max-buffers 2048;
        unplug-watermark 128;
        connect-int 10;
        ping-int 10;
        sndbuf-size 32764;
        ko-count 2;
        ping-timeout 10;
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }

    startup {
        wfc-timeout 60;
        degr-wfc-timeout 15;
        become-primary-on both;
    }

    handlers {
        local-io-error "/usr/lib/drbd/brtHandler.pl local-io-error";
        pri-on-incon-degr "/usr/lib/drbd/brtHandler.pl pri-on-incon-degr";
        pri-lost-after-sb "/usr/lib/drbd/brtHandler.pl pri-lost-after-sb";
        pri-lost "/usr/lib/drbd/brtHandler.pl pri-lost";
        split-brain "/usr/lib/drbd/brtHandler.pl split-brain";
        before-resync-target "/usr/lib/drbd/brtHandler.pl before-resync-target";
        after-resync-target "/usr/lib/drbd/brtHandler.pl after-resync-target";
        out-of-sync "/usr/lib/drbd/brtHandler.pl out-of-sync";
        fence-peer "/usr/lib/drbd/brtHandler.pl fence-peer";
        outdate-peer "/usr/lib/drbd/brtHandler.pl outdate-peer";
    }

    disk {
        fencing dont-care;
        max-bio-bvecs 1;
        no-disk-flushes;
        no-md-flushes;
        no-disk-barrier;
        no-disk-drain;
        on-io-error call-local-io-error;
    }
}

resource ol01 {
    device /dev/drbd1;
    disk /dev/vsdb;
    meta-disk internal;
    on g2-0937-xxxx-1host-1 {
        address sci 1.53.240.1:7810;
    }
    on g2-0937-xxxx-1host-2 {
        address sci 1.53.240.2:7810;
    }
}

skipping resource ol02 - ol08, and nl01

resource nl02 {
    device /dev/drbd10;
    disk /dev/vsdk;
    meta-disk internal;
    on g2-0937-xxxx-1host-1 {
        address sci 1.53.240.1:7819;
    }
    on g2-0937-xxxx-1host-2 {
        address sci 1.53.240.2:7819;
    }
}

ifcfg-ib0:

DEVICE=ib0
BOOTPROTO=static
IPADDR=1.53.240.1
NETMASK=255.255.240.0
BROADCAST=1.53.15.255
ONBOOT=yes

ifconfig:

ib0  Link encap:InfiniBand HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
     inet addr:1.53.240.1 Bcast:1.53.15.255 Mask:255.255.240.0
     inet6 addr: fe80::202:c902:22:b9a9/64 Scope:Link
     UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
     RX packets:153 errors:0 dropped:0 overruns:0 frame:0
     TX packets:153 errors:0 dropped:5 overruns:0 carrier:0
     collisions:0 txqueuelen:256
     RX bytes:8568 (8.3 KiB) TX bytes:9188 (8.9 KiB)

lspci -v:

06:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev a0)
    Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
    Flags: bus master, fast devsel, latency 0, IRQ 20
    Memory at b9100000 (64-bit, non-prefetchable) [size=1M]
    Memory at b8000000 (64-bit, prefetchable) [size=8M]
    Capabilities: [40] Power Management version 2
    Capabilities: [48] Vital Product Data
    Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
    Capabilities: [84] MSI-X: Enable+ Mask- TabSize=32
    Capabilities: [60] Express Endpoint IRQ 0
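
As a side check on the ifcfg-ib0 values above, this short Python sketch derives the broadcast address implied by IPADDR and NETMASK using the standard ipaddress module and compares it with the configured BROADCAST. The helper name expected_broadcast is an assumption made for illustration. Whether the mismatch it reports has any bearing on the panic is not established here.

```python
import ipaddress

def expected_broadcast(ipaddr, netmask):
    """Return the broadcast address implied by an IPv4 address and netmask."""
    iface = ipaddress.ip_interface(f"{ipaddr}/{netmask}")
    return str(iface.network.broadcast_address)

if __name__ == "__main__":
    # Values copied from the ifcfg-ib0 file in this post.
    configured = "1.53.15.255"
    computed = expected_broadcast("1.53.240.1", "255.255.240.0")
    print("computed broadcast:  ", computed)
    print("configured broadcast:", configured)
    print("match:", computed == configured)
```

With the posted values the computed broadcast is 1.53.255.255, which differs from the configured 1.53.15.255, so the BROADCAST line may be worth double-checking independently of the DRBD problem.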