[DRBD-user] Spin_Lock timeout in DRBD during heavy load
stefan.flothkoetter
stefan.flothkoetter at xelistra.com
Thu May 16 16:49:08 CEST 2019
Hi!

I'm frequently seeing spin_lock timeouts (hard lockups) under heavy load
while a resync is running. This is on a current CentOS 7 with kernel
3.10.0-957.12.1.el7.x86_64 and DRBD 9.0.16-1:
[ 2636.004640] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
[ 2636.074353] Modules linked in: tun fuse udp_diag tcp_diag inet_diag
drbd_transport_tcp(OE) drbd(OE) binfmt_misc nf_conntrack_netlink devlink
bnx2fc cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt
scsi_transport_iscsi ip6t_MASQUERADE nf_nat_masquerade_ipv6
ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netbios_ns
nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4
ip6t_REJECT nf_reject_ipv6 xt_conntrack geneve openvswitch ip_set
nfnetlink ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
ebtable_filter ebtables ip6table_filter ip6_tables vxlan ip6_udp_tunnel
udp_tunnel iptable_filter sunrpc vfat fat xfs libcrc32c skx_edac
intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass
crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper
ablk_helper cryptd dm_service_time iTCO_wdt iTCO_vendor_support
dell_smbios dcdbas dell_wmi_descriptor lpc_ich pcspkr sg i2c_i801
joydev ipmi_si mei_me ipmi_devintf ipmi_msghandler mei wmi dm_multipath
tpm_crb acpi_pad acpi_power_meter ip_tables ext4 mbcache jbd2 mgag200
drm_kms_helper syscopyarea sysfillrect sysimgblt sr_mod fb_sys_fops
cdrom ttm sd_mod crc_t10dif crct10dif_generic uas ahci igb libahci drm
crct10dif_pclmul i40e(OE) nvme crct10dif_common ptp crc32c_intel
megaraid_sas usb_storage drm_panel_orientation_quirks nvme_core libata
pps_core dca i2c_algo_bit nfit libnvdimm dm_mirror dm_region_hash
dm_log dm_mod 8021q garp mrp bridge stp llc bonding dummy
[ 2637.838138] CPU: 1 PID: 39224 Comm: drbd_r_iscsi-ss Kdump: loaded Tainted: G OE ------------ 3.10.0-957.12.1.el7.x86_64 #1
[ 2637.985386] Hardware name: /0W23H8, BIOS 2.1.8 04/30/2019
[ 2638.053938] task: ffff8a08dfb7c100 ti: ffff8a39d6398000 task.ti: ffff8a39d6398000
[ 2638.144194] RIP: 0010:[<ffffffffc0b6251b>] [<ffffffffc0b6251b>] receive_Data+0x78b/0x18f0 [drbd]
[ 2638.251259] RSP: 0018:ffff8a39d639bd18 EFLAGS: 00000246
[ 2638.315589] RAX: 0000000000000000 RBX: ffff8a39f6ae8000 RCX: 00000000000004d5
[ 2638.401699] RDX: 0000000000000001 RSI: ffff8a39f172a3c0 RDI: ffff8a39d4a68310
[ 2638.487795] RBP: ffff8a39d639be00 R08: ffff8a39f6ae83e8 R09: 0000000000000077
[ 2638.573890] R10: ffff8a38d0b98678 R11: ffff8a0af5f44800 R12: 0000000000000000
[ 2638.659983] R13: ffff8a39d4a68000 R14: ffff8a39d639bd80 R15: ffff8a38d0b986b8
[ 2638.746084] FS: 0000000000000000(0000) GS:ffff8a39fbe00000(0000) knlGS:0000000000000000
[ 2638.843620] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2638.913124] CR2: 00007f1c50795df0 CR3: 000000078e810000 CR4: 00000000007607e0
[ 2638.999251] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2639.085343] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2639.171440] PKRU: 00000000
[ 2639.204633] Call Trace:
[ 2639.234699] [<ffffffffc09bc7f1>] ? dtt_recv+0xe1/0x190 [drbd_transport_tcp]
[ 2639.319683] [<ffffffffc0b69599>] drbd_receiver+0x479/0x7a0 [drbd]
[ 2639.394278] [<ffffffffc0b768b0>] ? do_retry+0x290/0x290 [drbd]
[ 2639.465758] [<ffffffffc0b7694e>] drbd_thread_setup+0x9e/0x1d0 [drbd]
[ 2639.543470] [<ffffffffc0b768b0>] ? do_retry+0x290/0x290 [drbd]
[ 2639.614935] [<ffffffff90ec1c71>] kthread+0xd1/0xe0
[ 2639.673958] [<ffffffff90ec1ba0>] ? insert_kthread_work+0x40/0x40
[ 2639.747527] [<ffffffff91575bdd>] ret_from_fork_nospec_begin+0x7/0x21
[ 2639.825244] [<ffffffff90ec1ba0>] ? insert_kthread_work+0x40/0x40
[ 2639.898792] Code: b7 3c 02 00 00 41 01 c6 44 89 b5 58 ff ff ff 4d 8d b7 10 03 00 00 4c 89 f7 e8 32 91 a0 d0 49 8b b7 30 03 00 00 4c 89 f7 8b 4e 58 <8b> 76 64 c6 07 00 0f 1f 40 00 29 f1 39 8d 58 ff ff ff 48 8b 95
[ 2640.132445] Kernel panic - not syncing: Hard LOCKUP
[ 2640.191499] CPU: 1 PID: 39224 Comm: drbd_r_iscsi-ss Kdump: loaded Tainted: G OE ------------ 3.10.0-957.12.1.el7.x86_64 #1
[ 2640.338684] Hardware name: /0W23H8, BIOS 2.1.8 04/30/2019
[ 2640.407120] Call Trace:
[ 2640.437181] <NMI> [<ffffffff91563021>] dump_stack+0x19/0x1b
[ 2640.506789] [<ffffffff9155c730>] panic+0xe8/0x21f
[ 2640.564875] [<ffffffff90e973df>] nmi_panic+0x3f/0x40
[ 2640.626070] [<ffffffff90f495a1>] watchdog_overflow_callback+0x121/0x140
[ 2640.706877] [<ffffffff90fa13e7>] __perf_event_overflow+0x57/0x100
[ 2640.781454] [<ffffffff90faaa54>] perf_event_overflow+0x14/0x20
[ 2640.852921] [<ffffffff90e0a6b0>] intel_pmu_handle_irq+0x220/0x510
[ 2640.927498] [<ffffffff91178028>] ? ioremap_page_range+0x2e8/0x480
[ 2641.002077] [<ffffffff90ffba34>] ? vunmap_page_range+0x234/0x470
[ 2641.075615] [<ffffffff9123d9d6>] ? ghes_copy_tofrom_phys+0x116/0x210
[ 2641.153204] [<ffffffff9123db70>] ? ghes_read_estatus+0xa0/0x190
[ 2641.225600] [<ffffffff9156c031>] perf_event_nmi_handler+0x31/0x50
[ 2641.300077] [<ffffffff9156d8fc>] nmi_handle.isra.0+0x8c/0x150
[ 2641.370399] [<ffffffff9156dbd8>] do_nmi+0x218/0x460
[ 2641.430348] [<ffffffff9156cd69>] end_repeat_nmi+0x1e/0x81
[ 2641.496526] [<ffffffff90f122fe>] ? native_queued_spin_lock_slowpath+0x1ce/0x200
[ 2641.585439] [<ffffffff90f122fe>] ? native_queued_spin_lock_slowpath+0x1ce/0x200
[ 2641.674335] [<ffffffff90f122fe>] ? native_queued_spin_lock_slowpath+0x1ce/0x200
[ 2641.763229] <EOE> <IRQ> [<ffffffff9155d1ab>] queued_spin_lock_slowpath+0xb/0xf
[ 2641.853381] [<ffffffff9156b6e7>] _raw_spin_lock_irqsave+0x37/0x40
[ 2641.927774] [<ffffffffc0b7357f>] drbd_rs_complete_io+0x3f/0x160 [drbd]
[ 2642.007354] [<ffffffffc0b4d8ac>] drbd_endio_write_sec_final+0x3cc/0x460 [drbd]
[ 2642.095228] [<ffffffffc0b4db18>] drbd_peer_request_endio+0x1d8/0x420 [drbd]
[ 2642.179992] [<ffffffff9107fc87>] bio_endio+0x67/0xb0
[ 2642.240899] [<ffffffffc0324125>] dec_pending+0x1b5/0x350 [dm_mod]
[ 2642.315269] [<ffffffff90fb96c7>] ? mempool_free_slab+0x17/0x20
[ 2642.386547] [<ffffffffc0324df1>] clone_endio+0x71/0x100 [dm_mod]
[ 2642.459884] [<ffffffff9107fc87>] bio_endio+0x67/0xb0
[ 2642.520783] [<ffffffff91145a40>] blk_update_request+0x90/0x360
[ 2642.592064] [<ffffffff912dacd4>] scsi_end_request+0x34/0x1e0
[ 2642.661259] [<ffffffff912db048>] scsi_io_completion+0x168/0x6a0
[ 2642.733575] [<ffffffff912d9a41>] ? scsi_done+0x21/0x60
[ 2642.796545] [<ffffffff912d04fc>] scsi_finish_command+0xdc/0x140
[ 2642.868855] [<ffffffff912da592>] scsi_softirq_done+0x132/0x160
[ 2642.940127] [<ffffffff9114d366>] blk_done_softirq+0x96/0xc0
[ 2643.008288] [<ffffffff90ea0f45>] __do_softirq+0xf5/0x280
[ 2643.073335] [<ffffffff915792ec>] call_softirq+0x1c/0x30
[ 2643.137344] [<ffffffff90e2e675>] do_softirq+0x65/0xa0
[ 2643.199280] [<ffffffff90ea12c5>] irq_exit+0x105/0x110
[ 2643.261216] [<ffffffff9157a5a6>] do_IRQ+0x56/0xf0
[ 2643.318999] [<ffffffff9156c362>] common_interrupt+0x162/0x162
[ 2643.389221] <EOI> [<ffffffffc0b6251b>] ? receive_Data+0x78b/0x18f0 [drbd]
[ 2643.473042] [<ffffffffc0b6250e>] ? receive_Data+0x77e/0x18f0 [drbd]
[ 2643.549500] [<ffffffffc09bc7f1>] ? dtt_recv+0xe1/0x190 [drbd_transport_tcp]
[ 2643.634258] [<ffffffffc0b69599>] drbd_receiver+0x479/0x7a0 [drbd]
[ 2643.708632] [<ffffffffc0b768b0>] ? do_retry+0x290/0x290 [drbd]
[ 2643.779799] [<ffffffffc0b7694e>] drbd_thread_setup+0x9e/0x1d0 [drbd]
[ 2643.857191] [<ffffffffc0b768b0>] ? do_retry+0x290/0x290 [drbd]
[ 2643.928359] [<ffffffff90ec1c71>] kthread+0xd1/0xe0
[ 2643.987073] [<ffffffff90ec1ba0>] ? insert_kthread_work+0x40/0x40
[ 2644.060326] [<ffffffff91575bdd>] ret_from_fork_nospec_begin+0x7/0x21
[ 2644.137719] [<ffffffff90ec1ba0>] ? insert_kthread_work+0x40/0x40
If I read the trace correctly, the receiver thread in receive_Data() ->
prepare_activity_log() takes al_lock with a plain spin_lock(), while a
block-layer completion softirq on the same CPU ends up in
drbd_rs_complete_io(), which tries to take the same lock with
spin_lock_irqsave() -- so the CPU spins on itself until the NMI watchdog
fires. I suspect the following patch could fix it, but I'm not sure:
diff --git a/drbd/drbd_receiver.c b/drbd/drbd_receiver.c
index 3b4c626..654aab6 100644
--- a/drbd/drbd_receiver.c
+++ b/drbd/drbd_receiver.c
@@ -2742,11 +2742,11 @@ prepare_activity_log(struct drbd_peer_request *peer_req)
 	 * See also drbd_request_prepare() for the "request" entry point. */
 	ecnt = atomic_add_return(nr_al_extents, &device->wait_for_actlog_ecnt);
-	spin_lock(&device->al_lock);
+	spin_lock_irq(&device->al_lock);
 	al = device->act_log;
 	nr = al->nr_elements;
 	used = al->used;
-	spin_unlock(&device->al_lock);
+	spin_unlock_irq(&device->al_lock);
 	/* note: due to the slight delay between being accounted in "used" after
 	 * being committed to the activity log with drbd_al_begin_io_commit(),
diff --git a/drbd/drbd_sender.c b/drbd/drbd_sender.c
index 5302628..e9ceaf8 100644
--- a/drbd/drbd_sender.c
+++ b/drbd/drbd_sender.c
@@ -2059,11 +2059,11 @@ skip_helper:
 		/* Forget potentially stale cached per resync extent bit-counts.
 		 * Open coded drbd_rs_cancel_all(device), we already have IRQs
 		 * disabled, and know the disk state is ok. */
-		spin_lock(&device->al_lock);
+		spin_lock_irq(&device->al_lock);
 		lc_reset(peer_device->resync_lru);
 		peer_device->resync_locked = 0;
 		peer_device->resync_wenr = LC_FREE;
-		spin_unlock(&device->al_lock);
+		spin_unlock_irq(&device->al_lock);
 	}
 	unlock_all_resources();
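
For context, here is a minimal sketch of the locking pattern I believe
the trace shows, and why spin_lock_irq() should help in the receiver
path. This is an illustration only, with made-up function names, not
actual DRBD code:

/* Hypothetical illustration of the suspected deadlock, not DRBD source. */
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(al_lock);	/* stands in for device->al_lock */

/* Process context (drbd_receiver -> receive_Data -> prepare_activity_log
 * in the trace): takes the lock WITHOUT disabling interrupts. */
static void process_context_path(void)
{
	spin_lock(&al_lock);
	/* If a block-layer completion softirq fires on this CPU right
	 * here, irq_context_path() below runs on top of this critical
	 * section while al_lock is still held. */
	spin_unlock(&al_lock);
}

/* Softirq context (scsi_softirq_done -> bio_endio -> ... ->
 * drbd_rs_complete_io in the trace): spins on the lock that the
 * interrupted thread still holds. The CPU can never get back to
 * release it, so the NMI watchdog eventually reports a hard lockup. */
static void irq_context_path(void)
{
	unsigned long flags;

	spin_lock_irqsave(&al_lock, flags);
	/* ... update resync extent accounting ... */
	spin_unlock_irqrestore(&al_lock, flags);
}

With spin_lock_irq() in the process-context path, the completion softirq
cannot interrupt the critical section on the same CPU, which is the
usual rule for locks shared with interrupt context. I'm less sure about
the drbd_sender.c hunk: the comment there says IRQs are already disabled
at that point, in which case the plain spin_lock() should be safe and
spin_unlock_irq() would actually re-enable interrupts too early.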
Best regards,
Stefan