[DRBD-user] linstor-proxmox hangs forever because of tainted kernel
Alexander Karamanlidis
alexander.karamanlidis at lindenbaum.eu
Tue May 28 18:16:51 CEST 2019
Hi Everyone,
we're having the same issue described here:
http://lists.linbit.com/pipermail/drbd-user/2018-November/024667.html
in exactly the same setup. Sadly, nobody answered that thread, so I'm
opening a new one.
Packages installed on both nodes:
root at node2:~# dpkg -l | grep 'drbd\|linstor'
ii  drbd-dkms          9.0.17-1   all    RAID 1 over TCP/IP for Linux module source
ii  drbd-utils         9.9.0-1    amd64  RAID 1 over TCP/IP for Linux (user utilities)
ii  drbdtop            0.2.1-1    amd64  like top, but for drbd
ii  linstor-client     0.9.5-1    all    Linstor client command line tool
ii  linstor-common     0.9.10-1   all    DRBD distributed resource management utility
ii  linstor-proxmox    3.0.3-1    all    DRBD distributed resource management utility
ii  linstor-satellite  0.9.10-1   all    DRBD distributed resource management utility
ii  python-linstor     0.9.5-1    all    Linstor python api library
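In case the taint matters (the traces further down show "Tainted: P O"), here is how the loaded DRBD kernel module could be checked against the drbd-dkms build listed above. These are generic commands added for illustration, not output from our nodes:

# version of the module the kernel actually loaded
cat /proc/drbd
modinfo drbd | grep -E '^(version|srcversion)'
# what dkms built for the running kernel
dkms status drbd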
Proxmox Error Message:
()
SUCCESS:
Description:
New resource definition 'vm-102-state-test123' created.
Details:
Resource definition 'vm-102-state-test123' UUID is: 7eab2e30-5a5a-4e57-a4e2-d5da7c58f770
SUCCESS:
Description:
Resource definition 'vm-102-state-test123' modified.
Details:
Resource definition 'vm-102-state-test123' UUID is: 7eab2e30-5a5a-4e57-a4e2-d5da7c58f770
SUCCESS:
New volume definition with number '0' of resource definition 'vm-102-state-test123' created.
SUCCESS:
Description:
Resource 'vm-102-state-test123' successfully autoplaced on 2 nodes
Details:
Used storage pool: 'drbdpool'
Used nodes: 'node1', 'node2'
SUCCESS:
Created resource 'vm-102-state-test123' on 'node2'
SUCCESS:
Created resource 'vm-102-state-test123' on 'node1'
SUCCESS:
Description:
Resource 'vm-102-state-test123' on 'node2' ready
Details:
Auto-placing resource: vm-102-state-test123
SUCCESS:
Description:
Resource 'vm-102-state-test123' on 'node1' ready
Details:
Auto-placing resource: vm-102-state-test123
SUCCESS:
Description:
New snapshot 'snap_vm-102-disk-1_test123' of resource 'vm-102-disk-1' registered.
Details:
Snapshot 'snap_vm-102-disk-1_test123' of resource 'vm-102-disk-1' UUID is: 832dbdf2-b3dc-46f9-8f1d-ac9d20c1790c
SUCCESS:
Suspended IO of 'vm-102-disk-1' on 'node2' for snapshot
SUCCESS:
Suspended IO of 'vm-102-disk-1' on 'node1' for snapshot
ERROR:
Description:
(Node: 'node2') Preparing resources for layer StorageLayer failed
Cause:
External command timed out
Details:
External command: lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
ERROR:
Description:
(Node: 'node1') Preparing resources for layer StorageLayer failed
Cause:
External command timed out
Details:
External command: lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
ERROR:
Description:
(Node: 'node2') Preparing resources for layer StorageLayer failed
Cause:
External command timed out
Details:
External command: lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
ERROR:
Description:
(Node: 'node1') Preparing resources for layer StorageLayer failed
Cause:
External command timed out
Details:
External command: lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
VM 102 qmp command 'savevm-end' failed - unable to connect to VM 102 qmp socket - timeout after 5992 retries
snapshot create failed: starting cleanup
error with cfs lock 'storage-drbdpool': Could not remove vm-102-state-test123: got lock timeout - aborting command
TASK ERROR: Could not create cluster wide snapshot for: vm-102-disk-1: exit code 10
Hanging Tasks:
root at node2:~# systemctl status linstor-satellite.service
● linstor-satellite.service - LINSTOR Satellite Service
Loaded: loaded (/lib/systemd/system/linstor-satellite.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/linstor-satellite.service.d
└─override.conf
Active: active (running) since Tue 2019-05-28 17:42:11 CEST; 30min ago
Main PID: 4612 (java)
Tasks: 303 (limit: 14745)
Memory: 263.1M
CPU: 24.237s
CGroup: /system.slice/linstor-satellite.service
├─ 4612 java -Xms32M -XX:MinHeapFreeRatio=40 -XX:MaxHeapFreeRatio=70 -classpath /usr/share/linstor-server/lib/satellite-0.9.10.jar:/usr/share/linstor-server/lib/reactive-streams-1.0.2.jar:/usr/share/linstor-server/lib/logback-classic-1.2.3.ja
├─ 5406 drbdsetup events2 all
├─16954 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
├─17236 vgs -o vg_name --noheadings
├─17525 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
├─17788 vgs -o vg_name --noheadings
├─18075 vgs -o vg_name --noheadings
├─18323 vgs -o vg_name --noheadings
├─18612 vgs -o vg_name --noheadings
├─18889 vgs -o vg_name --noheadings
├─19162 vgs -o vg_name --noheadings
├─19442 vgs -o vg_name --noheadings
├─19700 vgs -o vg_name --noheadings
├─19984 vgs -o vg_name --noheadings
├─20242 vgs -o vg_name --noheadings
├─20523 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
├─20779 vgs -o vg_name --noheadings
├─21047 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
├─21310 vgs -o vg_name --noheadings
├─21572 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
├─21821 vgs -o vg_name --noheadings
├─22083 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
├─22346 vgs -o vg_name --noheadings
├─22612 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
├─22864 vgs -o vg_name --noheadings
└─23135 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
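The lvs/vgs children keep piling up under the satellite: each poll apparently starts a new LVM scan and none of them ever return. The stuck scanners can be inspected with something like the following (command added for illustration, not part of the original output); on a hung node we would expect them to be in state 'D' (uninterruptible sleep), matching the hung-task messages below:

ps -C lvs,vgs -o pid,stat,wchan:32,args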
Kernel Stack Trace:
[ 967.562965] INFO: task kvm:18314 blocked for more than 120 seconds.
[ 967.563540] Tainted: P O 4.15.18-14-pve #1
[ 967.563974] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.564414] kvm D 0 18314 1 0x00000000
[ 967.564419] Call Trace:
[ 967.564435] __schedule+0x3e0/0x870
[ 967.564440] ? touch_pmd+0x32/0x40
[ 967.564442] schedule+0x36/0x80
[ 967.564460] __drbd_make_request+0x27c/0x5b0 [drbd]
[ 967.564465] ? wait_woken+0x80/0x80
[ 967.564472] drbd_make_request+0x4f/0x80 [drbd]
[ 967.564477] generic_make_request+0x123/0x2f0
[ 967.564479] submit_bio+0x73/0x140
[ 967.564480] ? submit_bio+0x73/0x140
[ 967.564483] blkdev_direct_IO+0x3c0/0x400
[ 967.564486] ? do_io_getevents+0xd0/0xd0
[ 967.564489] ? generic_update_time+0x7b/0xd0
[ 967.564490] ? current_time+0x32/0x70
[ 967.564494] generic_file_read_iter+0xb1/0xbb0
[ 967.564496] ? generic_file_read_iter+0xb1/0xbb0
[ 967.564498] ? __check_object_size+0xb3/0x190
[ 967.564500] ? _copy_from_user+0x36/0x70
[ 967.564502] blkdev_read_iter+0x35/0x40
[ 967.564503] aio_read+0xfc/0x140
[ 967.564505] do_io_submit+0x530/0x6c0
[ 967.564506] ? do_io_submit+0x530/0x6c0
[ 967.564508] SyS_io_submit+0x10/0x20
[ 967.564509] ? SyS_io_submit+0x10/0x20
[ 967.564512] do_syscall_64+0x73/0x130
[ 967.564514] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 967.564517] RIP: 0033:0x7f1f217d6717
[ 967.564517] RSP: 002b:00007f1de9999558 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[ 967.564519] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f1f217d6717
[ 967.564520] RDX: 00007f1de9999570 RSI: 0000000000000001 RDI: 00007f1f23117000
[ 967.564521] RBP: 00007f1de9999570 R08: 0000000000000008 R09: 0000000000000000
[ 967.564521] R10: 00007f1efc94ba60 R11: 0000000000000246 R12: 00007f1de9999560
[ 967.564522] R13: 00007f1efc94ba38 R14: 00007f1efc94ba10 R15: 000000114f118a00
[ 967.564531] INFO: task lvs:18730 blocked for more than 120 seconds.
[ 967.564975] Tainted: P O 4.15.18-14-pve #1
[ 967.565425] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.565894] lvs D 0 18730 5826 0x00000004
[ 967.565896] Call Trace:
[ 967.565898] __schedule+0x3e0/0x870
[ 967.565900] schedule+0x36/0x80
[ 967.565907] __drbd_make_request+0x27c/0x5b0 [drbd]
[ 967.565909] ? wait_woken+0x80/0x80
[ 967.565915] drbd_make_request+0x4f/0x80 [drbd]
[ 967.565917] generic_make_request+0x123/0x2f0
[ 967.565918] submit_bio+0x73/0x140
[ 967.565920] ? submit_bio+0x73/0x140
[ 967.565921] __blkdev_direct_IO_simple+0x1b1/0x380
[ 967.565922] ? __blkdev_get+0x240/0x4d0
[ 967.565924] ? bdget+0x120/0x120
[ 967.565925] blkdev_direct_IO+0x3f1/0x400
[ 967.565926] ? blkdev_direct_IO+0x3f1/0x400
[ 967.565929] ? dput+0xc7/0x1f0
[ 967.565931] ? mntput+0x24/0x40
[ 967.565932] ? terminate_walk+0x91/0xf0
[ 967.565934] ? path_openat+0x6f7/0x14a0
[ 967.565935] ? putname+0x54/0x60
[ 967.565936] generic_file_read_iter+0xb1/0xbb0
[ 967.565938] ? generic_file_read_iter+0xb1/0xbb0
[ 967.565944] ? kref_debug_sub+0x3c/0x60 [drbd]
[ 967.565946] ? _copy_to_user+0x2a/0x40
[ 967.565948] ? cp_new_stat+0x152/0x180
[ 967.565949] blkdev_read_iter+0x35/0x40
[ 967.565951] new_sync_read+0xe4/0x130
[ 967.565952] __vfs_read+0x29/0x40
[ 967.565954] vfs_read+0x96/0x130
[ 967.565955] SyS_read+0x55/0xc0
[ 967.565957] do_syscall_64+0x73/0x130
[ 967.565958] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 967.565959] RIP: 0033:0x7f1d049a9910
[ 967.565960] RSP: 002b:00007ffec8316918 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 967.565961] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1d049a9910
[ 967.565962] RDX: 0000000000001000 RSI: 0000556670ae3000 RDI: 0000000000000004
[ 967.565963] RBP: 00007ffec8316970 R08: 00007f1d04c68258 R09: 0000000000001000
[ 967.565963] R10: 0000000000000030 R11: 0000000000000246 R12: 0000556670ae3000
[ 967.565964] R13: 0000000000000000 R14: 0000000000000004 R15: 00007ffec83169d0
[ 967.565968] INFO: task lvs:18773 blocked for more than 120 seconds.
[ 967.566433] Tainted: P O 4.15.18-14-pve #1
[ 967.566944] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.567435] lvs D 0 18773 18766 0x00000004
[ 967.567438] Call Trace:
[ 967.567441] __schedule+0x3e0/0x870
[ 967.567443] schedule+0x36/0x80
[ 967.567450] __drbd_make_request+0x27c/0x5b0 [drbd]
[ 967.567452] ? wait_woken+0x80/0x80
[ 967.567458] drbd_make_request+0x4f/0x80 [drbd]
[ 967.567460] generic_make_request+0x123/0x2f0
[ 967.567462] submit_bio+0x73/0x140
[ 967.567464] ? submit_bio+0x73/0x140
[ 967.567465] __blkdev_direct_IO_simple+0x1b1/0x380
[ 967.567467] ? __blkdev_get+0x240/0x4d0
[ 967.567468] ? bdget+0x120/0x120
[ 967.567470] blkdev_direct_IO+0x3f1/0x400
[ 967.567471] ? blkdev_direct_IO+0x3f1/0x400
[ 967.567472] ? dput+0xc7/0x1f0
[ 967.567474] ? mntput+0x24/0x40
[ 967.567476] ? terminate_walk+0x91/0xf0
[ 967.567478] ? path_openat+0x6f7/0x14a0
[ 967.567479] ? putname+0x54/0x60
[ 967.567480] generic_file_read_iter+0xb1/0xbb0
[ 967.567482] ? generic_file_read_iter+0xb1/0xbb0
[ 967.567488] ? kref_debug_sub+0x3c/0x60 [drbd]
[ 967.567490] ? _copy_to_user+0x2a/0x40
[ 967.567492] ? cp_new_stat+0x152/0x180
[ 967.567493] blkdev_read_iter+0x35/0x40
[ 967.567494] new_sync_read+0xe4/0x130
[ 967.567497] __vfs_read+0x29/0x40
[ 967.567499] vfs_read+0x96/0x130
[ 967.567500] SyS_read+0x55/0xc0
[ 967.567502] do_syscall_64+0x73/0x130
[ 967.567504] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 967.567505] RIP: 0033:0x7f313938e910
[ 967.567506] RSP: 002b:00007fffa10d77c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 967.567507] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f313938e910
[ 967.567508] RDX: 0000000000001000 RSI: 0000564dbe7da000 RDI: 0000000000000004
[ 967.567509] RBP: 00007fffa10d7820 R08: 00007f313964d248 R09: 0000000000001000
[ 967.567509] R10: 0000000000000030 R11: 0000000000000246 R12: 0000564dbe7da000
[ 967.567510] R13: 0000000000000000 R14: 0000000000000004 R15: 00007fffa10d7880
We also have LVM_THIN Storage Pools.
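Our current reading of the traces (please correct us if this is wrong): the lvs spawned by the satellite is itself blocked in __drbd_make_request, i.e. it is trying to read from a DRBD device whose IO was just suspended for the snapshot, so the LVM scan can never finish, LINSTOR times out, and IO is never resumed. If that is what is happening, keeping LVM from opening DRBD devices at all might avoid the deadlock. A sketch of such a filter in /etc/lvm/lvm.conf (our assumption, not something we have verified; the accept pattern would have to fit the local device layout):

devices {
    # reject DRBD devices so lvs/vgs never read from /dev/drbdX,
    # accept everything else (adjust to the local setup)
    global_filter = [ "r|^/dev/drbd|", "a|.*|" ]
}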
Maybe somebody here can help us resolve this thread, and the earlier one as well.
Kind Regards,
Alex