[DRBD-user] linstor-proxmox hangs forever because of tainted kernel

Alexander Karamanlidis alexander.karamanlidis at lindenbaum.eu
Tue May 28 18:16:51 CEST 2019


Hi Everyone,


we're hitting the same issue described here:
http://lists.linbit.com/pipermail/drbd-user/2018-November/024667.html
in exactly the same setup. Sadly, nobody answered that thread, so I'm
opening a new one.


Packages installed on both nodes:


root@node2:~# dpkg -l | grep 'drbd\|linstor'
ii  drbd-dkms          9.0.17-1   all    RAID 1 over TCP/IP for Linux module source
ii  drbd-utils         9.9.0-1    amd64  RAID 1 over TCP/IP for Linux (user utilities)
ii  drbdtop            0.2.1-1    amd64  like top, but for drbd
ii  linstor-client     0.9.5-1    all    Linstor client command line tool
ii  linstor-common     0.9.10-1   all    DRBD distributed resource management utility
ii  linstor-proxmox    3.0.3-1    all    DRBD distributed resource management utility
ii  linstor-satellite  0.9.10-1   all    DRBD distributed resource management utility
ii  python-linstor     0.9.5-1    all    Linstor python api library
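
(For anyone trying to reproduce this: the version of the actually loaded
DRBD kernel module can be cross-checked against the drbd-dkms package
with the standard commands below; we include them only as a suggestion.)

root@node2:~# cat /proc/drbd
root@node2:~# modinfo drbd | grep '^version'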



Proxmox Error Message:


()
SUCCESS:
Description:
    New resource definition 'vm-102-state-test123' created.
Details:
    Resource definition 'vm-102-state-test123' UUID is: 7eab2e30-5a5a-4e57-a4e2-d5da7c58f770
SUCCESS:
Description:
    Resource definition 'vm-102-state-test123' modified.
Details:
    Resource definition 'vm-102-state-test123' UUID is: 7eab2e30-5a5a-4e57-a4e2-d5da7c58f770
SUCCESS:
    New volume definition with number '0' of resource definition 'vm-102-state-test123' created.
SUCCESS:
Description:
    Resource 'vm-102-state-test123' successfully autoplaced on 2 nodes
Details:
    Used storage pool: 'drbdpool'
    Used nodes: 'node1', 'node2'
SUCCESS:
    Created resource 'vm-102-state-test123' on 'node2'
SUCCESS:
    Created resource 'vm-102-state-test123' on 'node1'
SUCCESS:
Description:
    Resource 'vm-102-state-test123' on 'node2' ready
Details:
    Auto-placing resource: vm-102-state-test123
SUCCESS:
Description:
    Resource 'vm-102-state-test123' on 'node1' ready
Details:
    Auto-placing resource: vm-102-state-test123
SUCCESS:
Description:
    New snapshot 'snap_vm-102-disk-1_test123' of resource 'vm-102-disk-1' registered.
Details:
    Snapshot 'snap_vm-102-disk-1_test123' of resource 'vm-102-disk-1' UUID is: 832dbdf2-b3dc-46f9-8f1d-ac9d20c1790c
SUCCESS:
    Suspended IO of 'vm-102-disk-1' on 'node2' for snapshot
SUCCESS:
    Suspended IO of 'vm-102-disk-1' on 'node1' for snapshot
ERROR:
Description:
    (Node: 'node2') Preparing resources for layer StorageLayer failed
Cause:
    External command timed out
Details:
    External command: lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
ERROR:
Description:
    (Node: 'node1') Preparing resources for layer StorageLayer failed
Cause:
    External command timed out
Details:
    External command: lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
ERROR:
Description:
    (Node: 'node2') Preparing resources for layer StorageLayer failed
Cause:
    External command timed out
Details:
    External command: lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
ERROR:
Description:
    (Node: 'node1') Preparing resources for layer StorageLayer failed
Cause:
    External command timed out
Details:
    External command: lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
VM 102 qmp command 'savevm-end' failed - unable to connect to VM 102 qmp socket - timeout after 5992 retries
snapshot create failed: starting cleanup
error with cfs lock 'storage-drbdpool': Could not remove vm-102-state-test123: got lock timeout - aborting command
TASK ERROR: Could not create cluster wide snapshot for: vm-102-disk-1: exit code 10
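
The timing-out command can also be issued by hand; given the pile-up of
lvs/vgs children shown below, we expect it to hang the same way while the
snapshot is in flight (the shell quoting of the separator is our own, the
argument list is copied verbatim from the error above):

root@node2:~# time lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr \
        --separator ';' --noheadings --units k --nosuffix drbdpool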




Hanging Tasks:


root@node2:~# systemctl status linstor-satellite.service
● linstor-satellite.service - LINSTOR Satellite Service
   Loaded: loaded (/lib/systemd/system/linstor-satellite.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/linstor-satellite.service.d
           └─override.conf
   Active: active (running) since Tue 2019-05-28 17:42:11 CEST; 30min ago
 Main PID: 4612 (java)
    Tasks: 303 (limit: 14745)
   Memory: 263.1M
      CPU: 24.237s
   CGroup: /system.slice/linstor-satellite.service
           ├─ 4612 java -Xms32M -XX:MinHeapFreeRatio=40 -XX:MaxHeapFreeRatio=70 -classpath /usr/share/linstor-server/lib/satellite-0.9.10.jar:/usr/share/linstor-server/lib/reactive-streams-1.0.2.jar:/usr/share/linstor-server/lib/logback-classic-1.2.3.ja
           ├─ 5406 drbdsetup events2 all
           ├─16954 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
           ├─17236 vgs -o vg_name --noheadings
           ├─17525 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
           ├─17788 vgs -o vg_name --noheadings
           ├─18075 vgs -o vg_name --noheadings
           ├─18323 vgs -o vg_name --noheadings
           ├─18612 vgs -o vg_name --noheadings
           ├─18889 vgs -o vg_name --noheadings
           ├─19162 vgs -o vg_name --noheadings
           ├─19442 vgs -o vg_name --noheadings
           ├─19700 vgs -o vg_name --noheadings
           ├─19984 vgs -o vg_name --noheadings
           ├─20242 vgs -o vg_name --noheadings
           ├─20523 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
           ├─20779 vgs -o vg_name --noheadings
           ├─21047 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
           ├─21310 vgs -o vg_name --noheadings
           ├─21572 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
           ├─21821 vgs -o vg_name --noheadings
           ├─22083 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
           ├─22346 vgs -o vg_name --noheadings
           ├─22612 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
           ├─22864 vgs -o vg_name --noheadings
           └─23135 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
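
All of those piled-up lvs/vgs children appear to be stuck in
uninterruptible sleep. For whoever looks into this: something like the
following should show their state and the kernel symbol they are blocked
on (we expect D state with a wchan inside drbd, matching the traces below):

root@node2:~# ps -eo pid,stat,wchan:32,args | grep -E '[l]vs -o|[v]gs -o'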



Kernel Stack Trace:


[  967.562965] INFO: task kvm:18314 blocked for more than 120 seconds.
[  967.563540]       Tainted: P           O     4.15.18-14-pve #1
[  967.563974] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  967.564414] kvm             D    0 18314      1 0x00000000
[  967.564419] Call Trace:
[  967.564435]  __schedule+0x3e0/0x870
[  967.564440]  ? touch_pmd+0x32/0x40
[  967.564442]  schedule+0x36/0x80
[  967.564460]  __drbd_make_request+0x27c/0x5b0 [drbd]
[  967.564465]  ? wait_woken+0x80/0x80
[  967.564472]  drbd_make_request+0x4f/0x80 [drbd]
[  967.564477]  generic_make_request+0x123/0x2f0
[  967.564479]  submit_bio+0x73/0x140
[  967.564480]  ? submit_bio+0x73/0x140
[  967.564483]  blkdev_direct_IO+0x3c0/0x400
[  967.564486]  ? do_io_getevents+0xd0/0xd0
[  967.564489]  ? generic_update_time+0x7b/0xd0
[  967.564490]  ? current_time+0x32/0x70
[  967.564494]  generic_file_read_iter+0xb1/0xbb0
[  967.564496]  ? generic_file_read_iter+0xb1/0xbb0
[  967.564498]  ? __check_object_size+0xb3/0x190
[  967.564500]  ? _copy_from_user+0x36/0x70
[  967.564502]  blkdev_read_iter+0x35/0x40
[  967.564503]  aio_read+0xfc/0x140
[  967.564505]  do_io_submit+0x530/0x6c0
[  967.564506]  ? do_io_submit+0x530/0x6c0
[  967.564508]  SyS_io_submit+0x10/0x20
[  967.564509]  ? SyS_io_submit+0x10/0x20
[  967.564512]  do_syscall_64+0x73/0x130
[  967.564514]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  967.564517] RIP: 0033:0x7f1f217d6717
[  967.564517] RSP: 002b:00007f1de9999558 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[  967.564519] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f1f217d6717
[  967.564520] RDX: 00007f1de9999570 RSI: 0000000000000001 RDI: 00007f1f23117000
[  967.564521] RBP: 00007f1de9999570 R08: 0000000000000008 R09: 0000000000000000
[  967.564521] R10: 00007f1efc94ba60 R11: 0000000000000246 R12: 00007f1de9999560
[  967.564522] R13: 00007f1efc94ba38 R14: 00007f1efc94ba10 R15: 000000114f118a00
[  967.564531] INFO: task lvs:18730 blocked for more than 120 seconds.
[  967.564975]       Tainted: P           O     4.15.18-14-pve #1
[  967.565425] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  967.565894] lvs             D    0 18730   5826 0x00000004
[  967.565896] Call Trace:
[  967.565898]  __schedule+0x3e0/0x870
[  967.565900]  schedule+0x36/0x80
[  967.565907]  __drbd_make_request+0x27c/0x5b0 [drbd]
[  967.565909]  ? wait_woken+0x80/0x80
[  967.565915]  drbd_make_request+0x4f/0x80 [drbd]
[  967.565917]  generic_make_request+0x123/0x2f0
[  967.565918]  submit_bio+0x73/0x140
[  967.565920]  ? submit_bio+0x73/0x140
[  967.565921]  __blkdev_direct_IO_simple+0x1b1/0x380
[  967.565922]  ? __blkdev_get+0x240/0x4d0
[  967.565924]  ? bdget+0x120/0x120
[  967.565925]  blkdev_direct_IO+0x3f1/0x400
[  967.565926]  ? blkdev_direct_IO+0x3f1/0x400
[  967.565929]  ? dput+0xc7/0x1f0
[  967.565931]  ? mntput+0x24/0x40
[  967.565932]  ? terminate_walk+0x91/0xf0
[  967.565934]  ? path_openat+0x6f7/0x14a0
[  967.565935]  ? putname+0x54/0x60
[  967.565936]  generic_file_read_iter+0xb1/0xbb0
[  967.565938]  ? generic_file_read_iter+0xb1/0xbb0
[  967.565944]  ? kref_debug_sub+0x3c/0x60 [drbd]
[  967.565946]  ? _copy_to_user+0x2a/0x40
[  967.565948]  ? cp_new_stat+0x152/0x180
[  967.565949]  blkdev_read_iter+0x35/0x40
[  967.565951]  new_sync_read+0xe4/0x130
[  967.565952]  __vfs_read+0x29/0x40
[  967.565954]  vfs_read+0x96/0x130
[  967.565955]  SyS_read+0x55/0xc0
[  967.565957]  do_syscall_64+0x73/0x130
[  967.565958]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  967.565959] RIP: 0033:0x7f1d049a9910
[  967.565960] RSP: 002b:00007ffec8316918 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  967.565961] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1d049a9910
[  967.565962] RDX: 0000000000001000 RSI: 0000556670ae3000 RDI: 0000000000000004
[  967.565963] RBP: 00007ffec8316970 R08: 00007f1d04c68258 R09: 0000000000001000
[  967.565963] R10: 0000000000000030 R11: 0000000000000246 R12: 0000556670ae3000
[  967.565964] R13: 0000000000000000 R14: 0000000000000004 R15: 00007ffec83169d0
[  967.565968] INFO: task lvs:18773 blocked for more than 120 seconds.
[  967.566433]       Tainted: P           O     4.15.18-14-pve #1
[  967.566944] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  967.567435] lvs             D    0 18773  18766 0x00000004
[  967.567438] Call Trace:
[  967.567441]  __schedule+0x3e0/0x870
[  967.567443]  schedule+0x36/0x80
[  967.567450]  __drbd_make_request+0x27c/0x5b0 [drbd]
[  967.567452]  ? wait_woken+0x80/0x80
[  967.567458]  drbd_make_request+0x4f/0x80 [drbd]
[  967.567460]  generic_make_request+0x123/0x2f0
[  967.567462]  submit_bio+0x73/0x140
[  967.567464]  ? submit_bio+0x73/0x140
[  967.567465]  __blkdev_direct_IO_simple+0x1b1/0x380
[  967.567467]  ? __blkdev_get+0x240/0x4d0
[  967.567468]  ? bdget+0x120/0x120
[  967.567470]  blkdev_direct_IO+0x3f1/0x400
[  967.567471]  ? blkdev_direct_IO+0x3f1/0x400
[  967.567472]  ? dput+0xc7/0x1f0
[  967.567474]  ? mntput+0x24/0x40
[  967.567476]  ? terminate_walk+0x91/0xf0
[  967.567478]  ? path_openat+0x6f7/0x14a0
[  967.567479]  ? putname+0x54/0x60
[  967.567480]  generic_file_read_iter+0xb1/0xbb0
[  967.567482]  ? generic_file_read_iter+0xb1/0xbb0
[  967.567488]  ? kref_debug_sub+0x3c/0x60 [drbd]
[  967.567490]  ? _copy_to_user+0x2a/0x40
[  967.567492]  ? cp_new_stat+0x152/0x180
[  967.567493]  blkdev_read_iter+0x35/0x40
[  967.567494]  new_sync_read+0xe4/0x130
[  967.567497]  __vfs_read+0x29/0x40
[  967.567499]  vfs_read+0x96/0x130
[  967.567500]  SyS_read+0x55/0xc0
[  967.567502]  do_syscall_64+0x73/0x130
[  967.567504]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  967.567505] RIP: 0033:0x7f313938e910
[  967.567506] RSP: 002b:00007fffa10d77c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  967.567507] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f313938e910
[  967.567508] RDX: 0000000000001000 RSI: 0000564dbe7da000 RDI: 0000000000000004
[  967.567509] RBP: 00007fffa10d7820 R08: 00007f313964d248 R09: 0000000000001000
[  967.567509] R10: 0000000000000030 R11: 0000000000000246 R12: 0000564dbe7da000
[  967.567510] R13: 0000000000000000 R14: 0000000000000004 R15: 00007fffa10d7880
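
One thing that strikes us in these traces: the hung lvs tasks are sleeping
in __drbd_make_request, i.e. LVM is apparently reading from the /dev/drbd*
devices themselves while their I/O is suspended for the snapshot, which
would explain why the scan never returns. Would excluding the DRBD devices
from LVM scanning be a sane workaround? Something along these lines in
/etc/lvm/lvm.conf (untested on our side, the regex is just our guess):

devices {
    # reject /dev/drbd* so lvs/vgs never open the suspended DRBD devices;
    # the backing LVs in drbdpool are still scanned normally
    global_filter = [ "r|^/dev/drbd.*|" ]
}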




We are also using LVM_THIN storage pools.


Hopefully somebody can help us resolve this thread, and the earlier one
as well.



Kind Regards,

Alex


