[DRBD-user] linstor-proxmox hangs forever because of tainted kernel

Alexander Karamanlidis alexander.karamanlidis at lindenbaum.eu
Wed May 29 11:02:10 CEST 2019

On 29.05.19 at 10:26, Robert Altnoeder wrote:
> On 5/28/19 6:16 PM, Alexander Karamanlidis wrote:
>> hangs forever because of tainted kernel
> Those hangs have nothing to do with the taint status that the kernel
> shows, since none of the problem-related taint flags are set.
> The kernel shows a taint of P O, which is
> - P: Proprietary module loaded
> - O: Out-of-tree module loaded
> That is a normal runtime status that does not indicate any problems.
> What's more interesting are the messages emitted by LINSTOR:
>>     Suspended IO of 'vm-102-disk-1' on 'node2' for snapshot
>>     Suspended IO of 'vm-102-disk-1' on 'node1' for snapshot
>> Description:
>>     (Node: 'node1') Preparing resources for layer StorageLayer failed
>> Cause:
>>     External command timed out
>> Details:
>>     External command: lvs -o
>> lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr
>> --separator ; --noheadings --units k --nosuffix drbdpool
>> VM 102 qmp command 'savevm-end' failed - unable to connect to VM 102
>> qmp socket - timeout after 5992 retries
>> snapshot create failed: starting cleanup
>> error with cfs lock 'storage-drbdpool': Could not remove
>> vm-102-state-test123: got lock timeout - aborting command
>> TASK ERROR: Could not create cluster wide snapshot for: vm-102-disk-1:
>> exit code 10
> Looks like LVM, or some subtask of it, is accessing the storage of
> vm-102-disk-1 through DRBD (maybe LVM scanning DRBD devices), which will
> hang, because I/O on that device is suspended in order to take a
> cluster-wide consistent snapshot.

Correct, yes. The subtask is the command LINSTOR is executing above:
"lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr
--separator ";" --noheadings --units k --nosuffix drbdpool"
The snapshot process stops at this point, and after 180 seconds I get the
traces I provided, over and over again.
However, if the same command is run outside of a snapshot process (normally
from bash), it works fine, so I guess it could
have something to do with the suspended I/O. I just don't really
understand how this can occur.

> My guess is that this is an LVM configuration error that causes LVM to
> access DRBD devices, a very common source of timeout problems of all kinds.
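
For readers following along: a filter of the kind hinted at above is
normally set in the devices section of /etc/lvm/lvm.conf. The fragment
below is only a sketch, assuming the DRBD devices show up as /dev/drbd*
on the nodes. The pattern is an assumption and has to match the local
device naming, and whether it resolves this particular hang is not
confirmed in this thread.

```
# /etc/lvm/lvm.conf (fragment, illustrative only)
devices {
    # Reject DRBD devices so that LVM tools such as lvs never open them.
    # Devices that match no pattern are still accepted by default.
    # The path pattern is an assumption; adjust it to the local naming.
    global_filter = [ "r|^/dev/drbd.*|" ]
}
```

The setting currently in effect can be printed with
"lvmconfig devices/global_filter" on each node.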

We didn't configure anything special for our LVM_THIN storage pools,
except that we increased the metadata size to 4G.
And I couldn't find any information about which settings to use when
running LINSTOR with Proxmox and LVM_THIN storage pools.
To double-check, these are the exact steps we used:


ctrl slot=0 create type=ld drives=1I:3:1,1I:3:2,1I:3:3,1I:3:4,2I:3:5,2I:3:6,2I:3:7,2I:3:8 raid=1+0
ctrl slot=0 array B add spares=4I:2:6

pvcreate /dev/sdb
vgcreate drbdpool /dev/sdb

lvcreate -l95%FREE --thinpool drbdpool/drbdpool
lvextend --poolmetadatasize +4G drbdpool/drbdpool

Just so it's mentioned, we also changed some DRBD options on the
LINSTOR controller so that they are better suited to our dedicated
25G DRBD network.

These were the following:

linstor controller drbd-options \
--after-sb-0pri=discard-zero-changes \
--after-sb-1pri=discard-secondary

linstor controller drbd-options \
--max-buffers=36864 \
--rcvbuf-size=2097152

linstor controller drbd-options \
--c-fill-target=10240 \
--c-max-rate=737280 \
--c-min-rate=20480

linstor controller drbd-options --verify-alg sha1 --csums-alg sha1
linstor controller drbd-options --resync-rate=2000000

If more configuration is needed, or if we misconfigured
something, we are not aware of it.

>> We also have LVM_THIN Storage Pools. 
> Those also block whenever they run full, so checking that may be a good
> idea too.

Did that real quick. It seems we have enough space left.

root@node1:~# lvs
  LV                  VG       Attr       LSize   Pool     Origin Data%  Meta%  Move Log Cpy%Sync Convert
  drbdpool            drbdpool twi-aotz--   6.64t                 6.56  
  vm-100-disk-0_00000 drbdpool Vwi-aotz--   4.00g drbdpool       
  vm-102-disk-1_00000 drbdpool Vwi-aotz-- 100.02g drbdpool       
  vm-103-disk-1_00000 drbdpool Vwi-aotz-- 100.02g drbdpool       
  vm-104-disk-1_00000 drbdpool Vwi-aotz--   5.00g drbdpool       
  vm-104-disk-2_00000 drbdpool Vwi-aotz-- 100.02g drbdpool       
  vm-105-disk-1_00000 drbdpool Vwi-aotz--   5.00g drbdpool       
  vm-105-disk-2_00000 drbdpool Vwi-aotz-- 115.03g drbdpool       
  vm-106-disk-1_00000 drbdpool Vwi-aotz--   5.00g drbdpool       
  vm-106-disk-2_00000 drbdpool Vwi-aotz-- 215.05g drbdpool       
  vm-107-disk-1_00000 drbdpool Vwi-aotz--   5.00g drbdpool       
  vm-108-disk-1_00000 drbdpool Vwi-aotz--   5.00g drbdpool       
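
Since thin pools block when they run full, the same check can be
scripted instead of read off the table by eye. The following is a
minimal sketch, not something we run in production: the function name
check_thin_usage and the 80 percent default threshold are made up for
illustration. It reads ';'-separated lvs output on stdin and reports
thin pools whose data or metadata usage has reached the threshold.

```shell
#!/bin/sh
# check_thin_usage: hypothetical helper, illustrative only.
# stdin: lines of "lv_name;lv_attr;data_percent;metadata_percent"
# $1:    warning threshold in percent (default 80, an arbitrary choice)
check_thin_usage() {
    awk -F';' -v limit="${1:-80}" '
        {
            gsub(/^[ \t]+/, "", $1)      # lvs indents its output
            if ($2 !~ /^t/) next         # attr starting with "t" = thin pool
            if ($3 + 0 >= limit + 0 || $4 + 0 >= limit + 0)
                printf "WARN %s data=%s%% meta=%s%%\n", $1, $3, $4
        }'
}
```

It would be fed like this, for the volume group used here:

lvs --noheadings --separator ';' -o lv_name,lv_attr,data_percent,metadata_percent drbdpool | check_thin_usage 80

No output means no pool has reached the threshold.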

Maybe worth mentioning: we only configured LINSTOR and tested the
snapshot function yesterday.

So it has never worked for us. Just to clarify, this is not something
that worked in the past and then broke.

In case it matters, this is our storage.cfg Proxmox config entry for the
DRBD resources:

drbd: drbdpool
        content images,rootdir
        controllervm 100
        nodes node2,node1
        redundancy 2
> br,
> Robert

Thanks for the quick reply, Robert.

