[DRBD-user] linstor-proxmox hangs forever because of tainted kernel
Alexander Karamanlidis
alexander.karamanlidis at lindenbaum.eu
Wed May 29 11:02:10 CEST 2019
On 29.05.19 at 10:26, Robert Altnoeder wrote:
> On 5/28/19 6:16 PM, Alexander Karamanlidis wrote:
>
>> hangs forever because of tainted kernel
> Those hangs have nothing to do with the taint status that the kernel
> shows, since none of the problem-related taint flags are set.
> The kernel shows a taint of P O, which is
>
> - P: Proprietary module loaded
> - O: Out-of-tree module loaded
>
> That is a normal runtime status that does not indicate any problems.
>
>
> What's more interesting are the messages emitted by LINSTOR:
>
>> SUCCESS:
>>
>> Suspended IO of 'vm-102-disk-1' on 'node2' for snapshot
>> SUCCESS:
>> Suspended IO of 'vm-102-disk-1' on 'node1' for snapshot
>>
>> ERROR:
>> Description:
>> (Node: 'node1') Preparing resources for layer StorageLayer failed
>> Cause:
>> External command timed out
>> Details:
>> External command: lvs -o
>> lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr
>> --separator ; --noheadings --units k --nosuffix drbdpool
>> VM 102 qmp command 'savevm-end' failed - unable to connect to VM 102
>> qmp socket - timeout after 5992 retries
>> snapshot create failed: starting cleanup
>> error with cfs lock 'storage-drbdpool': Could not remove
>> vm-102-state-test123: got lock timeout - aborting command
>> TASK ERROR: Could not create cluster wide snapshot for: vm-102-disk-1:
>> exit code 10
>>
> Looks like LVM, or some subtask of it, is accessing the storage of
> vm-102-disk-1 through DRBD (maybe LVM scanning DRBD devices), which will
> hang, because I/O on that device is suspended in order to take a
> cluster-wide consistent snapshot.
Correct, yes. The subtask is the command LINSTOR is executing above:
"lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr
--separator ";" --noheadings --units k --nosuffix drbdpool"
The snapshot process stops at this point, and after 180 seconds I get the
traces I provided, over and over again.
However, when the same command is executed outside of a snapshot (normally
from bash), it works just fine, so I guess it could have something to do
with the suspended I/O. I don't really understand how this can occur, though.
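If it helps, I guess one way to see what the hanging lvs is actually touching
would be to trace it while a snapshot is in progress (just a diagnostic sketch
on my side, not something LINSTOR itself runs):

# Trace which block devices lvs opens while scanning; if the command hangs,
# the last open/openat line in the trace shows the device it got stuck on
# (a /dev/drbd* device with suspended I/O would explain the hang).
strace -f -e trace=open,openat -o /tmp/lvs.trace \
    lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr \
        --separator ';' --noheadings --units k --nosuffix drbdpool
grep drbd /tmp/lvs.trace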
> My guess is that this is an LVM configuration error that causes LVM to
> access DRBD devices, a very common source of timeout problems of all kinds.
>
We didn't configure anything special for our LVM_THIN storage pools,
except that we increased the metadata size to 4G.
I also couldn't find any information about which settings need to be set
when using LINSTOR with Proxmox and LVM_THIN storage pools.
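If the problem really is LVM scanning the DRBD devices, I assume the usual
mitigation would be a device filter in /etc/lvm/lvm.conf. A sketch of what I
have in mind, assuming /dev/sdb is the only physical volume (as in the steps
below); I don't know whether linstor-proxmox expects a specific filter:

# /etc/lvm/lvm.conf (sketch only; adjust the accepted device to the real PV)
devices {
    # never scan DRBD devices (their I/O may be suspended during snapshots),
    # accept the disk backing the drbdpool VG, reject everything else
    global_filter = [ "r|^/dev/drbd.*|", "a|^/dev/sdb$|", "r|.*|" ]
}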
For double-checking, these are the exact steps we used:
ssacli
ctrl slot=0 create type=ld
drives=1I:3:1,1I:3:2,1I:3:3,1I:3:4,2I:3:5,2I:3:6,2I:3:7,2I:3:8 raid=1+0
ctrl slot=0 array B add spares=4I:2:6
pvcreate /dev/sdb
vgcreate drbdpool /dev/sdb
lvcreate -l95%FREE --thinpool drbdpool/drbdpool
lvextend --poolmetadatasize +4G drbdpool/drbdpool
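(The resulting pool and the enlarged metadata LV can be double-checked with
something like the following; just a sketch using standard lvs fields:)

# show the thin pool, its metadata size and current data/metadata usage
lvs -o lv_name,lv_size,lv_metadata_size,data_percent,metadata_percent drbdpool/drbdpool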
Just to make sure, we also changed some drbd-options on the
linstor-controller so that they are more suitable for our dedicated 25G
DRBD network.
These were the following:
linstor controller drbd-options \
--after-sb-0pri=discard-zero-changes \
--after-sb-1pri=discard-secondary \
--after-sb-2pri=disconnect
linstor controller drbd-options \
--max-buffers=36864 \
--rcvbuf-size=2097152 \
--sndbuf-size=1048576
linstor controller drbd-options \
--c-fill-target=10240 \
--c-max-rate=737280 \
--c-min-rate=20480 \
--c-plan-ahead=10
linstor controller drbd-options --verify-alg sha1 --csums-alg sha1
linstor controller drbd-options --resync-rate=2000000
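To double-check what actually got applied, I would look at the controller
properties and at what DRBD is running with for one of the resources, roughly
like this (the resource name is just the one from the log above):

# DRBD options stored as controller-level properties in LINSTOR
linstor controller list-properties
# effective options of the running DRBD resource
drbdsetup show vm-102-disk-1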
If more configuration is needed, or if we misconfigured something, we were
not aware of it.
>> We also have LVM_THIN Storage Pools.
> Those also block whenever they run full, so checking that may be a good
> idea too.
I checked that real quick; it seems like we have enough space left.
root@node1:~# lvs
  LV                  VG       Attr       LSize   Pool     Origin Data%  Meta%  Move Log Cpy%Sync Convert
  drbdpool            drbdpool twi-aotz--   6.64t                   6.56   1.52
  vm-100-disk-0_00000 drbdpool Vwi-aotz--   4.00g drbdpool         99.98
  vm-102-disk-1_00000 drbdpool Vwi-aotz-- 100.02g drbdpool        100.00
  vm-103-disk-1_00000 drbdpool Vwi-aotz-- 100.02g drbdpool         76.55
  vm-104-disk-1_00000 drbdpool Vwi-aotz--   5.00g drbdpool        100.00
  vm-104-disk-2_00000 drbdpool Vwi-aotz-- 100.02g drbdpool         69.36
  vm-105-disk-1_00000 drbdpool Vwi-aotz--   5.00g drbdpool        100.00
  vm-105-disk-2_00000 drbdpool Vwi-aotz-- 115.03g drbdpool         52.83
  vm-106-disk-1_00000 drbdpool Vwi-aotz--   5.00g drbdpool        100.00
  vm-106-disk-2_00000 drbdpool Vwi-aotz-- 215.05g drbdpool         29.55
  vm-107-disk-1_00000 drbdpool Vwi-aotz--   5.00g drbdpool          0.02
  vm-108-disk-1_00000 drbdpool Vwi-aotz--   5.00g drbdpool        100.00
It may be worth mentioning that we only set up LINSTOR and tested the
snapshot function yesterday, so snapshots have never worked for us.
Just to clarify: this is not something that used to work and then broke.
In case it matters, this is the entry for the DRBD storage in our Proxmox
storage.cfg:
drbd: drbdpool
content images,rootdir
controller 10.1.128.158
controllervm 100
nodes node2,node1
redundancy 2
>
> br,
> Robert
Thanks for the quick reply, Robert.
BR,
Alex
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
--
Kind regards,
Alexander Karamanlidis
IT Systemadministrator
Phone: +49 721 480 848 – 609
Lindenbaum GmbH Conferencing - Virtual Dialogues
Head office: Ludwig-Erhard-Allee 34 im Park Office, 76131 Karlsruhe
Registration court: Amtsgericht Mannheim, HRB 706184
Managing director: Maarten Kronenburg
Tax number: 35007/02060, USt. ID: DE 263797265