[DRBD-user] linstor issues

Adam Goryachev mailinglists at websitemanagers.com.au
Tue Jun 23 04:53:23 CEST 2020


I've tried to follow the limited documentation on installing DRBD 9 and
linstor, and have sort of managed to get things working. I have three
nodes (castle, san5 and san6). I've rebuilt the various Ubuntu packages
under Debian and installed them on Debian buster on all three machines:

drbd-dkms_9.0.22-1ppa1~bionic1_all.deb
drbd-utils_9.13.1-1ppa1~bionic1_amd64.deb
linstor-controller_1.7.1-1ppa1~bionic1_all.deb
linstor-satellite_1.7.1-1ppa1~bionic1_all.deb
linstor-common_1.7.1-1ppa1~bionic1_all.deb
python-linstor_1.1.1-1ppa1~bionic1_all.deb
linstor-client_1.1.1-1ppa1~bionic1_all.deb
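
These went onto each node with plain dpkg, roughly:

dpkg -i drbd-dkms_*.deb drbd-utils_*.deb python-linstor_*.deb \
        linstor-client_*.deb linstor-common_*.deb \
        linstor-satellite_*.deb linstor-controller_*.deb
apt-get -f install    # pull in any missing dependencies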

After adding the three nodes I had this output:
linstor node list
╭──────────────────────────────────────────────────────────╮
┊ Node   ┊ NodeType  ┊ Addresses                  ┊ State  ┊
╞══════════════════════════════════════════════════════════╡
┊ castle ┊ SATELLITE ┊ <IP>.204:3366 (PLAIN) ┊ Online ┊
┊ san5   ┊ SATELLITE ┊ <IP>.205:3366 (PLAIN) ┊ Online ┊
┊ san6   ┊ SATELLITE ┊ <IP>.206:3366 (PLAIN) ┊ Online ┊
╰──────────────────────────────────────────────────────────╯
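
For reference, the nodes were added along these lines (default satellite
type, addresses as shown in full further below):

linstor node create castle 192.168.5.204
linstor node create san5 192.168.5.205
linstor node create san6 192.168.5.206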

Then I added some storage pools:
linstor storage-pool list
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ StoragePool          ┊ Node   ┊ Driver   ┊ PoolName ┊ FreeCapacity ┊ TotalCapacity ┊ CanSnapshots ┊ State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ DfltDisklessStorPool ┊ castle ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok    ┊
┊ DfltDisklessStorPool ┊ san5   ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok    ┊
┊ DfltDisklessStorPool ┊ san6   ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok    ┊
┊ pool                 ┊ castle ┊ LVM      ┊ vg_hdd   ┊     3.44 TiB ┊      3.44 TiB ┊ False        ┊ Ok    ┊
┊ pool                 ┊ san5   ┊ LVM      ┊ vg_hdd   ┊     4.36 TiB ┊      4.36 TiB ┊ False        ┊ Ok    ┊
┊ pool                 ┊ san6   ┊ LVM      ┊ vg_ssd   ┊     1.75 TiB ┊      1.75 TiB ┊ False        ┊ Ok    ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
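
The LVM pools were created roughly like this:

linstor storage-pool create lvm castle pool vg_hdd
linstor storage-pool create lvm san5 pool vg_hdd
linstor storage-pool create lvm san6 pool vg_ssd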

Again, everything was looking pretty good.

So, I tried to create a resource, and then I got this:

linstor resource list
╭────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node   ┊ Port ┊ Usage  ┊ Conns                   ┊    State ┊
╞════════════════════════════════════════════════════════════════════════════╡
┊ testvm1      ┊ castle ┊ 7000 ┊        ┊                         ┊  Unknown ┊
┊ testvm1      ┊ san5   ┊ 7000 ┊        ┊                         ┊  Unknown ┊
┊ testvm1      ┊ san6   ┊ 7000 ┊ Unused ┊ Connecting(san5,castle) ┊ UpToDate ┊
╰────────────────────────────────────────────────────────────────────────────╯
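
For reference, the resource was created roughly as per the user's guide,
something like:

linstor resource-definition create testvm1
linstor volume-definition create testvm1 500G
linstor resource create testvm1 --storage-pool pool --auto-place 3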

There hasn't been any change in over 24 hours, so I'm guessing something
is stuck or not working, but I don't seem to have many clues as to what
it might be.
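
My guess is that the linstor side can be cross-checked with something
like the following, but I'm not sure what to look for:

linstor resource list-volumes
linstor error-reports list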

I've checked through the docs at: 
https://www.linbit.com/drbd-user-guide/linstor-guide-1_0-en/ and found 
these two commands in section 2.7 Checking the state of your cluster:

# linstor node list
# linstor storage-pool list --groupby Size

However, the second command produces a usage error (a documentation bug,
perhaps). Editing the command to something valid produces:
linstor storage-pool list --groupby Node
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ StoragePool          ┊ Node   ┊ Driver   ┊ PoolName ┊ FreeCapacity ┊ TotalCapacity ┊ CanSnapshots ┊ State   ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ DfltDisklessStorPool ┊ castle ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok      ┊
┊ pool                 ┊ castle ┊ LVM      ┊ vg_hdd   ┊     3.44 TiB ┊      3.44 TiB ┊ False        ┊ Ok      ┊
┊ DfltDisklessStorPool ┊ san5   ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Warning ┊
┊ pool                 ┊ san5   ┊ LVM      ┊ vg_hdd   ┊              ┊               ┊ False        ┊ Warning ┊
┊ DfltDisklessStorPool ┊ san6   ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok      ┊
┊ pool                 ┊ san6   ┊ LVM      ┊ vg_ssd   ┊     1.26 TiB ┊      1.75 TiB ┊ False        ┊ Ok      ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
WARNING:
Description:
     No active connection to satellite 'san5'
Details:
     The controller is trying to (re-) establish a connection to the 
satellite. The controller stored the changes and as soon the satellite 
is connected, it will receive this update.

Note: after waiting approx. 20 hours, san5 was shut down cleanly, so it
is currently offline.
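
Presumably, once san5 is powered back on, the satellite side there can be
verified with something like:

systemctl status linstor-satellite
ss -tlnp | grep 3366    # confirm the satellite is listening for the controller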

dmesg on san6 includes this:
[95078.272184] drbd testvm1: Starting worker thread (from drbdsetup [2398])
[95078.285272] drbd testvm1 castle: Starting sender thread (from drbdsetup [2402])
[95078.290733] drbd testvm1 san5: Starting sender thread (from drbdsetup [2406])
[95078.310399] drbd testvm1/0 drbd1000: meta-data IO uses: blk-bio
[95078.310500] drbd testvm1/0 drbd1000: rs_discard_granularity feature disabled
[95078.310767] drbd testvm1/0 drbd1000: disk( Diskless -> Attaching )
[95078.310775] drbd testvm1/0 drbd1000: Maximum number of peer devices = 7
[95078.310864] drbd testvm1: Method to ensure write ordering: flush
[95078.310867] drbd testvm1/0 drbd1000: Adjusting my ra_pages to backing device's (32 -> 1024)
[95078.310870] drbd testvm1/0 drbd1000: drbd_bm_resize called with capacity == 1048581248
[95078.418753] drbd testvm1/0 drbd1000: resync bitmap: bits=131072656 words=14336077 pages=28001
[95078.418757] drbd testvm1/0 drbd1000: size = 500 GB (524290624 KB)
[95078.593417] drbd testvm1/0 drbd1000: recounting of set bits took additional 64ms
[95078.593429] drbd testvm1/0 drbd1000: disk( Attaching -> Inconsistent ) quorum( no -> yes )
[95078.593431] drbd testvm1/0 drbd1000: attached to current UUID: 0000000000000004
[95078.595412] drbd testvm1 castle: conn( StandAlone -> Unconnected )
[95078.596649] drbd testvm1 san5: conn( StandAlone -> Unconnected )
[95078.599430] drbd testvm1 castle: Starting receiver thread (from drbd_w_testvm1 [2399])
[95078.599742] drbd testvm1 san5: Starting receiver thread (from drbd_w_testvm1 [2399])
[95078.599813] drbd testvm1 castle: conn( Unconnected -> Connecting )
[95078.604454] drbd testvm1 san5: conn( Unconnected -> Connecting )
[95079.113391] drbd testvm1/0 drbd1000: rs_discard_granularity feature disabled
[95079.146175] drbd testvm1: Preparing cluster-wide state change 1272763172 (2->-1 7683/4609)
[95079.146178] drbd testvm1: Committing cluster-wide state change 1272763172 (0ms)
[95079.146184] drbd testvm1: role( Secondary -> Primary )
[95079.146186] drbd testvm1/0 drbd1000: disk( Inconsistent -> UpToDate )
[95079.146256] drbd testvm1/0 drbd1000: size = 500 GB (524290624 KB)
[95079.152264] drbd testvm1: Forced to consider local data as UpToDate!
[95079.156608] drbd testvm1/0 drbd1000: new current UUID: 60E1FC2F9926E84B weak: FFFFFFFFFFFFFFFB
[95079.159415] drbd testvm1: role( Primary -> Secondary )
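
I assume the runtime counterpart of those kernel messages would be
visible on san6 with something like:

drbdadm status testvm1
drbdsetup status testvm1 --verbose --statistics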


----- a few weeks later...

I wrote the above intending to have another go at this later. I now have
san5 back online and have rebooted both castle and san6; the status on
all three nodes is now:
linstor n l
╭───────────────────────────────────────────────────────────╮
┊ Node   ┊ NodeType  ┊ Addresses                  ┊ State   ┊
╞═══════════════════════════════════════════════════════════╡
┊ castle ┊ SATELLITE ┊ 192.168.5.204:3366 (PLAIN) ┊ Unknown ┊
┊ san5   ┊ SATELLITE ┊ 192.168.5.205:3366 (PLAIN) ┊ Unknown ┊
┊ san6   ┊ SATELLITE ┊ 192.168.5.206:3366 (PLAIN) ┊ Unknown ┊
╰───────────────────────────────────────────────────────────╯
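
I assume the first thing to verify is that the services are actually
running on each node, e.g.:

systemctl status linstor-controller    # on whichever node runs the controller
systemctl status linstor-satellite     # on all three nodes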

Is there any other documentation on what to do when things go wrong? A
checklist to find where the problem might be? With the old DRBD 8.4,
/proc/drbd and dmesg seemed to be the two main sources of information,
but now I feel quite out of my depth. Any clues or suggestions on things
to check, or additional information to provide, would be greatly
appreciated.
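
(Aside from dmesg, the service journals seem like the other place to
look, e.g. journalctl -u linstor-satellite and
journalctl -u linstor-controller.)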
