[DRBD-user] linstor issues

Adam Goryachev mailinglists at websitemanagers.com.au
Wed Jun 24 03:46:07 CEST 2020


On 23/6/20 21:53, Gábor Hernádi wrote:
> Hi,
>
> apparently something is quite broken... maybe it's somehow your setup 
> or environment, I am not sure...
>
>     linstor resource list
>     ╭────────────────────────────────────────────────────────────────────────────╮
>     ┊ ResourceName ┊ Node   ┊ Port ┊ Usage  ┊ Conns                   ┊    State ┊
>     ╞════════════════════════════════════════════════════════════════════════════╡
>     ┊ testvm1      ┊ castle ┊ 7000 ┊        ┊                         ┊  Unknown ┊
>     ┊ testvm1      ┊ san5   ┊ 7000 ┊        ┊                         ┊  Unknown ┊
>     ┊ testvm1      ┊ san6   ┊ 7000 ┊ Unused ┊ Connecting(san5,castle) ┊ UpToDate ┊
>     ╰────────────────────────────────────────────────────────────────────────────╯
>
> This looks like some kind of network issue.
>
>     # linstor storage-pool list --groupby Size
>
>     However, the second command produces a usage error (documentation
>     bug perhaps).
>
>
> Thanks for reporting, we will look into this.
>
>     WARNING:
>     Description:
>         No active connection to satellite 'san5'
>     Details:
>         The controller is trying to (re-) establish a connection to
>     the satellite. The controller stored the changes and as soon the
>     satellite is connected, it will receive this update.
>
>
> So Linstor obviously has no connection to satellite 'san5'.
>
>     [95078.599813] drbd testvm1 castle: conn( Unconnected -> Connecting )
>     [95078.604454] drbd testvm1 san5: conn( Unconnected -> Connecting )
>
>
> ... and DRBD apparently also has trouble connecting...
>
>     linstor n l
>     ╭───────────────────────────────────────────────────────────╮
>     ┊ Node   ┊ NodeType  ┊ Addresses                  ┊ State   ┊
>     ╞═══════════════════════════════════════════════════════════╡
>     ┊ castle ┊ SATELLITE ┊ 192.168.5.204:3366 (PLAIN) ┊ Unknown ┊
>     ┊ san5   ┊ SATELLITE ┊ 192.168.5.205:3366 (PLAIN) ┊ Unknown ┊
>     ┊ san6   ┊ SATELLITE ┊ 192.168.5.206:3366 (PLAIN) ┊ Unknown ┊
>     ╰───────────────────────────────────────────────────────────╯
>
>
> Now this is really strange. I will spare you the details, but I
> assume you have triggered some bad exception in Linstor which somehow
> killed a necessary thread.
> You should check
>    linstor err list
> and see if you can find some related error reports.
> Also, restarting the controller might help you here.
>
Thank you!

linstor err list showed a list of errors, but the contents didn't make a 
lot of sense to me. Let me know if you are interested in them, and I can 
send them.

I did a systemctl restart linstor-controller.service on san6, and things 
started looking much better.

linstor n l
╭──────────────────────────────────────────────────────────╮
┊ Node   ┊ NodeType  ┊ Addresses                  ┊ State  ┊
╞══════════════════════════════════════════════════════════╡
┊ castle ┊ SATELLITE ┊ 192.168.5.204:3366 (PLAIN) ┊ Online ┊
┊ san5   ┊ SATELLITE ┊ 192.168.5.205:3366 (PLAIN) ┊ Online ┊
┊ san6   ┊ SATELLITE ┊ 192.168.5.206:3366 (PLAIN) ┊ Online ┊
╰──────────────────────────────────────────────────────────╯

So, all nodes agree that they are now online and talking to each other.
I assume this proves there are no network issues.

linstor resource list
╭─────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node   ┊ Port ┊ Usage  ┊ Conns              ┊              State ┊
╞═════════════════════════════════════════════════════════════════════════════════╡
┊ testvm1      ┊ castle ┊ 7000 ┊        ┊                    ┊            Unknown ┊
┊ testvm1      ┊ san5   ┊ 7000 ┊ Unused ┊ Connecting(castle) ┊ SyncTarget(12.67%) ┊
┊ testvm1      ┊ san6   ┊ 7000 ┊ Unused ┊ Connecting(castle) ┊           UpToDate ┊
╰─────────────────────────────────────────────────────────────────────────────────╯

From this, it looks like san6 (the controller) thinks it has the
up-to-date data, probably because the resource was created there first.
The data is syncing to san5 (in progress, and progressing steadily), so
that is good too. However, castle doesn't seem to be syncing or
connecting.
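
A minimal sketch of how a resource table like the one above could be checked from a script rather than by eye. It assumes the box-drawing layout printed by this client version (cells separated by `┊`). The helper names `parse_table` and `unhealthy` are made up for illustration, and the set of states treated as healthy is an assumption, not something LINSTOR defines.

```python
# Rows copied from the "linstor resource list" output above.
# Border lines are omitted here; the parser skips them anyway.
SAMPLE = """\
┊ ResourceName ┊ Node   ┊ Port ┊ Usage  ┊ Conns              ┊              State ┊
┊ testvm1      ┊ castle ┊ 7000 ┊        ┊                    ┊            Unknown ┊
┊ testvm1      ┊ san5   ┊ 7000 ┊ Unused ┊ Connecting(castle) ┊ SyncTarget(12.67%) ┊
┊ testvm1      ┊ san6   ┊ 7000 ┊ Unused ┊ Connecting(castle) ┊           UpToDate ┊
"""


def parse_table(text):
    """Turn a box-drawn table into a list of dicts keyed by the header cells."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("┊"):
            continue  # skip borders and blank lines
        cells = [cell.strip() for cell in line.strip("┊").split("┊")]
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, cells)) for cells in body]


# Assumption: these are the states that need no further attention.
HEALTHY_STATES = {"UpToDate", "Diskless"}


def unhealthy(rows):
    """Return (node, state, conns) for every row that still needs attention."""
    bad = []
    for row in rows:
        state_ok = row["State"] in HEALTHY_STATES
        conns_ok = row["Conns"] in ("", "Ok")
        if not (state_ok and conns_ok):
            bad.append((row["Node"], row["State"], row["Conns"]))
    return bad


if __name__ == "__main__":
    for node, state, conns in unhealthy(parse_table(SAMPLE)):
        print(f"{node}: state={state} conns={conns or '-'}")
```

With the sample rows, all three nodes are reported: castle because its state is Unknown, san5 because it is still a SyncTarget, and san6 because its connection to castle is still in Connecting.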

On castle, I see this:

Jun 24 11:01:55 castle Satellite[7499]: 11:01:55.177 [DeviceManager] ERROR LINSTOR/Satellite - SYSTEM - Failed to create meta-data for DRBD volume testvm1/0 [Report number 5EF2A316-31431-000002]

linstor err show gives this:

ERROR REPORT 5EF2A316-31431-000002

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.7.1
Build ID:                           6760637d6fae7a5862103ced4ea0ab0a758861f9
Build time:                         2020-05-14T13:14:11+00:00
Error time:                         2020-06-24 11:01:55
Node:                               castle

============================================================

Reported error:
===============

Description:
     Failed to create meta-data for DRBD volume testvm1/0

Category:                           LinStorException
Class name:                         VolumeException
Class canonical name:               com.linbit.linstor.storage.layer.exceptions.VolumeException
Generated at:                       Method 'createMetaData', Source file 'DrbdLayer.java', Line #995

Error message:                      Failed to create meta-data for DRBD volume testvm1/0

Error context:
     An error occurred while processing resource 'Node: 'castle', Rsc: 'testvm1''

Call backtrace:

     Method                                   Native Class:Line number
     createMetaData                           N com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:995
     adjustDrbd                               N com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:575
     process                                  N com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:373
     process                                  N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:731
     processResourcesAndSnapshots             N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:300
     dispatchResources                        N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:138
     dispatchResources                        N com.linbit.linstor.core.devmgr.DeviceManagerImpl:258
     phaseDispatchDeviceHandlers              N com.linbit.linstor.core.devmgr.DeviceManagerImpl:896
     devMgrLoop                               N com.linbit.linstor.core.devmgr.DeviceManagerImpl:618
     run                                      N com.linbit.linstor.core.devmgr.DeviceManagerImpl:535
     run                                      N java.lang.Thread:834

Caused by:
==========

Description:
     Execution of the external command 'drbdadm' failed.
Cause:
     The external command exited with error code 1.
Correction:
     - Check whether the external program is operating properly.
     - Check whether the command line is correct.
       Contact a system administrator or a developer if the command line is no longer valid
       for the installed version of the external program.
Additional information:
     The full command line executed was:
     drbdadm -vvv --max-peers 7 -- --force create-md testvm1/0

     The external command sent the following output data:


     The external command sent the following error information:
     no resources defined!


Category:                           LinStorException
Class name:                         ExtCmdFailedException
Class canonical name:               com.linbit.extproc.ExtCmdFailedException
Generated at:                       Method 'execute', Source file 'DrbdAdm.java', Line #550

Error message:                      The external command 'drbdadm' exited with error code 1


Call backtrace:

     Method                                   Native Class:Line number
     execute                                  N com.linbit.linstor.storage.layer.adapter.drbd.utils.DrbdAdm:550
     simpleAdmCommand                         N com.linbit.linstor.storage.layer.adapter.drbd.utils.DrbdAdm:495
     createMd                                 N com.linbit.linstor.storage.layer.adapter.drbd.utils.DrbdAdm:262
     createMetaData                           N com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:923
     adjustDrbd                               N com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:575
     process                                  N com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:373
     process                                  N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:731
     processResourcesAndSnapshots             N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:300
     dispatchResources                        N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:138
     dispatchResources                        N com.linbit.linstor.core.devmgr.DeviceManagerImpl:258
     phaseDispatchDeviceHandlers              N com.linbit.linstor.core.devmgr.DeviceManagerImpl:896
     devMgrLoop                               N com.linbit.linstor.core.devmgr.DeviceManagerImpl:618
     run                                      N com.linbit.linstor.core.devmgr.DeviceManagerImpl:535
     run                                      N java.lang.Thread:834


END OF ERROR REPORT.

Indeed, re-running the same command from the CLI produces the same
error message:

drbdadm -vvv --max-peers 7 -- --force create-md testvm1/0
no resources defined!
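
For anyone wading through several of these reports, here is a rough sketch of pulling out just the failed command line and what it printed. It relies only on the three marker sentences visible in the report above. `extract_failed_command` and `_block_after` are hypothetical helpers, not part of LINSTOR, and the sketch assumes each block ends at the first blank line.

```python
# Excerpt of the "Caused by" section from the error report above.
REPORT = """\
Additional information:
     The full command line executed was:
     drbdadm -vvv --max-peers 7 -- --force create-md testvm1/0

     The external command sent the following output data:


     The external command sent the following error information:
     no resources defined!


Category:                           LinStorException
"""

CMD_MARK = "The full command line executed was:"
OUT_MARK = "The external command sent the following output data:"
ERR_MARK = "The external command sent the following error information:"
MARKS = {CMD_MARK, OUT_MARK, ERR_MARK}


def _block_after(lines, marker):
    """Collect the non-empty lines that follow `marker`.

    Leading blank lines are skipped; collection stops at the first blank
    line after some content, or at the next marker sentence.
    """
    start = None
    for index, line in enumerate(lines):
        if line.strip() == marker:
            start = index + 1
            break
    if start is None:
        return []
    block = []
    for line in lines[start:]:
        text = line.strip()
        if text in MARKS or (not text and block):
            break
        if text:
            block.append(text)
    return block


def extract_failed_command(report):
    """Return the command line, stdout and stderr recorded in a report."""
    lines = report.splitlines()
    command = _block_after(lines, CMD_MARK)
    return {
        "command": command[0] if command else None,
        "stdout": _block_after(lines, OUT_MARK),
        "stderr": _block_after(lines, ERR_MARK),
    }


if __name__ == "__main__":
    info = extract_failed_command(REPORT)
    print(info["command"])
    print("stderr:", "; ".join(info["stderr"]) or "(empty)")
```

The extracted command can then be re-run by hand on the affected node, which is what I did above.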

Some other random status information which may or may not be relevant...

linstor storage-pool list
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ StoragePool          ┊ Node   ┊ Driver   ┊ PoolName ┊ FreeCapacity ┊ TotalCapacity ┊ CanSnapshots ┊ State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ DfltDisklessStorPool ┊ castle ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok    ┊
┊ DfltDisklessStorPool ┊ san5   ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok    ┊
┊ DfltDisklessStorPool ┊ san6   ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok    ┊
┊ pool                 ┊ castle ┊ LVM      ┊ vg_hdd   ┊     2.95 TiB ┊      3.44 TiB ┊ False        ┊ Ok    ┊
┊ pool                 ┊ san5   ┊ LVM      ┊ vg_hdd   ┊     3.87 TiB ┊      4.36 TiB ┊ False        ┊ Ok    ┊
┊ pool                 ┊ san6   ┊ LVM      ┊ vg_ssd   ┊     1.26 TiB ┊      1.75 TiB ┊ False        ┊ Ok    ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯

I tried restarting the linstor-satellite service on castle, but it
didn't make any difference.

After a reboot of castle, I now get this:

linstor resource list
╭────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node   ┊ Port ┊ Usage  ┊ Conns ┊              State ┊
╞════════════════════════════════════════════════════════════════════╡
┊ testvm1      ┊ castle ┊ 7000 ┊ Unused ┊ Ok    ┊           Diskless ┊
┊ testvm1      ┊ san5   ┊ 7000 ┊ Unused ┊ Ok    ┊ SyncTarget(55.99%) ┊
┊ testvm1      ┊ san6   ┊ 7000 ┊ Unused ┊ Ok    ┊           UpToDate ┊
╰────────────────────────────────────────────────────────────────────╯

However, looking at the error reports, I see exactly the same error
about creating the metadata on castle.

One interesting thing is that the LV seems to have been created:

lvs
   /dev/drbd0: open failed: Wrong medium type
   /dev/drbd1: open failed: Wrong medium type
   LV                            VG      Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
   backup_system_20200624_062513 storage swi-a-s---    4.00g      system 3.06
   system                        storage owi-aos---    5.00g
   testvm1_00000                 vg_hdd  -wi-a----- <500.11g

Any suggestions on where to look next? Or what I might have done wrong now?

Regards,
Adam




