[DRBD-user] linstor failure

Adam Goryachev mailinglists at websitemanagers.com.au
Sun Jan 3 19:37:11 CET 2021


I have a small test setup with 2 x diskless linstor-satellite nodes, and 
4 x diskful linstor-satellite nodes, one of which is the linstor-controller.

The idea is that the diskless nodes are the compute nodes (Xen, 
running the VMs whose data is on linstor resources).

I have two test VMs: one was (and still is) working OK (an older 
Debian Linux VM, crossbowold); the other, a Windows 10 VM 
(jspiteriVM1), failed while I was attempting to install the Xen PV 
drivers (not sure whether that is relevant). The other two resources 
(ns2 and windows-wm) are unused.

There is nothing relevant in the linstor error logs, but the 
linstor-controller node has this in its kern.log:

Dec 30 10:50:44 castle kernel: [4103630.414725] drbd windows-wm 
san6.mytest.com.au: sock was shut down by peer
Dec 30 10:50:44 castle kernel: [4103630.414752] drbd windows-wm 
san6.mytest.com.au: conn( Connected -> BrokenPipe ) peer( Secondary -> 
Unknown )
Dec 30 10:50:44 castle kernel: [4103630.414759] drbd windows-wm/0 
drbd1001 san6.mytest.com.au: pdsk( UpToDate -> DUnknown ) repl( 
Established -> Off )
Dec 30 10:50:44 castle kernel: [4103630.414807] drbd windows-wm 
san6.mytest.com.au: ack_receiver terminated
Dec 30 10:50:44 castle kernel: [4103630.414810] drbd windows-wm 
san6.mytest.com.au: Terminating ack_recv thread
Dec 30 10:50:44 castle kernel: [4103630.445961] drbd windows-wm 
san6.mytest.com.au: Restarting sender thread
Dec 30 10:50:44 castle kernel: [4103630.479708] drbd windows-wm 
san6.mytest.com.au: Connection closed
Dec 30 10:50:44 castle kernel: [4103630.479739] drbd windows-wm 
san6.mytest.com.au: helper command: /sbin/drbdadm disconnected
Dec 30 10:50:44 castle kernel: [4103630.486479] drbd windows-wm 
san6.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Dec 30 10:50:44 castle kernel: [4103630.486533] drbd windows-wm 
san6.mytest.com.au: conn( BrokenPipe -> Unconnected )
Dec 30 10:50:44 castle kernel: [4103630.486556] drbd windows-wm 
san6.mytest.com.au: Restarting receiver thread
Dec 30 10:50:44 castle kernel: [4103630.486566] drbd windows-wm 
san6.mytest.com.au: conn( Unconnected -> Connecting )
Dec 30 10:50:44 castle kernel: [4103631.006727] drbd windows-wm 
san6.mytest.com.au: Handshake to peer 2 successful: Agreed network 
protocol version 117
Dec 30 10:50:44 castle kernel: [4103631.006735] drbd windows-wm 
san6.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM 
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Dec 30 10:50:44 castle kernel: [4103631.006918] drbd windows-wm 
san6.mytest.com.au: Peer authenticated using 20 bytes HMAC
Dec 30 10:50:44 castle kernel: [4103631.006943] drbd windows-wm 
san6.mytest.com.au: Starting ack_recv thread (from drbd_r_windows- [1164])
Dec 30 10:50:44 castle kernel: [4103631.041925] drbd windows-wm/0 
drbd1001 san6.mytest.com.au: drbd_sync_handshake:
Dec 30 10:50:44 castle kernel: [4103631.041932] drbd windows-wm/0 
drbd1001 san6.mytest.com.au: self 
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000 
bits:0 flags:120
Dec 30 10:50:44 castle kernel: [4103631.041937] drbd windows-wm/0 
drbd1001 san6.mytest.com.au: peer 
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000 
bits:0 flags:120
Dec 30 10:50:44 castle kernel: [4103631.041941] drbd windows-wm/0 
drbd1001 san6.mytest.com.au: uuid_compare()=no-sync by rule 38
Dec 30 10:50:44 castle kernel: [4103631.229931] drbd windows-wm: 
Preparing cluster-wide state change 1880606796 (0->2 499/146)
Dec 30 10:50:44 castle kernel: [4103631.230424] drbd windows-wm: State 
change 1880606796: primary_nodes=0, weak_nodes=0
Dec 30 10:50:44 castle kernel: [4103631.230429] drbd windows-wm: 
Committing cluster-wide state change 1880606796 (0ms)
Dec 30 10:50:44 castle kernel: [4103631.230480] drbd windows-wm 
san6.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown -> 
Secondary )
Dec 30 10:50:44 castle kernel: [4103631.230486] drbd windows-wm/0 
drbd1001 san6.mytest.com.au: pdsk( DUnknown -> UpToDate ) repl( Off -> 
Established )
Dec 30 10:58:27 castle kernel: [4104093.577650] drbd jspiteriVM1 
xen1.mytest.com.au: peer( Primary -> Secondary )
Dec 30 10:58:27 castle kernel: [4104093.790062] drbd jspiteriVM1/0 
drbd1011: bitmap WRITE of 327 pages took 216 ms
Dec 30 10:58:39 castle kernel: [4104106.278699] drbd jspiteriVM1 
xen1.mytest.com.au: Preparing remote state change 490644362
Dec 30 10:58:39 castle kernel: [4104106.278984] drbd jspiteriVM1 
xen1.mytest.com.au: Committing remote state change 490644362 
(primary_nodes=10)
Dec 30 10:58:39 castle kernel: [4104106.278999] drbd jspiteriVM1 
xen1.mytest.com.au: peer( Secondary -> Primary )
Dec 30 10:58:40 castle kernel: [4104106.547178] drbd jspiteriVM1/0 
drbd1011 xen1.mytest.com.au: resync-susp( no -> connection dependency )
Dec 30 10:58:40 castle kernel: [4104106.547191] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: repl( PausedSyncT -> SyncTarget ) 
resync-susp( peer -> no )
Dec 30 10:58:40 castle kernel: [4104106.547198] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Syncer continues.
Dec 30 11:04:29 castle kernel: [4104456.362585] drbd jspiteriVM1 
xen1.mytest.com.au: peer( Primary -> Secondary )
Dec 30 11:04:30 castle kernel: [4104456.388543] drbd jspiteriVM1/0 
drbd1011: bitmap WRITE of 1 pages took 24 ms
Dec 30 11:04:30 castle kernel: [4104456.401108] drbd jspiteriVM1/0 
drbd1011 san6.mytest.com.au: pdsk( UpToDate -> Outdated )
Dec 30 11:04:30 castle kernel: [4104456.788360] drbd jspiteriVM1/0 
drbd1011 san6.mytest.com.au: pdsk( Outdated -> Inconsistent )
Dec 30 11:09:15 castle kernel: [4104742.275721] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Dec 30 11:09:15 castle kernel: [4104742.377977] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Dec 30 11:09:16 castle kernel: [4104742.481920] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=3
Dec 30 11:09:16 castle kernel: [4104742.585933] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=4
Dec 30 11:09:16 castle kernel: [4104742.689909] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104742.793898] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
[this line repeats .... until Jan 2 2:33am, probably when I rebooted it]

Jan  2 02:33:46 castle kernel: [4333012.494110] drbd jspiteriVM1 
san5.mytest.com.au: Restarting sender thread
Jan  2 02:33:46 castle kernel: [4333012.528437] drbd jspiteriVM1 
san5.mytest.com.au: Connection closed
Jan  2 02:33:46 castle kernel: [4333012.528447] drbd jspiteriVM1 
san5.mytest.com.au: helper command: /sbin/drbdadm disconnected
Jan  2 02:33:46 castle kernel: [4333012.530942] drbd jspiteriVM1 
san5.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Jan  2 02:33:46 castle kernel: [4333012.530960] drbd jspiteriVM1 
san5.mytest.com.au: conn( BrokenPipe -> Unconnected )
Jan  2 02:33:46 castle kernel: [4333012.530970] drbd jspiteriVM1 
san5.mytest.com.au: Restarting receiver thread
Jan  2 02:33:46 castle kernel: [4333012.530974] drbd jspiteriVM1 
san5.mytest.com.au: conn( Unconnected -> Connecting )
Jan  2 02:33:46 castle kernel: [4333013.054060] drbd jspiteriVM1 
san5.mytest.com.au: Handshake to peer 1 successful: Agreed network 
protocol version 117
Jan  2 02:33:46 castle kernel: [4333013.054067] drbd jspiteriVM1 
san5.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM 
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan  2 02:33:46 castle kernel: [4333013.054426] drbd jspiteriVM1 
san5.mytest.com.au: Peer authenticated using 20 bytes HMAC
Jan  2 02:33:46 castle kernel: [4333013.054452] drbd jspiteriVM1 
san5.mytest.com.au: Starting ack_recv thread (from drbd_r_jspiteri [1046])
Jan  2 02:33:46 castle kernel: [4333013.085933] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: drbd_sync_handshake:
Jan  2 02:33:46 castle kernel: [4333013.085941] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: self 
122E90789B3D90E2:122E90789B3D90E3:4D2D1C8F63C38B44:B1B847713A96996E 
bits:21168661 flags:124
Jan  2 02:33:46 castle kernel: [4333013.085946] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: peer 
2B520E804A7D4EAC:0000000000000000:4D2D1C8F63C38B44:B1B847713A96996E 
bits:21168661 flags:124
Jan  2 02:33:46 castle kernel: [4333013.085952] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: uuid_compare()=target-set-bitmap by rule 60
Jan  2 02:33:46 castle kernel: [4333013.085956] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Setting and writing one bitmap slot, after 
drbd_sync_handshake
Jan  2 02:33:46 castle kernel: [4333013.226948] drbd jspiteriVM1/0 
drbd1011: bitmap WRITE of 1078 pages took 88 ms
Jan  2 02:33:46 castle kernel: [4333013.278401] drbd jspiteriVM1: 
Preparing cluster-wide state change 3482568163 (0->1 499/146)
Jan  2 02:33:46 castle kernel: [4333013.278980] drbd jspiteriVM1: State 
change 3482568163: primary_nodes=0, weak_nodes=0
Jan  2 02:33:46 castle kernel: [4333013.278985] drbd jspiteriVM1: 
Committing cluster-wide state change 3482568163 (0ms)
Jan  2 02:33:46 castle kernel: [4333013.279050] drbd jspiteriVM1 
san5.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown -> 
Secondary )
Jan  2 02:33:46 castle kernel: [4333013.279055] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: repl( Off -> WFBitMapT )
Jan  2 02:33:46 castle kernel: [4333013.326494] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: receive bitmap stats [Bytes(packets)]: 
plain 0(0), RLE 23(1), total 23; compression: 100.0%
Jan  2 02:33:46 castle kernel: [4333013.337300] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: send bitmap stats [Bytes(packets)]: plain 
0(0), RLE 23(1), total 23; compression: 100.0%
Jan  2 02:33:46 castle kernel: [4333013.337313] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm 
before-resync-target
Jan  2 02:33:46 castle kernel: [4333013.339475] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm 
before-resync-target exit code 0
Jan  2 02:33:46 castle kernel: [4333013.339503] drbd jspiteriVM1/0 
drbd1011 xen1.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339504] drbd jspiteriVM1/0 
drbd1011 san7.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339505] drbd jspiteriVM1/0 
drbd1011 san6.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339507] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: repl( WFBitMapT -> SyncTarget )
Jan  2 02:33:46 castle kernel: [4333013.339552] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Began resync as SyncTarget (will sync 
104859732 KB [26214933 bits set]).
Jan  2 02:50:55 castle kernel: [4334042.151194] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Jan  2 02:50:55 castle kernel: [4334042.254225] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Resync done (total 1028 sec; paused 0 sec; 
102000 K/sec)
Jan  2 02:50:55 castle kernel: [4334042.254230] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: expected n_oos:23691797 to be equal to 
rs_failed:23727152
Jan  2 02:50:55 castle kernel: [4334042.254232] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au:             23727152 failed blocks
Jan  2 02:50:55 castle kernel: [4334042.254245] drbd jspiteriVM1/0 
drbd1011 xen1.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254247] drbd jspiteriVM1/0 
drbd1011 san7.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254249] drbd jspiteriVM1/0 
drbd1011 san6.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254252] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: pdsk( Outdated -> UpToDate ) repl( 
SyncTarget -> Established )
Jan  2 02:50:55 castle kernel: [4334042.281495] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm 
after-resync-target
Jan  2 02:50:55 castle kernel: [4334042.289879] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm 
after-resync-target exit code 0
Jan  2 02:50:55 castle kernel: [4334042.289879] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: pdsk( UpToDate -> Inconsistent )
Jan  2 10:23:28 castle kernel: [4361194.855074] drbd windows-wm 
san7.mytest.com.au: sock was shut down by peer
Jan  2 10:23:28 castle kernel: [4361194.855101] drbd windows-wm 
san7.mytest.com.au: conn( Connected -> BrokenPipe ) peer( Secondary -> 
Unknown )
Jan  2 10:23:28 castle kernel: [4361194.855109] drbd windows-wm/0 
drbd1001 san7.mytest.com.au: pdsk( UpToDate -> DUnknown ) repl( 
Established -> Off )
Jan  2 10:23:28 castle kernel: [4361194.855161] drbd windows-wm 
san7.mytest.com.au: ack_receiver terminated
Jan  2 10:23:28 castle kernel: [4361194.855164] drbd windows-wm 
san7.mytest.com.au: Terminating ack_recv thread
Jan  2 10:23:28 castle kernel: [4361194.882138] drbd windows-wm 
san7.mytest.com.au: Restarting sender thread
Jan  2 10:23:28 castle kernel: [4361194.961402] drbd windows-wm 
san7.mytest.com.au: Connection closed
Jan  2 10:23:28 castle kernel: [4361194.961435] drbd windows-wm 
san7.mytest.com.au: helper command: /sbin/drbdadm disconnected
Jan  2 10:23:28 castle kernel: [4361194.968763] drbd windows-wm 
san7.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Jan  2 10:23:28 castle kernel: [4361194.968800] drbd windows-wm 
san7.mytest.com.au: conn( BrokenPipe -> Unconnected )
Jan  2 10:23:28 castle kernel: [4361194.968812] drbd windows-wm 
san7.mytest.com.au: Restarting receiver thread
Jan  2 10:23:28 castle kernel: [4361194.968816] drbd windows-wm 
san7.mytest.com.au: conn( Unconnected -> Connecting )
Jan  2 10:23:29 castle kernel: [4361195.486059] drbd windows-wm 
san7.mytest.com.au: Handshake to peer 3 successful: Agreed network 
protocol version 117
Jan  2 10:23:29 castle kernel: [4361195.486066] drbd windows-wm 
san7.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM 
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan  2 10:23:29 castle kernel: [4361195.486490] drbd windows-wm 
san7.mytest.com.au: Peer authenticated using 20 bytes HMAC
Jan  2 10:23:29 castle kernel: [4361195.486515] drbd windows-wm 
san7.mytest.com.au: Starting ack_recv thread (from drbd_r_windows- [1165])
Jan  2 10:23:29 castle kernel: [4361195.517928] drbd windows-wm/0 
drbd1001 san7.mytest.com.au: drbd_sync_handshake:
Jan  2 10:23:29 castle kernel: [4361195.517935] drbd windows-wm/0 
drbd1001 san7.mytest.com.au: self 
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000 
bits:0 flags:120
Jan  2 10:23:29 castle kernel: [4361195.517940] drbd windows-wm/0 
drbd1001 san7.mytest.com.au: peer 
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000 
bits:0 flags:120
Jan  2 10:23:29 castle kernel: [4361195.517944] drbd windows-wm/0 
drbd1001 san7.mytest.com.au: uuid_compare()=no-sync by rule 38
Jan  2 10:23:29 castle kernel: [4361195.677932] drbd windows-wm: 
Preparing cluster-wide state change 3667329610 (0->3 499/146)
Jan  2 10:23:29 castle kernel: [4361195.678459] drbd windows-wm: State 
change 3667329610: primary_nodes=0, weak_nodes=0
Jan  2 10:23:29 castle kernel: [4361195.678466] drbd windows-wm: 
Committing cluster-wide state change 3667329610 (0ms)
Jan  2 10:23:29 castle kernel: [4361195.678516] drbd windows-wm 
san7.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown -> 
Secondary )
Jan  2 10:23:29 castle kernel: [4361195.678522] drbd windows-wm/0 
drbd1001 san7.mytest.com.au: pdsk( DUnknown -> UpToDate ) repl( Off -> 
Established )

castle:/var/log# linstor resource list
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node   ┊ Port ┊ Usage  ┊ Conns                                                                              ┊ State             ┊ CreatedOn           ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ crossbowold  ┊ castle ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-10-07 00:46:23 ┊
┊ crossbowold  ┊ flail  ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊ Diskless          ┊ 2021-01-04 05:03:20 ┊
┊ crossbowold  ┊ san5   ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-10-07 00:46:23 ┊
┊ crossbowold  ┊ san6   ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-10-07 00:46:22 ┊
┊ crossbowold  ┊ san7   ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-10-07 00:46:21 ┊
┊ crossbowold  ┊ xen1   ┊ 7010 ┊ InUse  ┊ Ok                                                                                 ┊ Diskless          ┊ 2020-10-15 00:30:31 ┊
┊ jspiteriVM1  ┊ castle ┊ 7011 ┊ Unused ┊ StandAlone(san6.mytest.com.au,san7.mytest.com.au)                                  ┊ SyncTarget(0.00%) ┊ 2020-10-14 22:15:00 ┊
┊ jspiteriVM1  ┊ san5   ┊ 7011 ┊ Unused ┊ Connecting(san7.mytest.com.au)                                                     ┊ Inconsistent      ┊ 2020-10-14 22:14:59 ┊
┊ jspiteriVM1  ┊ san6   ┊ 7011 ┊ Unused ┊ Connecting(castle.mytest.com.au,san7.mytest.com.au)                                ┊ SyncTarget(0.00%) ┊ 2020-10-14 22:14:58 ┊
┊ jspiteriVM1  ┊ san7   ┊ 7011 ┊ Unused ┊ Connecting(castle.mytest.com.au),StandAlone(san6.mytest.com.au,san5.mytest.com.au) ┊ Inconsistent      ┊ 2020-10-14 22:14:58 ┊
┊ jspiteriVM1  ┊ xen1   ┊ 7011 ┊ Unused ┊ Ok                                                                                 ┊ Diskless          ┊ 2020-11-20 20:39:20 ┊
┊ ns2          ┊ castle ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-10-28 23:22:13 ┊
┊ ns2          ┊ flail  ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊ Diskless          ┊ 2021-01-04 05:03:42 ┊
┊ ns2          ┊ san5   ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-10-28 23:22:12 ┊
┊ ns2          ┊ san6   ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-10-28 23:22:11 ┊
┊ ns2          ┊ xen1   ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊ Diskless          ┊ 2020-10-28 23:30:20 ┊
┊ windows-wm   ┊ castle ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-09-30 00:03:41 ┊
┊ windows-wm   ┊ flail  ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊ Diskless          ┊ 2021-01-04 05:03:48 ┊
┊ windows-wm   ┊ san5   ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-09-30 00:03:40 ┊
┊ windows-wm   ┊ san6   ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-09-30 00:03:39 ┊
┊ windows-wm   ┊ san7   ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-09-30 00:13:05 ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Can anyone determine from this what went wrong, or advise which 
additional logs I should examine? I don't see anything obvious that 
would have caused linstor/drbd to fail here; all nodes were online and 
uninterrupted as far as I can tell. All physical storage is backed by 
MD RAID arrays, so there is also some protection against disk failures 
(I haven't noticed any, in any case).
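For completeness, these are the checks I have been running so far 
(nothing stands out; the mdadm device name is just a placeholder for 
the actual arrays):

```shell
# linstor's own error reports (nothing relevant in mine):
linstor error-reports list

# Kernel ring buffer on each node (same content as kern.log):
journalctl -k | grep drbd

# MD RAID health on the diskful nodes (/dev/md0 is a placeholder):
cat /proc/mdstat
mdadm --detail /dev/md0
```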

I've since upgraded to the latest versions of the drbd/linstor 
components on all nodes.
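For reference, this is how I verified the versions after the upgrade 
(as I understand the commands from the docs):

```shell
# Kernel module version on each node:
cat /proc/drbd
modinfo drbd | grep '^version'

# Userland and controller versions:
drbdadm --version
linstor controller version
```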

Finally, what can I do to recover the data? Has it been destroyed, or 
do I just need to select a node and tell linstor that that node has 
up-to-date data? Or can linstor work that out somehow?
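In case it clarifies the question, this is what I was considering 
trying, based on my reading of the DRBD 9 user guide (I have not run 
these yet, and I may well have the wrong end of the stick):

```shell
# On a node whose copy of jspiteriVM1 I am willing to throw away,
# force a full resync from an UpToDate peer (this destroys only
# the local copy):
drbdadm invalidate jspiteriVM1

# Or, for a StandAlone connection after a split-brain, reconnect
# while discarding the local modifications:
drbdadm connect --discard-my-data jspiteriVM1
```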

Regards,
Adam


