[DRBD-user] linstor failure

Adam Goryachev mailinglists at websitemanagers.com.au
Wed Feb 17 03:20:38 CET 2021


Reposting the below, as I guess early January wasn't the best time to get 
any responses. I'd really appreciate any assistance: I'd prefer to 
avoid rebuilding the VM from scratch (wasted hours, not lost data), and 
I'd also like to know how to resolve or avoid this issue in the future, 
once I have real data being stored.

Thanks,
Adam


I have a small test setup with 2 x diskless linstor-satellite nodes, and 
4 x diskful linstor-satellite nodes, one of which is the linstor-controller.


The idea is that the diskless nodes are the compute nodes (Xen, running the 
VMs whose data is on LINSTOR resources).

I have two test VMs. One was (and still is) working OK (an older 
Debian Linux VM, crossbowold); the other, a Windows 10 VM 
(jspiterivm1), failed while I was attempting to install the Xen PV 
drivers (not sure if that is relevant or not). The other two resources 
(ns2 and windows-wm) are unused.

I have nothing relevant in the LINSTOR error logs, but the 
linstor-controller node has this in its kern.log:

Dec 30 10:50:44 castle kernel: [4103630.414725] drbd windows-wm 
san6.mytest.com.au: sock was shut down by peer
Dec 30 10:50:44 castle kernel: [4103630.414752] drbd windows-wm 
san6.mytest.com.au: conn( Connected -> BrokenPipe ) peer( Secondary -> 
Unknown )
Dec 30 10:50:44 castle kernel: [4103630.414759] drbd windows-wm/0 
drbd1001 san6.mytest.com.au: pdsk( UpToDate -> DUnknown ) repl( 
Established -> Off )
Dec 30 10:50:44 castle kernel: [4103630.414807] drbd windows-wm 
san6.mytest.com.au: ack_receiver terminated
Dec 30 10:50:44 castle kernel: [4103630.414810] drbd windows-wm 
san6.mytest.com.au: Terminating ack_recv thread
Dec 30 10:50:44 castle kernel: [4103630.445961] drbd windows-wm 
san6.mytest.com.au: Restarting sender thread
Dec 30 10:50:44 castle kernel: [4103630.479708] drbd windows-wm 
san6.mytest.com.au: Connection closed
Dec 30 10:50:44 castle kernel: [4103630.479739] drbd windows-wm 
san6.mytest.com.au: helper command: /sbin/drbdadm disconnected
Dec 30 10:50:44 castle kernel: [4103630.486479] drbd windows-wm 
san6.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Dec 30 10:50:44 castle kernel: [4103630.486533] drbd windows-wm 
san6.mytest.com.au: conn( BrokenPipe -> Unconnected )
Dec 30 10:50:44 castle kernel: [4103630.486556] drbd windows-wm 
san6.mytest.com.au: Restarting receiver thread
Dec 30 10:50:44 castle kernel: [4103630.486566] drbd windows-wm 
san6.mytest.com.au: conn( Unconnected -> Connecting )
Dec 30 10:50:44 castle kernel: [4103631.006727] drbd windows-wm 
san6.mytest.com.au: Handshake to peer 2 successful: Agreed network 
protocol version 117
Dec 30 10:50:44 castle kernel: [4103631.006735] drbd windows-wm 
san6.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM 
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Dec 30 10:50:44 castle kernel: [4103631.006918] drbd windows-wm 
san6.mytest.com.au: Peer authenticated using 20 bytes HMAC
Dec 30 10:50:44 castle kernel: [4103631.006943] drbd windows-wm 
san6.mytest.com.au: Starting ack_recv thread (from drbd_r_windows- [1164])
Dec 30 10:50:44 castle kernel: [4103631.041925] drbd windows-wm/0 
drbd1001 san6.mytest.com.au: drbd_sync_handshake:
Dec 30 10:50:44 castle kernel: [4103631.041932] drbd windows-wm/0 
drbd1001 san6.mytest.com.au: self 
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000 
bits:0 flags:120
Dec 30 10:50:44 castle kernel: [4103631.041937] drbd windows-wm/0 
drbd1001 san6.mytest.com.au: peer 
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000 
bits:0 flags:120
Dec 30 10:50:44 castle kernel: [4103631.041941] drbd windows-wm/0 
drbd1001 san6.mytest.com.au: uuid_compare()=no-sync by rule 38
Dec 30 10:50:44 castle kernel: [4103631.229931] drbd windows-wm: 
Preparing cluster-wide state change 1880606796 (0->2 499/146)
Dec 30 10:50:44 castle kernel: [4103631.230424] drbd windows-wm: State 
change 1880606796: primary_nodes=0, weak_nodes=0
Dec 30 10:50:44 castle kernel: [4103631.230429] drbd windows-wm: 
Committing cluster-wide state change 1880606796 (0ms)
Dec 30 10:50:44 castle kernel: [4103631.230480] drbd windows-wm 
san6.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown -> 
Secondary )
Dec 30 10:50:44 castle kernel: [4103631.230486] drbd windows-wm/0 
drbd1001 san6.mytest.com.au: pdsk( DUnknown -> UpToDate ) repl( Off -> 
Established )
Dec 30 10:58:27 castle kernel: [4104093.577650] drbd jspiteriVM1 
xen1.mytest.com.au: peer( Primary -> Secondary )
Dec 30 10:58:27 castle kernel: [4104093.790062] drbd jspiteriVM1/0 
drbd1011: bitmap WRITE of 327 pages took 216 ms
Dec 30 10:58:39 castle kernel: [4104106.278699] drbd jspiteriVM1 
xen1.mytest.com.au: Preparing remote state change 490644362
Dec 30 10:58:39 castle kernel: [4104106.278984] drbd jspiteriVM1 
xen1.mytest.com.au: Committing remote state change 490644362 
(primary_nodes=10)
Dec 30 10:58:39 castle kernel: [4104106.278999] drbd jspiteriVM1 
xen1.mytest.com.au: peer( Secondary -> Primary )
Dec 30 10:58:40 castle kernel: [4104106.547178] drbd jspiteriVM1/0 
drbd1011 xen1.mytest.com.au: resync-susp( no -> connection dependency )
Dec 30 10:58:40 castle kernel: [4104106.547191] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: repl( PausedSyncT -> SyncTarget ) 
resync-susp( peer -> no )
Dec 30 10:58:40 castle kernel: [4104106.547198] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Syncer continues.
Dec 30 11:04:29 castle kernel: [4104456.362585] drbd jspiteriVM1 
xen1.mytest.com.au: peer( Primary -> Secondary )
Dec 30 11:04:30 castle kernel: [4104456.388543] drbd jspiteriVM1/0 
drbd1011: bitmap WRITE of 1 pages took 24 ms
Dec 30 11:04:30 castle kernel: [4104456.401108] drbd jspiteriVM1/0 
drbd1011 san6.mytest.com.au: pdsk( UpToDate -> Outdated )
Dec 30 11:04:30 castle kernel: [4104456.788360] drbd jspiteriVM1/0 
drbd1011 san6.mytest.com.au: pdsk( Outdated -> Inconsistent )
Dec 30 11:09:15 castle kernel: [4104742.275721] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Dec 30 11:09:15 castle kernel: [4104742.377977] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Dec 30 11:09:16 castle kernel: [4104742.481920] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=3
Dec 30 11:09:16 castle kernel: [4104742.585933] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=4
Dec 30 11:09:16 castle kernel: [4104742.689909] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104742.793898] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104742.897895] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.001927] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.105909] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.209908] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.313927] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.417897] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.521909] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.575764] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.625902] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.729908] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.833894] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.937890] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104744.041907] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
[this line repeats until Jan 2 02:33, probably when I rebooted the node]

Jan  2 02:33:46 castle kernel: [4333012.494110] drbd jspiteriVM1 
san5.mytest.com.au: Restarting sender thread
Jan  2 02:33:46 castle kernel: [4333012.528437] drbd jspiteriVM1 
san5.mytest.com.au: Connection closed
Jan  2 02:33:46 castle kernel: [4333012.528447] drbd jspiteriVM1 
san5.mytest.com.au: helper command: /sbin/drbdadm disconnected
Jan  2 02:33:46 castle kernel: [4333012.530942] drbd jspiteriVM1 
san5.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Jan  2 02:33:46 castle kernel: [4333012.530960] drbd jspiteriVM1 
san5.mytest.com.au: conn( BrokenPipe -> Unconnected )
Jan  2 02:33:46 castle kernel: [4333012.530970] drbd jspiteriVM1 
san5.mytest.com.au: Restarting receiver thread
Jan  2 02:33:46 castle kernel: [4333012.530974] drbd jspiteriVM1 
san5.mytest.com.au: conn( Unconnected -> Connecting )
Jan  2 02:33:46 castle kernel: [4333013.054060] drbd jspiteriVM1 
san5.mytest.com.au: Handshake to peer 1 successful: Agreed network 
protocol version 117
Jan  2 02:33:46 castle kernel: [4333013.054067] drbd jspiteriVM1 
san5.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM 
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan  2 02:33:46 castle kernel: [4333013.054426] drbd jspiteriVM1 
san5.mytest.com.au: Peer authenticated using 20 bytes HMAC
Jan  2 02:33:46 castle kernel: [4333013.054452] drbd jspiteriVM1 
san5.mytest.com.au: Starting ack_recv thread (from drbd_r_jspiteri [1046])
Jan  2 02:33:46 castle kernel: [4333013.085933] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: drbd_sync_handshake:
Jan  2 02:33:46 castle kernel: [4333013.085941] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: self 
122E90789B3D90E2:122E90789B3D90E3:4D2D1C8F63C38B44:B1B847713A96996E 
bits:21168661 flags:124
Jan  2 02:33:46 castle kernel: [4333013.085946] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: peer 
2B520E804A7D4EAC:0000000000000000:4D2D1C8F63C38B44:B1B847713A96996E 
bits:21168661 flags:124
Jan  2 02:33:46 castle kernel: [4333013.085952] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: uuid_compare()=target-set-bitmap by rule 60
Jan  2 02:33:46 castle kernel: [4333013.085956] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Setting and writing one bitmap slot, after 
drbd_sync_handshake
Jan  2 02:33:46 castle kernel: [4333013.226948] drbd jspiteriVM1/0 
drbd1011: bitmap WRITE of 1078 pages took 88 ms
Jan  2 02:33:46 castle kernel: [4333013.278401] drbd jspiteriVM1: 
Preparing cluster-wide state change 3482568163 (0->1 499/146)
Jan  2 02:33:46 castle kernel: [4333013.278980] drbd jspiteriVM1: State 
change 3482568163: primary_nodes=0, weak_nodes=0
Jan  2 02:33:46 castle kernel: [4333013.278985] drbd jspiteriVM1: 
Committing cluster-wide state change 3482568163 (0ms)
Jan  2 02:33:46 castle kernel: [4333013.279050] drbd jspiteriVM1 
san5.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown -> 
Secondary )
Jan  2 02:33:46 castle kernel: [4333013.279055] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: repl( Off -> WFBitMapT )
Jan  2 02:33:46 castle kernel: [4333013.326494] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: receive bitmap stats [Bytes(packets)]: 
plain 0(0), RLE 23(1), total 23; compression: 100.0%
Jan  2 02:33:46 castle kernel: [4333013.337300] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: send bitmap stats [Bytes(packets)]: plain 
0(0), RLE 23(1), total 23; compression: 100.0%
Jan  2 02:33:46 castle kernel: [4333013.337313] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm 
before-resync-target
Jan  2 02:33:46 castle kernel: [4333013.339475] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm 
before-resync-target exit code 0
Jan  2 02:33:46 castle kernel: [4333013.339503] drbd jspiteriVM1/0 
drbd1011 xen1.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339504] drbd jspiteriVM1/0 
drbd1011 san7.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339505] drbd jspiteriVM1/0 
drbd1011 san6.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339507] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: repl( WFBitMapT -> SyncTarget )
Jan  2 02:33:46 castle kernel: [4333013.339552] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Began resync as SyncTarget (will sync 
104859732 KB [26214933 bits set]).
Jan  2 02:50:55 castle kernel: [4334042.151194] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Jan  2 02:50:55 castle kernel: [4334042.254225] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: Resync done (total 1028 sec; paused 0 sec; 
102000 K/sec)
Jan  2 02:50:55 castle kernel: [4334042.254230] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: expected n_oos:23691797 to be equal to 
rs_failed:23727152
Jan  2 02:50:55 castle kernel: [4334042.254232] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au:             23727152 failed blocks
Jan  2 02:50:55 castle kernel: [4334042.254245] drbd jspiteriVM1/0 
drbd1011 xen1.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254247] drbd jspiteriVM1/0 
drbd1011 san7.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254249] drbd jspiteriVM1/0 
drbd1011 san6.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254252] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: pdsk( Outdated -> UpToDate ) repl( 
SyncTarget -> Established )
Jan  2 02:50:55 castle kernel: [4334042.281495] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm 
after-resync-target
Jan  2 02:50:55 castle kernel: [4334042.289879] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm 
after-resync-target exit code 0
Jan  2 02:50:55 castle kernel: [4334042.289879] drbd jspiteriVM1/0 
drbd1011 san5.mytest.com.au: pdsk( UpToDate -> Inconsistent )
Jan  2 10:23:28 castle kernel: [4361194.855074] drbd windows-wm 
san7.mytest.com.au: sock was shut down by peer
Jan  2 10:23:28 castle kernel: [4361194.855101] drbd windows-wm 
san7.mytest.com.au: conn( Connected -> BrokenPipe ) peer( Secondary -> 
Unknown )
Jan  2 10:23:28 castle kernel: [4361194.855109] drbd windows-wm/0 
drbd1001 san7.mytest.com.au: pdsk( UpToDate -> DUnknown ) repl( 
Established -> Off )
Jan  2 10:23:28 castle kernel: [4361194.855161] drbd windows-wm 
san7.mytest.com.au: ack_receiver terminated
Jan  2 10:23:28 castle kernel: [4361194.855164] drbd windows-wm 
san7.mytest.com.au: Terminating ack_recv thread
Jan  2 10:23:28 castle kernel: [4361194.882138] drbd windows-wm 
san7.mytest.com.au: Restarting sender thread
Jan  2 10:23:28 castle kernel: [4361194.961402] drbd windows-wm 
san7.mytest.com.au: Connection closed
Jan  2 10:23:28 castle kernel: [4361194.961435] drbd windows-wm 
san7.mytest.com.au: helper command: /sbin/drbdadm disconnected
Jan  2 10:23:28 castle kernel: [4361194.968763] drbd windows-wm 
san7.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Jan  2 10:23:28 castle kernel: [4361194.968800] drbd windows-wm 
san7.mytest.com.au: conn( BrokenPipe -> Unconnected )
Jan  2 10:23:28 castle kernel: [4361194.968812] drbd windows-wm 
san7.mytest.com.au: Restarting receiver thread
Jan  2 10:23:28 castle kernel: [4361194.968816] drbd windows-wm 
san7.mytest.com.au: conn( Unconnected -> Connecting )
Jan  2 10:23:29 castle kernel: [4361195.486059] drbd windows-wm 
san7.mytest.com.au: Handshake to peer 3 successful: Agreed network 
protocol version 117
Jan  2 10:23:29 castle kernel: [4361195.486066] drbd windows-wm 
san7.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM 
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan  2 10:23:29 castle kernel: [4361195.486490] drbd windows-wm 
san7.mytest.com.au: Peer authenticated using 20 bytes HMAC
Jan  2 10:23:29 castle kernel: [4361195.486515] drbd windows-wm 
san7.mytest.com.au: Starting ack_recv thread (from drbd_r_windows- [1165])
Jan  2 10:23:29 castle kernel: [4361195.517928] drbd windows-wm/0 
drbd1001 san7.mytest.com.au: drbd_sync_handshake:
Jan  2 10:23:29 castle kernel: [4361195.517935] drbd windows-wm/0 
drbd1001 san7.mytest.com.au: self 
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000 
bits:0 flags:120
Jan  2 10:23:29 castle kernel: [4361195.517940] drbd windows-wm/0 
drbd1001 san7.mytest.com.au: peer 
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000 
bits:0 flags:120
Jan  2 10:23:29 castle kernel: [4361195.517944] drbd windows-wm/0 
drbd1001 san7.mytest.com.au: uuid_compare()=no-sync by rule 38
Jan  2 10:23:29 castle kernel: [4361195.677932] drbd windows-wm: 
Preparing cluster-wide state change 3667329610 (0->3 499/146)
Jan  2 10:23:29 castle kernel: [4361195.678459] drbd windows-wm: State 
change 3667329610: primary_nodes=0, weak_nodes=0
Jan  2 10:23:29 castle kernel: [4361195.678466] drbd windows-wm: 
Committing cluster-wide state change 3667329610 (0ms)
Jan  2 10:23:29 castle kernel: [4361195.678516] drbd windows-wm 
san7.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown -> 
Secondary )
Jan  2 10:23:29 castle kernel: [4361195.678522] drbd windows-wm/0 
drbd1001 san7.mytest.com.au: pdsk( DUnknown -> UpToDate ) repl( Off -> 
Established )

castle:/var/log# linstor resource list
┊ ResourceName ┊ Node   ┊ Port ┊ Usage  ┊ Conns                                                                              ┊ State             ┊ CreatedOn           ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ crossbowold  ┊ castle ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-10-07 00:46:23 ┊
┊ crossbowold  ┊ flail  ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊ Diskless          ┊ 2021-01-04 05:03:20 ┊
┊ crossbowold  ┊ san5   ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-10-07 00:46:23 ┊
┊ crossbowold  ┊ san6   ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-10-07 00:46:22 ┊
┊ crossbowold  ┊ san7   ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-10-07 00:46:21 ┊
┊ crossbowold  ┊ xen1   ┊ 7010 ┊ InUse  ┊ Ok                                                                                 ┊ Diskless          ┊ 2020-10-15 00:30:31 ┊
┊ jspiteriVM1  ┊ castle ┊ 7011 ┊ Unused ┊ StandAlone(san6.mytest.com.au,san7.mytest.com.au)                                  ┊ SyncTarget(0.00%) ┊ 2020-10-14 22:15:00 ┊
┊ jspiteriVM1  ┊ san5   ┊ 7011 ┊ Unused ┊ Connecting(san7.mytest.com.au)                                                     ┊ Inconsistent      ┊ 2020-10-14 22:14:59 ┊
┊ jspiteriVM1  ┊ san6   ┊ 7011 ┊ Unused ┊ Connecting(castle.mytest.com.au,san7.mytest.com.au)                                ┊ SyncTarget(0.00%) ┊ 2020-10-14 22:14:58 ┊
┊ jspiteriVM1  ┊ san7   ┊ 7011 ┊ Unused ┊ Connecting(castle.mytest.com.au),StandAlone(san6.mytest.com.au,san5.mytest.com.au) ┊ Inconsistent      ┊ 2020-10-14 22:14:58 ┊
┊ jspiteriVM1  ┊ xen1   ┊ 7011 ┊ Unused ┊ Ok                                                                                 ┊ Diskless          ┊ 2020-11-20 20:39:20 ┊
┊ ns2          ┊ castle ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-10-28 23:22:13 ┊
┊ ns2          ┊ flail  ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊ Diskless          ┊ 2021-01-04 05:03:42 ┊
┊ ns2          ┊ san5   ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-10-28 23:22:12 ┊
┊ ns2          ┊ san6   ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-10-28 23:22:11 ┊
┊ ns2          ┊ xen1   ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊ Diskless          ┊ 2020-10-28 23:30:20 ┊
┊ windows-wm   ┊ castle ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-09-30 00:03:41 ┊
┊ windows-wm   ┊ flail  ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊ Diskless          ┊ 2021-01-04 05:03:48 ┊
┊ windows-wm   ┊ san5   ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-09-30 00:03:40 ┊
┊ windows-wm   ┊ san6   ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-09-30 00:03:39 ┊
┊ windows-wm   ┊ san7   ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊ UpToDate          ┊ 2020-09-30 00:13:05 ┊

Could anyone determine from this, or advise what additional logs I 
should examine, to work out why this failed? I don't see anything 
obvious that would have caused LINSTOR/DRBD to fail here; all nodes were 
online and uninterrupted as far as I can tell. All physical storage is 
backed by MD RAID arrays, so there is some protection against disk 
failures (and I haven't noticed any, in any case).
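In case it helps others suggest where to look, this is roughly what I'd run next to gather more state (a sketch; exact option spellings may differ between drbd-utils/linstor-client versions, and the resource name is from my setup):

```shell
# Any error reports LINSTOR itself recorded, cluster-wide:
linstor error-reports list
# linstor error-reports show <report-id>   # for a specific entry

# Detailed DRBD view of the broken resource, per node:
drbdadm status jspiteriVM1

# Lower-level state, including per-peer connection/replication
# states and resync statistics:
drbdsetup status jspiteriVM1 --verbose --statistics

# One-shot dump of the current state change events:
drbdsetup events2 --now jspiteriVM1
```

The `drbdsetup` output in particular shows the per-peer-device replication state that the `linstor resource list` summary condenses.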

I've since done an upgrade to the latest versions of the DRBD/LINSTOR 
components on all nodes.

Finally, what could I do to recover the data? Has it been destroyed, or 
do I just need to select a node and tell LINSTOR that that node has 
up-to-date data? Or can LINSTOR work that out somehow?
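For reference, this is my (possibly wrong) understanding of the recovery path, assuming one node still holds a good copy; I'd appreciate confirmation before I run any of it:

```shell
# Sketch only, based on my reading of the DRBD 9 docs. Assumes the DRBD
# resource name matches the LINSTOR resource name (jspiteriVM1) and that
# the node chosen as the "good" copy is NOT touched.

# 1. Clear the StandAlone states by bouncing the connections on the
#    affected nodes:
drbdadm disconnect jspiteriVM1
drbdadm connect jspiteriVM1

# 2. On each node whose local copy should be discarded, mark it
#    Inconsistent so it does a full resync from an UpToDate peer:
drbdadm invalidate jspiteriVM1
```

An alternative, heavier-handed option would presumably be to delete and re-create the resource on the bad nodes via LINSTOR, letting it resync from scratch, but I'd rather understand the lighter fix first.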

Regards,
Adam

_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user at lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user
