Hello list,
I want to tell you about my experience with drbd9 triple replication and the ugly synchronization bugs I ran into.
First, I installed drbd9 on Ubuntu 16.04, with the latest drbd-dkms module:
# uname -a
Linux controller03 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15
UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
# cat /proc/drbd
version: 9.0.6-1 (api:2/proto:86-112)
GIT-hash: 08cda190c4f544a0c4e15ba792bbf47c69707b42 build by root@controller03, 2017-02-15 22:52:06
Transports (api:15): tcp (1.0.0)
I don't use the drbdmanage tool because it requires an additional volume group; I use drbdadm instead.
My idea is to create one replicated device shared between three primary nodes, with a shared cluster filesystem (gfs2) on top of it.
Here is my config:
# cat /etc/drbd.d/r0.res
resource r0 {
    net {
        cram-hmac-alg sha1;
        shared-secret "hackme";
        allow-two-primaries yes;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    startup {
        wfc-timeout 15;
        degr-wfc-timeout 60;
        become-primary-on both;
    }
    volume 0 {
        device /dev/drbd0 minor 0;
        disk /dev/mapper/cl-drbd;
        meta-disk internal;
    }
    on controller01 {
        node-id 0;
        address 192.168.101.11:7788;
    }
    on controller02 {
        node-id 1;
        address 192.168.101.12:7788;
    }
    on controller03 {
        node-id 2;
        address 192.168.101.13:7788;
    }
    connection-mesh {
        hosts controller01 controller02 controller03;
        net {
            protocol C;
        }
    }
}
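Deployment was roughly the standard drbdadm sequence (a sketch, not an exact transcript; run as root, with the resource name and paths from the config above):

```shell
# on all three nodes: write metadata and bring the resource up
drbdadm create-md r0
drbdadm up r0

# on one node only: declare this copy the sync source and start
# the initial full synchronization
drbdadm primary --force r0

# on the remaining nodes, once their disks are UpToDate:
drbdadm primary r0
```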
After deployment and the initial synchronization, everything works fine:
# drbdsetup status
r0 role:Primary
  disk:UpToDate
  controller02 role:Primary
    peer-disk:UpToDate
  controller03 role:Primary
    peer-disk:UpToDate
But sometimes, for example after a forced reboot of one of the nodes, I get this incomprehensible situation:
# ssh controller01 drbdsetup status
r0 role:Primary
  disk:UpToDate
  controller02 role:Primary
    peer-disk:UpToDate
  controller03 connection:Connecting
# ssh controller02 drbdsetup status
r0 role:Primary
  disk:UpToDate
  controller01 role:Primary
    peer-disk:UpToDate
  controller03 role:Primary
    peer-disk:UpToDate
# ssh controller03 drbdsetup status
r0 role:Primary
  disk:UpToDate
  controller01 connection:Connecting
  controller02 role:Primary
    peer-disk:UpToDate
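The asymmetry is easier to spot with a small filter over the `drbdsetup status` output: established peers print a `role:` line, while broken ones print `connection:Connecting`, `connection:StandAlone`, and so on. (The `check_stuck` helper below is my own illustration, not a drbd tool.)

```shell
# check_stuck: list peers without an established connection in
# `drbdsetup status` output
check_stuck() {
    grep 'connection:' | sed 's/^[[:space:]]*//'
}

# example: the status captured on controller01 above
check_stuck <<'EOF'
r0 role:Primary
  disk:UpToDate
  controller02 role:Primary
    peer-disk:UpToDate
  controller03 connection:Connecting
EOF
# -> controller03 connection:Connecting
```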
dmesg output from controller01:
[ 921.394274] drbd r0 controller03: conn( NetworkFailure -> Unconnected )
[ 921.394295] drbd r0 controller03: Restarting receiver thread
[ 921.394328] drbd r0 controller03: conn( Unconnected -> Connecting )
[ 921.885827] drbd r0 controller02: Preparing remote state change 642056233 (primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC)
[ 921.885948] drbd r0 controller02: Aborting remote state change 642056233
[ 921.909449] drbd r0 controller03: Handshake to peer 2 successful: Agreed network protocol version 112
[ 921.909452] drbd r0 controller03: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
[ 921.909578] drbd r0 controller03: Peer authenticated using 20 bytes HMAC
[ 921.909597] drbd r0 controller03: Starting ack_recv thread (from drbd_r_r0 [2361])
[ 922.014142] drbd r0: Preparing cluster-wide state change 668430455 (0->2 499/145)
[ 922.014327] drbd r0: Aborting cluster-wide state change 668430455 (0ms) rv = -10
[ 922.014336] drbd r0 controller03: Failure to connect; retrying
[ 922.014344] drbd r0 controller03: conn( Connecting -> NetworkFailure )
[ 922.014390] drbd r0 controller03: ack_receiver terminated
[ 922.014391] drbd r0 controller03: Terminating ack_recv thread
[ 922.038295] drbd r0 controller03: Connection closed
dmesg output from controller02:
[ 524.670133] drbd r0 controller01: conn( BrokenPipe -> Unconnected )
[ 524.670145] drbd r0 controller01: Restarting receiver thread
[ 524.670155] drbd r0 controller01: conn( Unconnected -> Connecting )
[ 525.170032] drbd r0 controller01: Handshake to peer 0 successful: Agreed network protocol version 112
[ 525.170034] drbd r0 controller01: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
[ 525.170169] drbd r0 controller01: Peer authenticated using 20 bytes HMAC
[ 525.170176] drbd r0 controller01: Starting ack_recv thread (from drbd_r_r0 [1660])
[ 525.202747] drbd r0 controller01: Preparing remote state change 3822078331 (primary_nodes=0, weak_nodes=0)
[ 525.202918] drbd r0 controller01: Aborting remote state change 3822078331
[ 525.222734] drbd r0 controller01: sock was shut down by peer
[ 525.222742] drbd r0 controller01: conn( Connecting -> BrokenPipe )
[ 525.222763] drbd r0 controller01: ack_receiver terminated
[ 525.222764] drbd r0 controller01: Terminating ack_recv thread
[ 525.246142] drbd r0 controller01: Connection closed
These actions also do not change anything:
# drbdadm secondary r0
# drbdadm disconnect r0
# drbdadm -- --discard-my-data connect r0
Sometimes in the console output I see errors like these:
[ 686.816906] drbd r0: State change failed: Need a connection to start verify or resync
[ 730.143054] drbd r0 tcp:controller01: Closing unexpected connection from 192.168.101.12:7788
It also sometimes happens that not just one but both connections stop working.