Hello lists,

I want to tell you about my experience with DRBD 9 triple replication and some ugly synchronization bugs.

First, I installed drbd9 on Ubuntu 16.04 with the latest drbd-dkms module:

# uname -a
Linux controller03 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
# cat /proc/drbd
version: 9.0.6-1 (api:2/proto:86-112)
GIT-hash: 08cda190c4f544a0c4e15ba792bbf47c69707b42 build by root at controller03, 2017-02-15 22:52:06
Transports (api:15): tcp (1.0.0)

I didn't use the drbdmanage tool because it requires an additional volume group; I use drbdadm instead. My idea is to create one device replicated between three primary nodes, with a shared cluster filesystem (GFS2) on top of it. Here is my config:

# cat /etc/drbd.d/r0.res
resource r0 {
    net {
        cram-hmac-alg sha1;
        shared-secret "hackme";
        allow-two-primaries yes;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    startup {
        wfc-timeout 15;
        degr-wfc-timeout 60;
        become-primary-on both;
    }
    volume 0 {
        device /dev/drbd0 minor 0;
        disk /dev/mapper/cl-drbd;
        meta-disk internal;
    }
    on controller01 {
        node-id 0;
        address 192.168.101.11:7788;
    }
    on controller02 {
        node-id 1;
        address 192.168.101.12:7788;
    }
    on controller03 {
        node-id 2;
        address 192.168.101.13:7788;
    }
    connection-mesh {
        hosts controller01 controller02 controller03;
        net {
            protocol C;
        }
    }
}

After deployment and the initial synchronization everything works fine:

# drbdsetup status
r0 role:Primary
  disk:UpToDate
  controller02 role:Primary
    peer-disk:UpToDate
  controller03 role:Primary
    peer-disk:UpToDate

But sometimes, for example after a forced reboot of one of the nodes, I get this incomprehensible situation:

# ssh controller01 drbdsetup status
r0 role:Primary
  disk:UpToDate
  controller02 role:Primary
    peer-disk:UpToDate
  controller03 connection:Connecting

# ssh controller02 drbdsetup status
r0 role:Primary
  disk:UpToDate
  controller01 role:Primary
    peer-disk:UpToDate
  controller03 role:Primary
    peer-disk:UpToDate

# ssh controller03 drbdsetup status
r0
role:Primary
  disk:UpToDate
  controller01 connection:Connecting
  controller02 role:Primary
    peer-disk:UpToDate

dmesg output from controller01:

[ 921.394274] drbd r0 controller03: conn( NetworkFailure -> Unconnected )
[ 921.394295] drbd r0 controller03: Restarting receiver thread
[ 921.394328] drbd r0 controller03: conn( Unconnected -> Connecting )
[ 921.885827] drbd r0 controller02: Preparing remote state change 642056233 (primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC)
[ 921.885948] drbd r0 controller02: Aborting remote state change 642056233
[ 921.909449] drbd r0 controller03: Handshake to peer 2 successful: Agreed network protocol version 112
[ 921.909452] drbd r0 controller03: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
[ 921.909578] drbd r0 controller03: Peer authenticated using 20 bytes HMAC
[ 921.909597] drbd r0 controller03: Starting ack_recv thread (from drbd_r_r0 [2361])
[ 922.014142] drbd r0: Preparing cluster-wide state change 668430455 (0->2 499/145)
[ 922.014327] drbd r0: Aborting cluster-wide state change 668430455 (0ms) rv = -10
[ 922.014336] drbd r0 controller03: Failure to connect; retrying
[ 922.014344] drbd r0 controller03: conn( Connecting -> NetworkFailure )
[ 922.014390] drbd r0 controller03: ack_receiver terminated
[ 922.014391] drbd r0 controller03: Terminating ack_recv thread
[ 922.038295] drbd r0 controller03: Connection closed

dmesg output from controller02:

[ 524.670133] drbd r0 controller01: conn( BrokenPipe -> Unconnected )
[ 524.670145] drbd r0 controller01: Restarting receiver thread
[ 524.670155] drbd r0 controller01: conn( Unconnected -> Connecting )
[ 525.170032] drbd r0 controller01: Handshake to peer 0 successful: Agreed network protocol version 112
[ 525.170034] drbd r0 controller01: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
[ 525.170169] drbd r0 controller01: Peer authenticated using 20 bytes HMAC
[ 525.170176] drbd r0 controller01: Starting ack_recv thread (from drbd_r_r0 [1660])
[ 525.202747] drbd r0 controller01: Preparing remote state change 3822078331 (primary_nodes=0, weak_nodes=0)
[ 525.202918] drbd r0 controller01: Aborting remote state change 3822078331
[ 525.222734] drbd r0 controller01: sock was shut down by peer
[ 525.222742] drbd r0 controller01: conn( Connecting -> BrokenPipe )
[ 525.222763] drbd r0 controller01: ack_receiver terminated
[ 525.222764] drbd r0 controller01: Terminating ack_recv thread
[ 525.246142] drbd r0 controller01: Connection closed

These actions also do not change anything:

# drbdadm secondary r0
# drbdadm disconnect r0
# drbdadm -- --discard-my-data connect r0

Sometimes I also see errors like these in the console output:

[ 686.816906] drbd r0: State change failed: Need a connection to start verify or resync
[ 730.143054] drbd r0 tcp:controller01: Closing unexpected connection from 192.168.101.12:7788

And sometimes it is not just one connection that stops working, but both.
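For completeness, here is roughly how I brought the resource up in the first place. This is a sketch of the standard drbdadm flow rather than an exact transcript, and it assumes the r0.res above is already deployed on all three nodes, with one node chosen to seed the initial sync:

```shell
# On every node: write the DRBD metadata into the backing device
# (internal meta-disk, per the config above) and bring the resource up.
drbdadm create-md r0
drbdadm up r0

# On ONE node only: force-promote so this node's data becomes the
# starting point and the peers resync from it.
drbdadm primary --force r0

# Watch the resync progress; wait until both peers show peer-disk:UpToDate.
drbdsetup status r0

# On the remaining nodes, once the resync has finished, promote normally.
drbdadm primary r0
```

After that, GFS2 goes on top of /dev/drbd0 and is mounted on all three primaries.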