Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Repost from 13 Jan.
Hello all,
sorry, this will be a longer post!
I have been seeing some strange issues for a few weeks now. Sometimes DRBD runs
into a split brain, but I don't really understand why.
I run a Proxmox cluster with two nodes and only one VM, which runs on the first
node (Node1); the other node (Node2) is the HA backup node that the VM is
switched to if something happens.
The disks are MD RAID1 arrays on both nodes:
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[1]
      2930129536 blocks super 1.2 [2/2] [UU]
DRBD config:
resource r1 {
    protocol C;

    startup {
        wfc-timeout 0;
        degr-wfc-timeout 60;
        become-primary-on both;
    }

    net {
        sndbuf-size 10M;
        rcvbuf-size 10M;
        ping-int 2;
        ping-timeout 2;
        connect-int 2;
        timeout 5;
        ko-count 5;
        max-buffers 128k;
        max-epoch-size 8192;
        cram-hmac-alg sha1;
        shared-secret "XXXXXX";
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }

    on node1 {
        device /dev/drbd0;
        disk /dev/md2;
        address 10.1.5.31:7788;
        meta-disk internal;
    }

    on node2 {
        device /dev/drbd0;
        disk /dev/md2;
        address 10.1.5.32:7788;
        meta-disk internal;
    }

    disk {
        no-disk-flushes;
        no-md-flushes;
        no-disk-barrier;
    }
}
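For completeness, this is roughly how I check the resource on both nodes
(nothing special, just the standard drbdadm/proc views; r1 is the resource
name from the config above):

# show the configuration as drbdadm actually parses it
drbdadm dump r1
# live state: connection state, roles and disk states (local node / peer)
cat /proc/drbd
drbdadm cstate r1
drbdadm role r1
drbdadm dstate r1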
The disks for the VM are LVs and are only mounted inside the VM; vm-101-disk-1
is the VM root filesystem and disk-2 is the VM mail storage. There is / should
be no mount or access from the nodes directly!
--- Logical volume ---
LV Path /dev/drbd0vg/vm-101-disk-1
LV Name vm-101-disk-1
VG Name drbd0vg
LV Size 75,00 GiB
--- Logical volume ---
LV Path /dev/drbd0vg/vm-101-disk-2
LV Name vm-101-disk-2
VG Name drbd0vg
LV Size 550,00 GiB
The nodes don't use or mount anything from /dev/drbd0vg/ (see also the check
sketched below the df output):
Filesystem            Size  Used Avail Use% Mounted on
udev 10M 0 10M 0% /dev
tmpfs 1,6G 504K 1,6G 1% /run
/dev/mapper/pve-root 78G 3,0G 72G 4% /
tmpfs 5,0M 4,0K 5,0M 1% /run/lock
tmpfs 3,2G 50M 3,1G 2% /run/shm
/dev/sdc1 232M 72M 148M 33% /boot
/dev/fuse 30M 24K 30M 1% /etc/pve
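To make sure nothing on the nodes themselves holds the LVs open, a check like
this should be enough if I'm not mistaken (standard LVM/device-mapper tools;
the 6th character of lv_attr shows 'o' when an LV is opened):

# on Node2, while the VM is stopped: no LV of drbd0vg should be open
lvs -o lv_name,vg_name,lv_attr drbd0vg
# open count per device-mapper device (expect 0 for the vm-101 volumes)
dmsetup info -c | grep drbd0vg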
So DRBD runs Primary/Primary, but how can anything change on the second node if
nothing is running there and the LVs are not mounted? There should be no new
data on the DRBD volume on Node2. But the last time I did a resync (after I
stopped the VM on Node1!), it synced 15 GB from Node2 to Node1! Unbelievable!
I took a screenshot, but I'm not sure I can attach it here.
The DRBD status on Node1 was:
Primary/Primary ds:Inconsistent/UpToDate
So I think the left side is Node1 and the right side is Node2? How can Node2 be
UpToDate? I don't understand this, because nothing was running on Node2 with
access to the LVs!
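If I read the User's Guide correctly, in ro: and ds: the value before the slash
is always the local node and the value after it is the peer; a normal line in
/proc/drbd looks like this (just an example, not my real output):

 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----

So "Inconsistent/UpToDate" seen on Node1 would mean Node1's own disk was
Inconsistent and it considered Node2's disk UpToDate.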
I had some filesystem errors inside the VM when it started after the sync on
Node1. :-(
Before, it was a crossover cable, and I wanted to make sure there is no problem
there, so on Sunday I installed a switch and normal cables just for the DRBD
network! (But when the break happened, I did not see eth1 going down or
anything like that.)
Jan 12 10:49:34 node1 kernel: block drbd0: Remote failed to finish a request
within ko-count * timeout
Jan 12 10:49:34 node1 kernel: block drbd0: peer( Primary -> Unknown ) conn(
Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
Jan 12 10:49:34 node1 kernel: block drbd0: asender terminated
Jan 12 10:49:34 node1 kernel: block drbd0: Terminating asender thread
Jan 12 10:49:34 node1 kernel: block drbd0: new current UUID
D4335C79AD0E0BC3:AE406068788B0F3B:95D06B8F4DD0CE03:95CF6B8F4DD0CE03
Jan 12 10:49:35 node1 kernel: block drbd0: Connection closed
Jan 12 10:49:35 node1 kernel: block drbd0: conn( Timeout -> Unconnected )
Jan 12 10:49:35 node1 kernel: block drbd0: receiver terminated
Jan 12 10:49:35 node1 kernel: block drbd0: Restarting receiver thread
Jan 12 10:49:35 node1 kernel: block drbd0: receiver (re)started
Jan 12 10:49:35 node1 kernel: block drbd0: conn( Unconnected -> WFConnection )
Jan 12 10:49:35 node1 kernel: block drbd0: Handshake successful: Agreed
network protocol version 96
Jan 12 10:49:35 node1 kernel: block drbd0: Peer authenticated using 20 bytes
of 'sha1' HMAC
Jan 12 10:49:35 node1 kernel: block drbd0: conn( WFConnection ->
WFReportParams )
Jan 12 10:49:35 node1 kernel: block drbd0: Starting asender thread (from
drbd0_receiver [2840])
Jan 12 10:49:35 node1 kernel: block drbd0: data-integrity-alg: <not-used>
Jan 12 10:49:35 node1 kernel: block drbd0: drbd_sync_handshake:
Jan 12 10:49:35 node1 kernel: block drbd0: self
D4335C79AD0E0BC3:AE406068788B0F3B:95D06B8F4DD0CE03:95CF6B8F4DD0CE03 bits:42099
flags:0
Jan 12 10:49:35 node1 kernel: block drbd0: peer
A5FBD7AF4A9FD583:AE406068788B0F3B:95D06B8F4DD0CE03:95CF6B8F4DD0CE03 bits:0
flags:0
Jan 12 10:49:35 node1 kernel: block drbd0: uuid_compare()=100 by rule 90
Jan 12 10:49:35 node1 kernel: block drbd0: helper command: /sbin/drbdadm
initial-split-brain minor-0
Jan 12 10:49:35 node1 kernel: block drbd0: meta connection shut down by peer.
Jan 12 10:49:35 node1 kernel: block drbd0: conn( WFReportParams ->
NetworkFailure )
Jan 12 10:49:35 node1 kernel: block drbd0: asender terminated
Jan 12 10:49:35 node1 kernel: block drbd0: Terminating asender thread
Jan 12 10:49:35 node1 kernel: block drbd0: helper command: /sbin/drbdadm
initial-split-brain minor-0 exit code 0 (0x0)
Jan 12 10:49:35 node1 kernel: block drbd0: Split-Brain detected but
unresolved, dropping connection!
Jan 12 10:49:35 node1 kernel: block drbd0: helper command: /sbin/drbdadm
split-brain minor-0
Jan 12 10:49:35 node1 kernel: block drbd0: helper command: /sbin/drbdadm
split-brain minor-0 exit code 0 (0x0)
Jan 12 10:49:35 node1 kernel: block drbd0: conn( NetworkFailure ->
Disconnecting )
Jan 12 10:49:35 node1 kernel: block drbd0: error receiving ReportState, l: 4!
Jan 12 10:49:35 node1 kernel: block drbd0: Connection closed
Jan 12 10:49:35 node1 kernel: block drbd0: conn( Disconnecting -> StandAlone )
Jan 12 10:49:35 node1 kernel: block drbd0: receiver terminated
Jan 12 10:49:35 node1 kernel: block drbd0: Terminating receiver thread
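In the log above I can see the split-brain helper being called, but I have no
handler configured for it; if I understand the documentation, a notification
handler could be added roughly like this (notify-split-brain.sh ships with the
drbd utils, the path may differ):

resource r1 {
    ...
    handlers {
        # send a mail to root when DRBD detects a split brain
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }
}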
grep eth1 /var/log/kern.log
Jan 11 15:10:42 node1 kernel: igb 0000:07:00.1: eth1: (PCIe:5.0GT/s:Width x2)
Jan 11 15:10:42 node1 kernel: igb 0000:07:00.1: eth1: MAC: f8:0f:41:fb:32:21
Jan 11 15:10:42 node1 kernel: igb 0000:07:00.1: eth1: PBA No: 106300-000
Jan 11 15:10:42 node1 kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Jan 11 15:10:42 node1 kernel: igb 0000:07:00.1: eth1: igb: eth1 NIC Link is Up
1000 Mbps Full Duplex, Flow Control: RX/TX
Jan 11 15:10:42 node1 kernel: ADDRCONF(NETDEV_CHANGE): eth1: link becomes
ready
Jan 11 15:10:52 node1 kernel: eth1: no IPv6 routers present
eth1 is still up, no errors, and ping and SSH are still working on that interface!
eth1      Link encap:Ethernet  HWaddr f8:0f:41:fb:32:21
          inet addr:10.1.5.31  Bcast:10.1.5.255  Mask:255.255.255.0
          inet6 addr: fe80::fa0f:41ff:fefb:3221/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:13129653 errors:0 dropped:0 overruns:0 frame:0
          TX packets:13383845 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:17271836211 (16.0 GiB)  TX bytes:15506649645 (14.4 GiB)
uname -a
Linux node1 2.6.32-34-pve #1 SMP Fri Dec 19 07:42:04 CET 2014 x86_64 GNU/Linux
On Node2 I don't find any disk errors or anything like that!
So what can the problem be, and how can I fix it? If I read the documentation
correctly, automatic split-brain recovery is not usable in my situation,
because I don't know on which node the VM was last running, so it would have to
be resolved by hand (see the commands sketched below).
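These are the recovery commands I found in the User's Guide, in case I
understood them right (to be run manually after deciding which node's changes
to throw away; r1 as above):

# on the node whose changes should be discarded (the split-brain "victim")
drbdadm secondary r1
drbdadm -- --discard-my-data connect r1   # 8.3 syntax; newer drbdadm: drbdadm connect --discard-my-data r1
# on the surviving node, only if it is in StandAlone as well
drbdadm connect r1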
I'll try to attach the screenshot as well.
Any hints?
Regards
Richard
PS: I saw Eric's post where he mentions: "The split brain would only happen on
dual primary."
So I changed to Primary/Secondary and stopped the HA in Proxmox.
In the last few days no errors have occurred, but I will have to keep an eye on
this over the next weeks.
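In case it helps anyone: the change was basically just removing the
dual-primary bits from the config above (a sketch, assuming nothing else is
needed for plain Primary/Secondary):

resource r1 {
    startup {
        wfc-timeout 0;
        degr-wfc-timeout 60;
        # become-primary-on both;   <- removed
    }
    net {
        ...
        # allow-two-primaries;      <- removed
        ...
    }
    ...
}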
-------------- next part --------------
A non-text attachment was scrubbed...
Name: node1.png
Type: image/png
Size: 142332 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20150118/f92ba091/attachment.png>