Note: "permalinks" may not be as permanent as we would like,
direct links to old sources may well be a few messages off.
Repost from 13th Jan.

Hello all, sorry, this will be a longer post! I have been seeing some strange issues for a few weeks now: sometimes DRBD runs into a split brain, but I do not really understand why.

I run a Proxmox cluster with 2 nodes and a single VM, which runs on the first node (node1); the other node (node2) is the HA backup node that the VM fails over to if something happens.

The backing discs are md RAID1 devices on both nodes:

Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[1]
      2930129536 blocks super 1.2 [2/2] [UU]

DRBD config:

resource r1 {
        protocol C;
        startup {
                wfc-timeout 0;
                degr-wfc-timeout 60;
                become-primary-on both;
        }
        net {
                sndbuf-size 10M;
                rcvbuf-size 10M;
                ping-int 2;
                ping-timeout 2;
                connect-int 2;
                timeout 5;
                ko-count 5;
                max-buffers 128k;
                max-epoch-size 8192;
                cram-hmac-alg sha1;
                shared-secret "XXXXXX";
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }
        on node1 {
                device /dev/drbd0;
                disk /dev/md2;
                address 10.1.5.31:7788;
                meta-disk internal;
        }
        on node2 {
                device /dev/drbd0;
                disk /dev/md2;
                address 10.1.5.32:7788;
                meta-disk internal;
        }
        disk {
                no-disk-flushes;
                no-md-flushes;
                no-disk-barrier;
        }
}

The discs for the VM are LVs and are only mounted inside the VM; vm-101-disk-1 is the VM root filesystem and disk-2 is the VM mail storage. There are (and should be) no mounts or direct accesses from the nodes themselves!

  --- Logical volume ---
  LV Path                /dev/drbd0vg/vm-101-disk-1
  LV Name                vm-101-disk-1
  VG Name                drbd0vg
  LV Size                75.00 GiB

  --- Logical volume ---
  LV Path                /dev/drbd0vg/vm-101-disk-2
  LV Name                vm-101-disk-2
  VG Name                drbd0vg
  LV Size                550.00 GiB

The nodes do not use or mount anything from /dev/drbd0vg/:

Filesystem            Size  Used Avail Use% Mounted on
udev                   10M     0   10M   0% /dev
tmpfs                 1.6G  504K  1.6G   1% /run
/dev/mapper/pve-root   78G  3.0G   72G   4% /
tmpfs                 5.0M  4.0K  5.0M   1% /run/lock
tmpfs                 3.2G   50M  3.1G   2% /run/shm
/dev/sdc1             232M   72M  148M  33% /boot
/dev/fuse              30M   24K   30M   1% /etc/pve

So DRBD runs Primary/Primary, but how can anything change on the second node if nothing runs there and the LVs are not mounted? There should be no new data on the DRBD volume on node2. But the last time I did a resync (after I had stopped the VM on node1!) it synced 15 GB from node2 to node1! Unbelievable!! I took a screenshot, but I am not sure I can attach it here. The DRBD status on node1 was:

Primary/Primary ds:Inconsistent/UpToDate

So I think the left side is node1 and the right side is node2? How can node2 be UpToDate? I do not understand this, because node2 was not running anything with access to the LVs! The VM had some filesystem errors when it was started on node1 after that sync. :-(

The DRBD link used to be a crosslink cable, and to make sure that is not the problem, on Sunday I installed a switch and regular cables used only for the DRBD network.
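In the meantime I would at least like to be notified when it happens again, so I am thinking about adding a split-brain handler to the resource, roughly like this (only a sketch, not yet tested on my setup; I am assuming the notify-split-brain.sh helper that drbd8-utils installs under /usr/lib/drbd/):

        handlers {
                # mail root when DRBD detects a split brain on this resource
                split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        }

As far as I understand it, this only sends a mail when split brain is detected; it does not resolve anything by itself.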
When the break happened, I did not see eth1 go down or anything like that. This is the kernel log on node1 around the split brain:

Jan 12 10:49:34 node1 kernel: block drbd0: Remote failed to finish a request within ko-count * timeout
Jan 12 10:49:34 node1 kernel: block drbd0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
Jan 12 10:49:34 node1 kernel: block drbd0: asender terminated
Jan 12 10:49:34 node1 kernel: block drbd0: Terminating asender thread
Jan 12 10:49:34 node1 kernel: block drbd0: new current UUID D4335C79AD0E0BC3:AE406068788B0F3B:95D06B8F4DD0CE03:95CF6B8F4DD0CE03
Jan 12 10:49:35 node1 kernel: block drbd0: Connection closed
Jan 12 10:49:35 node1 kernel: block drbd0: conn( Timeout -> Unconnected )
Jan 12 10:49:35 node1 kernel: block drbd0: receiver terminated
Jan 12 10:49:35 node1 kernel: block drbd0: Restarting receiver thread
Jan 12 10:49:35 node1 kernel: block drbd0: receiver (re)started
Jan 12 10:49:35 node1 kernel: block drbd0: conn( Unconnected -> WFConnection )
Jan 12 10:49:35 node1 kernel: block drbd0: Handshake successful: Agreed network protocol version 96
Jan 12 10:49:35 node1 kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Jan 12 10:49:35 node1 kernel: block drbd0: conn( WFConnection -> WFReportParams )
Jan 12 10:49:35 node1 kernel: block drbd0: Starting asender thread (from drbd0_receiver [2840])
Jan 12 10:49:35 node1 kernel: block drbd0: data-integrity-alg: <not-used>
Jan 12 10:49:35 node1 kernel: block drbd0: drbd_sync_handshake:
Jan 12 10:49:35 node1 kernel: block drbd0: self D4335C79AD0E0BC3:AE406068788B0F3B:95D06B8F4DD0CE03:95CF6B8F4DD0CE03 bits:42099 flags:0
Jan 12 10:49:35 node1 kernel: block drbd0: peer A5FBD7AF4A9FD583:AE406068788B0F3B:95D06B8F4DD0CE03:95CF6B8F4DD0CE03 bits:0 flags:0
Jan 12 10:49:35 node1 kernel: block drbd0: uuid_compare()=100 by rule 90
Jan 12 10:49:35 node1 kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
Jan 12 10:49:35 node1 kernel: block drbd0: meta connection shut down by peer.
Jan 12 10:49:35 node1 kernel: block drbd0: conn( WFReportParams -> NetworkFailure )
Jan 12 10:49:35 node1 kernel: block drbd0: asender terminated
Jan 12 10:49:35 node1 kernel: block drbd0: Terminating asender thread
Jan 12 10:49:35 node1 kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
Jan 12 10:49:35 node1 kernel: block drbd0: Split-Brain detected but unresolved, dropping connection!
Jan 12 10:49:35 node1 kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0
Jan 12 10:49:35 node1 kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
Jan 12 10:49:35 node1 kernel: block drbd0: conn( NetworkFailure -> Disconnecting )
Jan 12 10:49:35 node1 kernel: block drbd0: error receiving ReportState, l: 4!
Jan 12 10:49:35 node1 kernel: block drbd0: Connection closed
Jan 12 10:49:35 node1 kernel: block drbd0: conn( Disconnecting -> StandAlone )
Jan 12 10:49:35 node1 kernel: block drbd0: receiver terminated
Jan 12 10:49:35 node1 kernel: block drbd0: Terminating receiver thread

grep eth1 /var/log/kern.log
Jan 11 15:10:42 node1 kernel: igb 0000:07:00.1: eth1: (PCIe:5.0GT/s:Width x2)
Jan 11 15:10:42 node1 kernel: igb 0000:07:00.1: eth1: MAC: f8:0f:41:fb:32:21
Jan 11 15:10:42 node1 kernel: igb 0000:07:00.1: eth1: PBA No: 106300-000
Jan 11 15:10:42 node1 kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Jan 11 15:10:42 node1 kernel: igb 0000:07:00.1: eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jan 11 15:10:42 node1 kernel: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
Jan 11 15:10:52 node1 kernel: eth1: no IPv6 routers present

eth1 is still up, with no errors; ping and ssh still work on that interface!

eth1      Link encap:Ethernet  HWaddr f8:0f:41:fb:32:21
          inet addr:10.1.5.31  Bcast:10.1.5.255  Mask:255.255.255.0
          inet6 addr: fe80::fa0f:41ff:fefb:3221/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:13129653 errors:0 dropped:0 overruns:0 frame:0
          TX packets:13383845 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:17271836211 (16.0 GiB)  TX bytes:15506649645 (14.4 GiB)

uname -a
Linux node1 2.6.32-34-pve #1 SMP Fri Dec 19 07:42:04 CET 2014 x86_64 GNU/Linux

On node2 I cannot find any disc errors or anything like that either.

So what can the problem be, and how can I fix it? (The manual recovery I have in mind is sketched at the end of this mail.) I read the documentation, and if I understand it correctly, automatic split-brain repair is not usable in my situation because I do not know where the VM was last running. I will try to attach the screenshot as well.

Any hints?

Regards
Richard

PS: I saw Eric's post where he mentions: "The split brain would only happen on dual primary." So I changed to Primary/Secondary and stopped HA in Proxmox. In the last few days no errors have occurred, but I will have to keep watching this over the next weeks.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: node1.png
Type: image/png
Size: 142332 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20150118/f92ba091/attachment.png>
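PPS: For the record, this is the manual split-brain recovery I think I would have to run, going by the user's guide. It is only a sketch, assuming node1 holds the data I want to keep, node2 is the victim whose changes get discarded, and the connection is already StandAlone on both sides; on newer drbd-utils the --discard-my-data option is given after "connect" instead of via "--". Please correct me if this is wrong:

# on node2, the node whose changes will be thrown away
drbdadm secondary r1
drbdadm -- --discard-my-data connect r1

# on node1, the node whose data survives, if it is also StandAlone
drbdadm connect r1

# then node2 resyncs from node1; watch until both sides show UpToDate/UpToDate
cat /proc/drbd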