Lars,

Thank you very much for your explanation. In this case, if I had a
"connection reset by peer" error, the situation becomes even stranger.
Actually, I have two resources on this cluster, r0 and r1, and I had the
problem with r1 only. If it had been a communication "hiccup", I would
have had a problem with both resources simultaneously, but I didn't. The
split brain was for r1 only. See my config file below:

global {
    usage-count no;
}

common {
    protocol C;
}

resource r0 {
    device    /dev/drbd1;
    disk      /dev/sdb;
    meta-disk internal;

    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        ping-timeout 20;
    }
    startup {
        wfc-timeout 100;
        degr-wfc-timeout 60;
        become-primary-on both;
    }
    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }
    on infplsm004 {
        address 192.168.10.9:7789;
    }
    on infplsm005 {
        address 192.168.10.10:7789;
    }
}

resource r1 {
    device    /dev/drbd2;
    disk      /dev/sdc;
    meta-disk internal;

    # This is to allow dual primary mode.
    # http://www.drbd.org/users-guide-emb/s-enable-dual-primary.html
    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        ping-timeout 20;
    }
    startup {
        wfc-timeout 100;
        degr-wfc-timeout 60;
        become-primary-on both;
    }
    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }
    on infplsm004 {
        address 192.168.10.9:7790;
    }
    on infplsm005 {
        address 192.168.10.10:7790;
    }
}

Thank you,
Ivan

On 09/21/2011 10:15 PM, Lars Ellenberg wrote:
> On Wed, Sep 21, 2011 at 10:08:42AM +1000, Ivan Pavlenko wrote:
>> Hi All,
>>
>> Recently I had a split brain on my cluster. It was not a big
>> issue, but I still haven't found any reason for this glitch. I got
>> the following in my log file:
> We call it a DRBD resource internal split brain, when you have a period
> in time during which both nodes can not communicate, _and_ both have
> been Primary.
>
> Which means, whenever you run dual-primary DRBD, and have a hiccup on
> the replication link, that causes a DRBD "split brain",
> maybe better read that as "potential data-set divergence".
>
>> Sep 20 18:44:35 infplsm004<kern.info> kernel: VMCIUtil: Updating context id from 0x775d2835 to 0x775d2835 on event 0.
>> Sep 20 18:44:35 infplsm004<kern.err> kernel: block drbd2: sock_recvmsg returned -104
>> Sep 20 18:44:35 infplsm004<kern.info> kernel: block drbd2: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>> Sep 20 18:44:35 infplsm004<kern.info> kernel: block drbd2: asender terminated
>> Sep 20 18:44:35 infplsm004<kern.info> kernel: block drbd2: Terminating asender thread
>> Sep 20 18:44:35 infplsm004<kern.err> kernel: block drbd2: short read expecting header on sock: r=-512
>> Sep 20 18:44:35 infplsm004<kern.info> kernel: block drbd2: Creating new current UUID
>> Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: Connection closed
>> Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: conn( NetworkFailure -> Unconnected )
>> Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: receiver terminated
>> Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: Restarting receiver thread
>> Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: receiver (re)started
>> Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: conn( Unconnected -> WFConnection )
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: Handshake successful: Agreed network protocol version 94
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: conn( WFConnection -> WFReportParams )
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: Starting asender thread (from drbd2_receiver [11360])
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: data-integrity-alg:<not-used>
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: drbd_sync_handshake:
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: self AD9C020C7BA6E149:51B8CD59E67A7227:01C987FB5F84C0D1:30241D96D32A31CF bits:1 flags:0
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: peer A2111F74640A099D:51B8CD59E67A7227:01C987FB5F84C0D0:30241D96D32A31CF bits:0 flags:0
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: uuid_compare()=100 by rule 90
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
>> Sep 20 18:44:38 infplsm004<kern.alert> kernel: block drbd2: Split-Brain detected but unresolved, dropping connection!
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: helper command: /sbin/drbdadm split-brain minor-2
>> Sep 20 18:44:38 infplsm004<kern.err> kernel: block drbd2: meta connection shut down by peer.
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: conn( WFReportParams -> NetworkFailure )
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: asender terminated
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: Terminating asender thread
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: helper command: /sbin/drbdadm split-brain minor-2 exit code 0 (0x0)
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: conn( NetworkFailure -> Disconnecting )
>> Sep 20 18:44:38 infplsm004<kern.err> kernel: block drbd2: error receiving ReportState, l: 4!
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: Connection closed
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: conn( Disconnecting -> StandAlone )
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: receiver terminated
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: Terminating receiver thread
>>
>> I'd like to draw your attention to the first two rows. The DRBD socket
>> receive call returned code -104. What does it mean? Where can I get
>> info about error codes?
> These are typically normal negative errno codes,
> on my box 104 would be ECONNRESET, Connection reset by peer.
>
>> Thank you in advance,
>> Ivan
>>
>> _______________________________________________
>> drbd-user mailing list
>> drbd-user at lists.linbit.com
>> http://lists.linbit.com/mailman/listinfo/drbd-user
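
For anyone wanting to look up such codes themselves, here is a minimal
sketch in Python of the errno lookup described in the thread. It is not
part of DRBD; the function name describe_kernel_errno and the
KERNEL_INTERNAL table are made up for this illustration. It assumes the
numbers in the log are negated errno values. The symbolic names come from
whatever platform the script runs on, so the result for 104 is only
ECONNRESET on systems that number it that way (Linux on common
architectures does). The -512 from the "short read" line is not in the
userspace table; to my knowledge it is the kernel-internal ERESTARTSYS
from include/linux/errno.h, which is the only entry hard-coded here.

```python
import errno
import os

# Kernel-internal codes that are not exported to userspace, so Python's
# errno module does not know them. Assumption: value as defined in the
# Linux kernel's include/linux/errno.h.
KERNEL_INTERNAL = {512: "ERESTARTSYS"}


def describe_kernel_errno(code):
    """Return (symbolic name, message) for a negated errno from a kernel log."""
    number = abs(code)
    name = errno.errorcode.get(number)
    if name is not None:
        # Known userspace errno on this platform.
        return name, os.strerror(number)
    if number in KERNEL_INTERNAL:
        return KERNEL_INTERNAL[number], "kernel-internal code"
    return "UNKNOWN", "no entry for %d on this platform" % number


if __name__ == "__main__":
    # The two return codes that appear in the log above.
    for code in (-104, -512):
        name, message = describe_kernel_errno(code)
        print(code, name, message)
```

On a Linux box the same information is available from the errno(3) man
page or the kernel headers; the script just saves the lookup.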