On Thu, Sep 22, 2011 at 07:23:07AM +1000, Ivan Pavlenko wrote:
> Lars,
>
> Thank you very much for your explanation. In this case, if I had a
> "connection reset by peer" error, the situation becomes even stranger.

Well, what does the log of the other node say?
(Why) did it close the connection?
What did it log for this same incident?

Or does your "virtual network" behave in strange ways sometimes?
--> the message immediately preceding the incident:
> >>Sep 20 18:44:35 infplsm004<kern.info> kernel: VMCIUtil: Updating
> >>context id from 0x775d2835 to 0x775d2835 on event 0.

> Actually, I have two resources on this cluster r0 and r1 and I had
> the problem with r1 only. If it was a communication "hiccup",

Call it whatever you like, it still was a replication link interruption
for r1, while both are primary. That's why r1 detects the "split-brain".

> I'd have
> a problem with both resources simultaneously, but I didn't. Split
> brain was for r1 only. See my config file below:
>
> global {
>     usage-count no;
> }
> common {
>     protocol C;
> }
>
> resource r0 {
>     device /dev/drbd1;
>     disk /dev/sdb;
>     meta-disk internal;
>     net {
>         allow-two-primaries;
>         after-sb-0pri discard-zero-changes;
>         after-sb-1pri discard-secondary;
>         after-sb-2pri disconnect;
>         ping-timeout 20;
>     }
>     startup {
>         wfc-timeout 100;
>         degr-wfc-timeout 60;
>         become-primary-on both;
>     }
>     handlers {
>         split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>     }
>
>     on infplsm004 {
>         address 192.168.10.9:7789;
>     }
>     on infplsm005 {
>         address 192.168.10.10:7789;
>     }
> }
> resource r1 {
>     device /dev/drbd2;
>     disk /dev/sdc;
>     meta-disk internal;
>
>     # This is to allow dual primary mode.
>     # http://www.drbd.org/users-guide-emb/s-enable-dual-primary.html
>     net {
>         allow-two-primaries;
>         after-sb-0pri discard-zero-changes;
>         after-sb-1pri discard-secondary;
>         after-sb-2pri disconnect;
>         ping-timeout 20;
>     }
>     startup {
>         wfc-timeout 100;
>         degr-wfc-timeout 60;
>         become-primary-on both;
>     }
>     handlers {
>         split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>     }
>
>     on infplsm004 {
>         address 192.168.10.9:7790;
>     }
>     on infplsm005 {
>         address 192.168.10.10:7790;
>     }
> }
>
> Thank you,
> Ivan
>
>
> On 09/21/2011 10:15 PM, Lars Ellenberg wrote:
> >On Wed, Sep 21, 2011 at 10:08:42AM +1000, Ivan Pavlenko wrote:
> >>Hi All,
> >>
> >>Recently I had a split brain on my cluster. It was not a big
> >>issue, but I still haven't found the reason for this glitch. I got
> >>the following in my log file:
> >We call it a DRBD resource internal split brain, when you have a period
> >in time during which both nodes cannot communicate, _and_ both have
> >been Primary.
> >
> >Which means, whenever you run dual-primary DRBD, and have a hiccup on
> >the replication link, that causes a DRBD "split brain",
> >maybe better read that as "potential data-set divergence".
> >
> >>Sep 20 18:44:35 infplsm004<kern.info> kernel: VMCIUtil: Updating
> >>context id from 0x775d2835 to 0x775d2835 on event 0.
> >>Sep 20 18:44:35 infplsm004<kern.err> kernel: block drbd2:
> >>sock_recvmsg returned -104
> >>Sep 20 18:44:35 infplsm004<kern.info> kernel: block drbd2: peer(
> >>Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk(
> >>UpToDate -> DUnknown )
> >>Sep 20 18:44:35 infplsm004<kern.info> kernel: block drbd2: asender
> >>terminated
> >>Sep 20 18:44:35 infplsm004<kern.info> kernel: block drbd2:
> >>Terminating asender thread
> >>Sep 20 18:44:35 infplsm004<kern.err> kernel: block drbd2: short
> >>read expecting header on sock: r=-512
> >>Sep 20 18:44:35 infplsm004<kern.info> kernel: block drbd2: Creating
> >>new current UUID
> >>Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2:
> >>Connection closed
> >>Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: conn(
> >>NetworkFailure -> Unconnected )
> >>Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: receiver
> >>terminated
> >>Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2:
> >>Restarting receiver thread
> >>Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: receiver
> >>(re)started
> >>Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: conn(
> >>Unconnected -> WFConnection )
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2:
> >>Handshake successful: Agreed network protocol version 94
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: conn(
> >>WFConnection -> WFReportParams )
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: Starting
> >>asender thread (from drbd2_receiver [11360])
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2:
> >>data-integrity-alg:<not-used>
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2:
> >>drbd_sync_handshake:
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: self
> >>AD9C020C7BA6E149:51B8CD59E67A7227:01C987FB5F84C0D1:30241D96D32A31CF
> >>bits:1 flags:0
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: peer
> >>A2111F74640A099D:51B8CD59E67A7227:01C987FB5F84C0D0:30241D96D32A31CF
> >>bits:0 flags:0
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2:
> >>uuid_compare()=100 by rule 90
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: helper
> >>command: /sbin/drbdadm initial-split-brain minor-2
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: helper
> >>command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
> >>Sep 20 18:44:38 infplsm004<kern.alert> kernel: block drbd2:
> >>Split-Brain detected but unresolved, dropping connection!
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: helper
> >>command: /sbin/drbdadm split-brain minor-2
> >>Sep 20 18:44:38 infplsm004<kern.err> kernel: block drbd2: meta
> >>connection shut down by peer.
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: conn(
> >>WFReportParams -> NetworkFailure )
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: asender
> >>terminated
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2:
> >>Terminating asender thread
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: helper
> >>command: /sbin/drbdadm split-brain minor-2 exit code 0 (0x0)
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: conn(
> >>NetworkFailure -> Disconnecting )
> >>Sep 20 18:44:38 infplsm004<kern.err> kernel: block drbd2: error
> >>receiving ReportState, l: 4!
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2:
> >>Connection closed
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: conn(
> >>Disconnecting -> StandAlone )
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: receiver
> >>terminated
> >>Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2:
> >>Terminating receiver thread
> >>
> >>I'd like to draw your attention to the first two rows. The DRBD socket
> >>receive call returned code -104. What does it mean? Where can I get
> >>info about error codes?
> >These are typically normal negative errno codes,
> >on my box 104 would be ECONNRESET, Connection reset by peer.
> >
> >>Thank you in advance,
> >>Ivan
> >>
> >>_______________________________________________
> >>drbd-user mailing list
> >>drbd-user at lists.linbit.com
> >>http://lists.linbit.com/mailman/listinfo/drbd-user

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed
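
As a footnote to the errno explanation quoted above: negative values such as -104 in the kernel log can be looked up from userspace. What follows is only a minimal sketch in Python. The helper name decode_kernel_rc and the small KERNEL_INTERNAL table are assumptions made for illustration, not part of DRBD or the kernel. errno numbers are platform-specific, so it should be run on the same kind of box that produced the log.

```python
import errno
import os

# Kernel-internal codes that are never returned to userspace, so Python's
# errno module does not list them. 512 is ERESTARTSYS in Linux's
# include/linux/errno.h. Illustrative table only, not exhaustive.
KERNEL_INTERNAL = {512: "ERESTARTSYS"}


def decode_kernel_rc(rc):
    """Return (name, description) for a negative errno-style return code.

    decode_kernel_rc is a hypothetical helper, not a DRBD or kernel API.
    """
    if rc >= 0:
        return ("OK", "not an error (non-negative return value)")
    code = -rc
    if code in errno.errorcode:
        # Known userspace errno on this platform.
        return (errno.errorcode[code], os.strerror(code))
    if code in KERNEL_INTERNAL:
        return (KERNEL_INTERNAL[code], "kernel-internal code")
    return ("UNKNOWN", "no errno name known for %d" % code)


if __name__ == "__main__":
    # The two return values that appear in the quoted log.
    for rc in (-104, -512):
        print(rc, decode_kernel_rc(rc))
```

On common Linux architectures 104 should come back as ECONNRESET, in line with the reply above. The -512 in the "short read expecting header" line is ERESTARTSYS, a kernel-internal code that has no userspace errno name.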