On Wed, Sep 21, 2011 at 10:08:42AM +1000, Ivan Pavlenko wrote:
> Hi All,
>
> Recently I had a split brain on my cluster. It was not a big
> issue, but I still haven't found the reason for this glitch. I got
> the following in my log file:

We call it a DRBD resource-internal split brain when there is a period
of time during which the two nodes cannot communicate _and_ both have
been Primary.

Which means: whenever you run dual-primary DRBD and have a hiccup on the
replication link, that causes a DRBD "split brain". Maybe better read
that as "potential data-set divergence".

> Sep 20 18:44:35 infplsm004 <kern.info> kernel: VMCIUtil: Updating context id from 0x775d2835 to 0x775d2835 on event 0.
> Sep 20 18:44:35 infplsm004 <kern.err> kernel: block drbd2: sock_recvmsg returned -104
> Sep 20 18:44:35 infplsm004 <kern.info> kernel: block drbd2: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Sep 20 18:44:35 infplsm004 <kern.info> kernel: block drbd2: asender terminated
> Sep 20 18:44:35 infplsm004 <kern.info> kernel: block drbd2: Terminating asender thread
> Sep 20 18:44:35 infplsm004 <kern.err> kernel: block drbd2: short read expecting header on sock: r=-512
> Sep 20 18:44:35 infplsm004 <kern.info> kernel: block drbd2: Creating new current UUID
> Sep 20 18:44:36 infplsm004 <kern.info> kernel: block drbd2: Connection closed
> Sep 20 18:44:36 infplsm004 <kern.info> kernel: block drbd2: conn( NetworkFailure -> Unconnected )
> Sep 20 18:44:36 infplsm004 <kern.info> kernel: block drbd2: receiver terminated
> Sep 20 18:44:36 infplsm004 <kern.info> kernel: block drbd2: Restarting receiver thread
> Sep 20 18:44:36 infplsm004 <kern.info> kernel: block drbd2: receiver (re)started
> Sep 20 18:44:36 infplsm004 <kern.info> kernel: block drbd2: conn( Unconnected -> WFConnection )
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: Handshake successful: Agreed network protocol version 94
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: conn( WFConnection -> WFReportParams )
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: Starting asender thread (from drbd2_receiver [11360])
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: data-integrity-alg: <not-used>
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: drbd_sync_handshake:
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: self AD9C020C7BA6E149:51B8CD59E67A7227:01C987FB5F84C0D1:30241D96D32A31CF bits:1 flags:0
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: peer A2111F74640A099D:51B8CD59E67A7227:01C987FB5F84C0D0:30241D96D32A31CF bits:0 flags:0
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: uuid_compare()=100 by rule 90
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
> Sep 20 18:44:38 infplsm004 <kern.alert> kernel: block drbd2: Split-Brain detected but unresolved, dropping connection!
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: helper command: /sbin/drbdadm split-brain minor-2
> Sep 20 18:44:38 infplsm004 <kern.err> kernel: block drbd2: meta connection shut down by peer.
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: conn( WFReportParams -> NetworkFailure )
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: asender terminated
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: Terminating asender thread
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: helper command: /sbin/drbdadm split-brain minor-2 exit code 0 (0x0)
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: conn( NetworkFailure -> Disconnecting )
> Sep 20 18:44:38 infplsm004 <kern.err> kernel: block drbd2: error receiving ReportState, l: 4!
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: Connection closed
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: conn( Disconnecting -> StandAlone )
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: receiver terminated
> Sep 20 18:44:38 infplsm004 <kern.info> kernel: block drbd2: Terminating receiver thread
>
> I'd like to draw your attention to the first two rows. DRBD's
> sock_recvmsg returned code -104. What does it mean? Where can I get
> info about error codes?

These are typically ordinary negative errno codes; on my box, 104 would
be ECONNRESET, "Connection reset by peer".

>
> Thank you in advance,
> Ivan

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
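To decode such return values yourself, here is a minimal Python 3.9+ sketch. It assumes you run it on the same kind of system that wrote the log, since errno numbers are platform specific (104 is ECONNRESET on Linux, but not everywhere). The -512 in the "short read" line is not a userspace errno at all; in Linux's include/linux/errno.h it is ERESTARTSYS, a kernel-internal code, so strerror() has no name for it. The helper name decode_kernel_rc and the small KERNEL_INTERNAL table are hypothetical, for illustration only.

```python
import errno
import os

# Hypothetical table of kernel-internal codes (include/linux/errno.h) that
# never reach userspace, so Python's errno module has no name for them.
# Only the one seen in the log is listed here.
KERNEL_INTERNAL = {512: "ERESTARTSYS"}


def decode_kernel_rc(value: int) -> tuple[str, str]:
    """Map a logged return value such as the -104 in
    'sock_recvmsg returned -104' to (symbolic name, message),
    using this machine's errno numbering."""
    code = abs(value)  # kernel functions return -errno on failure
    if code in errno.errorcode:
        return errno.errorcode[code], os.strerror(code)
    if code in KERNEL_INTERNAL:
        return KERNEL_INTERNAL[code], "kernel-internal code, not a userspace errno"
    return "UNKNOWN", os.strerror(code)


if __name__ == "__main__":
    for rc in (-104, -512):
        name, message = decode_kernel_rc(rc)
        print(f"{rc}: {name} ({message})")
```

On a Linux box the first line of output names ECONNRESET; on other systems the number may map to a different name or to none, which is why the lookup goes through the local errno module instead of a hard-coded table.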
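The "both have been Primary while disconnected" condition is visible in the handshake lines of the log: both sides report the same bitmap UUID (51B8CD59E67A7227) but different current UUIDs, and this node logs "Creating new current UUID" right after the link drops; the differing current UUIDs suggest the peer did the same. The kernel then reports uuid_compare()=100 by rule 90. The Python 3.9+ sketch below checks only that one condition, on the UUID strings copied from the log. The function names are hypothetical, and ignoring the lowest UUID bit is an assumption modeled on DRBD 8.3's drbd_uuid_compare(); the real function evaluates a longer sequence of rules before it reaches rule 90, so treat this as an illustration, not a reimplementation.

```python
def parse_uuid_field(field: str) -> list[int]:
    """Split 'CURRENT:BITMAP:HISTORY1:HISTORY2' (as printed after
    'self'/'peer' in drbd_sync_handshake) into 64-bit integers."""
    return [int(part, 16) for part in field.split(":")]


def looks_like_rule_90_split_brain(local: list[int], peer: list[int]) -> bool:
    """Hypothetical check: index 0 is the current UUID, index 1 the bitmap UUID.

    Assumption: the lowest bit is a flag and is masked off before comparing.
    """
    local_current, local_bitmap = local[0] & ~1, local[1] & ~1
    peer_current, peer_bitmap = peer[0] & ~1, peer[1] & ~1
    # Same current UUID: no new data generation on either side.
    if local_current == peer_current:
        return False
    # Same nonzero bitmap UUID but different current UUIDs: both sides
    # started a new generation from the same point while disconnected.
    return local_bitmap == peer_bitmap and local_bitmap != 0


if __name__ == "__main__":
    local = parse_uuid_field("AD9C020C7BA6E149:51B8CD59E67A7227:01C987FB5F84C0D1:30241D96D32A31CF")
    peer = parse_uuid_field("A2111F74640A099D:51B8CD59E67A7227:01C987FB5F84C0D0:30241D96D32A31CF")
    print(looks_like_rule_90_split_brain(local, peer))  # True for these UUIDs
```

With a single Primary, only one side would have moved on to a new current UUID, and the peer's current UUID would typically still match the other side's bitmap UUID, which an earlier rule treats as an ordinary resync rather than a split brain.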