Lars,
Thank you very much for your explanation. But if the error was
"connection reset by peer", the situation becomes even stranger.
I have two resources on this cluster, r0 and r1, and I had the
problem with r1 only. If it had been a communication "hiccup", I would
have had a problem with both resources simultaneously, but I didn't. The
split brain was for r1 only. See my config file below:
global {
    usage-count no;
}
common {
    protocol C;
}
resource r0 {
    device /dev/drbd1;
    disk /dev/sdb;
    meta-disk internal;
    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        ping-timeout 20;
    }
    startup {
        wfc-timeout 100;
        degr-wfc-timeout 60;
        become-primary-on both;
    }
    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }
    on infplsm004 {
        address 192.168.10.9:7789;
    }
    on infplsm005 {
        address 192.168.10.10:7789;
    }
}
resource r1 {
    device /dev/drbd2;
    disk /dev/sdc;
    meta-disk internal;
    # This is to allow dual primary mode.
    # http://www.drbd.org/users-guide-emb/s-enable-dual-primary.html
    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        ping-timeout 20;
    }
    startup {
        wfc-timeout 100;
        degr-wfc-timeout 60;
        become-primary-on both;
    }
    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }
    on infplsm004 {
        address 192.168.10.9:7790;
    }
    on infplsm005 {
        address 192.168.10.10:7790;
    }
}
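
As a reading aid for the net sections above, here is a minimal Python sketch of how the three after-sb-* options relate to the number of nodes that are Primary when a split brain is detected. The helper name applicable_policy and the dictionary are hypothetical, made up for illustration; this is not DRBD code, only a restatement of the settings in the config.

```python
# Hypothetical illustration, not DRBD code: the after-sb-* settings copied
# from the net sections of the config above.
AFTER_SB_POLICIES = {
    "after-sb-0pri": "discard-zero-changes",
    "after-sb-1pri": "discard-secondary",
    "after-sb-2pri": "disconnect",
}


def applicable_policy(primaries, policies=AFTER_SB_POLICIES):
    """Return (option, action) for the given number of Primary nodes."""
    if primaries not in (0, 1, 2):
        raise ValueError("a two-node resource has 0, 1 or 2 Primary nodes")
    option = f"after-sb-{primaries}pri"
    return option, policies[option]


if __name__ == "__main__":
    # Dual-primary setup: both nodes were Primary when the link dropped.
    option, action = applicable_policy(2)
    print(f"{option} -> {action}")
```

With both nodes Primary, the option consulted is after-sb-2pri, which is set to disconnect here. That is consistent with the "Split-Brain detected but unresolved, dropping connection!" message and the final StandAlone state in the log.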
Thank you,
Ivan
On 09/21/2011 10:15 PM, Lars Ellenberg wrote:
> On Wed, Sep 21, 2011 at 10:08:42AM +1000, Ivan Pavlenko wrote:
>> Hi All,
>>
>> Recently I had a split brain on my cluster. It was not a big
>> issue, but I still haven't found the cause of this glitch. I got the
>> following in my log file:
> We call it a DRBD resource-internal split brain when there is a period
> of time during which the two nodes cannot communicate, _and_ both have
> been Primary.
>
> Which means: whenever you run dual-primary DRBD and have a hiccup on
> the replication link, that causes a DRBD "split brain";
> maybe better read that as "potential data-set divergence".
>
>> Sep 20 18:44:35 infplsm004<kern.info> kernel: VMCIUtil: Updating
>> context id from 0x775d2835 to 0x775d2835 on event 0.
>> Sep 20 18:44:35 infplsm004<kern.err> kernel: block drbd2:
>> sock_recvmsg returned -104
>> Sep 20 18:44:35 infplsm004<kern.info> kernel: block drbd2: peer(
>> Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk(
>> UpToDate -> DUnknown )
>> Sep 20 18:44:35 infplsm004<kern.info> kernel: block drbd2: asender
>> terminated
>> Sep 20 18:44:35 infplsm004<kern.info> kernel: block drbd2:
>> Terminating asender thread
>> Sep 20 18:44:35 infplsm004<kern.err> kernel: block drbd2: short
>> read expecting header on sock: r=-512
>> Sep 20 18:44:35 infplsm004<kern.info> kernel: block drbd2: Creating
>> new current UUID
>> Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2:
>> Connection closed
>> Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: conn(
>> NetworkFailure -> Unconnected )
>> Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: receiver
>> terminated
>> Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2:
>> Restarting receiver thread
>> Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: receiver
>> (re)started
>> Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: conn(
>> Unconnected -> WFConnection )
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2:
>> Handshake successful: Agreed network protocol version 94
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: conn(
>> WFConnection -> WFReportParams )
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: Starting
>> asender thread (from drbd2_receiver [11360])
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2:
>> data-integrity-alg:<not-used>
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2:
>> drbd_sync_handshake:
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: self
>> AD9C020C7BA6E149:51B8CD59E67A7227:01C987FB5F84C0D1:30241D96D32A31CF
>> bits:1 flags:0
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: peer
>> A2111F74640A099D:51B8CD59E67A7227:01C987FB5F84C0D0:30241D96D32A31CF
>> bits:0 flags:0
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2:
>> uuid_compare()=100 by rule 90
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: helper
>> command: /sbin/drbdadm initial-split-brain minor-2
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: helper
>> command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
>> Sep 20 18:44:38 infplsm004<kern.alert> kernel: block drbd2:
>> Split-Brain detected but unresolved, dropping connection!
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: helper
>> command: /sbin/drbdadm split-brain minor-2
>> Sep 20 18:44:38 infplsm004<kern.err> kernel: block drbd2: meta
>> connection shut down by peer.
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: conn(
>> WFReportParams -> NetworkFailure )
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: asender
>> terminated
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2:
>> Terminating asender thread
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: helper
>> command: /sbin/drbdadm split-brain minor-2 exit code 0 (0x0)
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: conn(
>> NetworkFailure -> Disconnecting )
>> Sep 20 18:44:38 infplsm004<kern.err> kernel: block drbd2: error
>> receiving ReportState, l: 4!
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2:
>> Connection closed
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: conn(
>> Disconnecting -> StandAlone )
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2: receiver
>> terminated
>> Sep 20 18:44:38 infplsm004<kern.info> kernel: block drbd2:
>> Terminating receiver thread
>>
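
The state changes in the log excerpt above are easier to follow once the mail-wrapped lines are joined and the "name( Old -> New )" pairs are pulled out. The following is a small Python sketch of that. The function names unwrap and transitions are hypothetical, and the regular expression only assumes the message layout visible in this excerpt.

```python
import re

# Matches fragments such as "conn( Connected -> NetworkFailure )".
TRANSITION = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def unwrap(quoted_text):
    """Strip leading '>' quote markers and join wrapped lines into one string."""
    parts = []
    for line in quoted_text.splitlines():
        parts.append(re.sub(r"^(\s*>)+\s?", "", line).strip())
    return " ".join(p for p in parts if p)


def transitions(text, field="conn"):
    """Return the (old, new) state pairs logged for one field, in order."""
    return [(old, new)
            for name, old, new in TRANSITION.findall(text)
            if name == field]


# A few lines copied from the log excerpt, including the mail line wrapping.
SAMPLE = """\
>> Sep 20 18:44:35 infplsm004<kern.info> kernel: block drbd2: peer(
>> Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk(
>> UpToDate -> DUnknown )
>> Sep 20 18:44:36 infplsm004<kern.info> kernel: block drbd2: conn(
>> NetworkFailure -> Unconnected )
"""

if __name__ == "__main__":
    for old, new in transitions(unwrap(SAMPLE)):
        print(f"conn: {old} -> {new}")
```

Run over the whole excerpt, this lists the connection state path from Connected through NetworkFailure, Unconnected, WFConnection and WFReportParams to StandAlone.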
>> I'd like to draw your attention to the first two log entries: the DRBD
>> socket receive call returned code -104. What does it mean? Where can I
>> find information about these error codes?
> These are typically normal negative errno codes;
> on my box, 104 would be ECONNRESET, "Connection reset by peer".
>
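
One way to look up such codes without reading the kernel headers is Python's standard errno module. Below is a minimal sketch. The helper name decode is hypothetical, and the numeric values are platform dependent (104 is ECONNRESET on Linux). The -512 from the "short read" line is ERESTARTSYS, a Linux kernel-internal code that never reaches user space, so it is listed by hand.

```python
import errno
import os

# Kernel-internal codes are absent from Python's errno module; 512 is
# ERESTARTSYS in the Linux kernel headers.
KERNEL_INTERNAL = {512: "ERESTARTSYS"}


def decode(ret):
    """Return (symbolic name, description) for a negative kernel return value."""
    code = -ret
    if code in errno.errorcode:
        return errno.errorcode[code], os.strerror(code)
    return KERNEL_INTERNAL.get(code, "unknown"), ""


if __name__ == "__main__":
    # The two return values that appear in the log excerpt.
    for ret in (-104, -512):
        print(ret, decode(ret))
```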
>> Thank you in advance,
>> Ivan