[DRBD-user] List of drbd socket errors

Ivan Pavlenko i.pavlenko at unsw.edu.au
Wed Sep 21 23:23:07 CEST 2011


Lars,

Thank you very much for your explanation. If the error was
"connection reset by peer", the situation becomes even stranger.
I have two resources on this cluster, r0 and r1, and only r1 had the
problem. If it had been a communication "hiccup", I would expect both
resources to fail at the same time, but they didn't: the split brain
affected r1 only. See my config file below:

global {
   usage-count no;
}
common {
   protocol C;
}

resource r0 {
   device    /dev/drbd1;
   disk      /dev/sdb;
   meta-disk internal;
   net {
     allow-two-primaries;
     after-sb-0pri discard-zero-changes;
     after-sb-1pri discard-secondary;
     after-sb-2pri disconnect;
     ping-timeout 20;
   }
   startup {
     wfc-timeout 100;
     degr-wfc-timeout 60;
     become-primary-on both;
   }
   handlers {
     split-brain "/usr/lib/drbd/notify-split-brain.sh root";
   }

   on infplsm004 {
     address   192.168.10.9:7789;
   }
   on infplsm005 {
     address   192.168.10.10:7789;
   }
}
resource r1 {
   device    /dev/drbd2;
   disk      /dev/sdc;
   meta-disk internal;

   # This is to allow dual primary mode.
   # http://www.drbd.org/users-guide-emb/s-enable-dual-primary.html
   net {
     allow-two-primaries;
     after-sb-0pri discard-zero-changes;
     after-sb-1pri discard-secondary;
     after-sb-2pri disconnect;
     ping-timeout 20;
   }
   startup {
     wfc-timeout 100;
     degr-wfc-timeout 60;
     become-primary-on both;
   }
   handlers {
     split-brain "/usr/lib/drbd/notify-split-brain.sh root";
   }

   on infplsm004 {
     address   192.168.10.9:7790;
   }
   on infplsm005 {
     address   192.168.10.10:7790;
   }
}

Thank you,
Ivan


On 09/21/2011 10:15 PM, Lars Ellenberg wrote:
> On Wed, Sep 21, 2011 at 10:08:42AM +1000, Ivan Pavlenko wrote:
>> Hi All,
>>
>> Recently I had a split brain on my cluster. It was not a big
>> issue, but I still haven't found the reason for this glitch. I got
>> the following in my log file:
> We call it a DRBD resource internal split brain, when you have a period
> in time during which both nodes can not communicate, _and_ both have
> been Primary.
>
> Which means that whenever you run dual-primary DRBD and have a hiccup
> on the replication link, that causes a DRBD "split brain";
> maybe better read that as "potential data-set divergence".
>
>> Sep 20 18:44:35 infplsm004<kern.info>  kernel: VMCIUtil: Updating
>> context id from 0x775d2835 to 0x775d2835 on event 0.
>> Sep 20 18:44:35 infplsm004<kern.err>  kernel: block drbd2:
>> sock_recvmsg returned -104
>> Sep 20 18:44:35 infplsm004<kern.info>  kernel: block drbd2: peer(
>> Primary ->  Unknown ) conn( Connected ->  NetworkFailure ) pdsk(
>> UpToDate ->  DUnknown )
>> Sep 20 18:44:35 infplsm004<kern.info>  kernel: block drbd2: asender
>> terminated
>> Sep 20 18:44:35 infplsm004<kern.info>  kernel: block drbd2:
>> Terminating asender thread
>> Sep 20 18:44:35 infplsm004<kern.err>  kernel: block drbd2: short
>> read expecting header on sock: r=-512
>> Sep 20 18:44:35 infplsm004<kern.info>  kernel: block drbd2: Creating
>> new current UUID
>> Sep 20 18:44:36 infplsm004<kern.info>  kernel: block drbd2:
>> Connection closed
>> Sep 20 18:44:36 infplsm004<kern.info>  kernel: block drbd2: conn(
>> NetworkFailure ->  Unconnected )
>> Sep 20 18:44:36 infplsm004<kern.info>  kernel: block drbd2: receiver
>> terminated
>> Sep 20 18:44:36 infplsm004<kern.info>  kernel: block drbd2:
>> Restarting receiver thread
>> Sep 20 18:44:36 infplsm004<kern.info>  kernel: block drbd2: receiver
>> (re)started
>> Sep 20 18:44:36 infplsm004<kern.info>  kernel: block drbd2: conn(
>> Unconnected ->  WFConnection )
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2:
>> Handshake successful: Agreed network protocol version 94
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: conn(
>> WFConnection ->  WFReportParams )
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: Starting
>> asender thread (from drbd2_receiver [11360])
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2:
>> data-integrity-alg:<not-used>
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2:
>> drbd_sync_handshake:
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: self
>> AD9C020C7BA6E149:51B8CD59E67A7227:01C987FB5F84C0D1:30241D96D32A31CF
>> bits:1 flags:0
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: peer
>> A2111F74640A099D:51B8CD59E67A7227:01C987FB5F84C0D0:30241D96D32A31CF
>> bits:0 flags:0
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2:
>> uuid_compare()=100 by rule 90
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: helper
>> command: /sbin/drbdadm initial-split-brain minor-2
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: helper
>> command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
>> Sep 20 18:44:38 infplsm004<kern.alert>  kernel: block drbd2:
>> Split-Brain detected but unresolved, dropping connection!
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: helper
>> command: /sbin/drbdadm split-brain minor-2
>> Sep 20 18:44:38 infplsm004<kern.err>  kernel: block drbd2: meta
>> connection shut down by peer.
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: conn(
>> WFReportParams ->  NetworkFailure )
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: asender
>> terminated
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2:
>> Terminating asender thread
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: helper
>> command: /sbin/drbdadm split-brain minor-2 exit code 0 (0x0)
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: conn(
>> NetworkFailure ->  Disconnecting )
>> Sep 20 18:44:38 infplsm004<kern.err>  kernel: block drbd2: error
>> receiving ReportState, l: 4!
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2:
>> Connection closed
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: conn(
>> Disconnecting ->  StandAlone )
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: receiver
>> terminated
>> Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2:
>> Terminating receiver thread
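
As a rough cross-check of the drbd_sync_handshake lines in the log above, the two UUID sets can be compared by hand. The sketch below is only an illustration, not DRBD's own uuid_compare(): it assumes the fields are ordered current:bitmap:history:history and that the lowest bit of each value is a flag to be masked out. The names parse_uuids and looks_diverged are made up for this example.

```python
# UUID strings copied from the "self" and "peer" log lines above.
SELF = "AD9C020C7BA6E149:51B8CD59E67A7227:01C987FB5F84C0D1:30241D96D32A31CF"
PEER = "A2111F74640A099D:51B8CD59E67A7227:01C987FB5F84C0D0:30241D96D32A31CF"


def parse_uuids(line):
    """Split a colon-separated UUID line into ints.

    Assumption: the lowest bit is a flag, so it is masked out before comparing.
    """
    return [int(field, 16) & ~1 for field in line.split(":")]


def looks_diverged(self_line, peer_line):
    """Return True if the current UUIDs differ but the bitmap UUIDs match.

    Assumption: field 0 is the current UUID and field 1 is the bitmap UUID.
    """
    self_current, self_bitmap = parse_uuids(self_line)[:2]
    peer_current, peer_bitmap = parse_uuids(peer_line)[:2]
    return (
        self_current != peer_current
        and self_bitmap == peer_bitmap
        and self_bitmap != 0
    )


print(looks_diverged(SELF, PEER))
```

Under those assumptions, both nodes share the same second field but carry different first fields, which fits the "Creating new current UUID" message logged at 18:44:35 and the later "Split-Brain detected" alert.
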
>>
>> I'd like to draw your attention to the first two log entries. The
>> DRBD socket receive call returned code -104. What does it mean?
>> Where can I find information about these error codes?
> These are typically normal negative errno codes,
> on my box 104 would be ECONNRESET, Connection reset by peer.
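
One way to look up such a code on your own machine is Python's standard errno module. This is a minimal sketch, assuming it runs on the same kind of Linux box that produced the log, since errno numbers differ between platforms. The helper name describe_kernel_rc is made up for this example. Kernel-internal codes such as the -512 seen later in the log have no userspace errno entry, so the helper reports them as unknown.

```python
import errno
import os


def describe_kernel_rc(rc):
    """Return (symbolic name, message) for a negative errno-style return code."""
    if rc >= 0:
        # Non-negative values are not errors (for a receive call, a byte count).
        return ("OK", "not an error")
    code = -rc
    name = errno.errorcode.get(code, "UNKNOWN")
    if name == "UNKNOWN":
        # Kernel-internal codes are not listed in the userspace errno table.
        return (name, "no userspace errno for %d" % code)
    return (name, os.strerror(code))


# The two return codes that appear in the log above.
print(describe_kernel_rc(-104))
print(describe_kernel_rc(-512))
```

The symbolic names can also be found in the errno(3) manual page or in the kernel's errno header files.
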
>
>> Thank you in advance,
>> Ivan


