[DRBD-user] 'Unable to bind sock' and strange error.
Jon Nelson
jnelson-drbd at jamponi.net
Thu Mar 22 23:14:17 CET 2007
On Thu, 22 Mar 2007, Lars Ellenberg wrote:
...
> > 0. What caused the transition from 'Primary' to 'Standalone' on
> > 'machineA'?
>
> it has no network connection (B was down).
> so it retries every few seconds to
> passively bind() and listen(), waiting for an active connect of the peer,
> then to release them again, and actively connect() to the peer itself.
> (this is waiting for connection, WFConnection).
>
> for some reason, for one of these attempts, it gets a -98 (EADDRINUSE).
> so it quits trying to establish the connection.
> (this is no connection, no attempt to connect, so no peer: StandAlone).
Let me prefix the remainder of my response with a "thank-you": I was
pleasantly surprised by the quick response.
Regarding the explanation:
Aha! That's what I thought, but I can't determine why it would
get EADDRINUSE at 3-something in the AM.
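For reference, the errno itself is easy to reproduce in user space. The following is a minimal Python sketch, not DRBD code (DRBD does its socket handling in the kernel). It shows a second bind() to an address that already has a listener failing with EADDRINUSE, which is 98 on Linux. The loopback address, the OS-chosen port, and the helper name bind_errno are assumptions made purely for illustration.

```python
import errno
import socket

def bind_errno(addr):
    """Try to bind and listen on addr; return 0 on success, else the errno."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(addr)
        s.listen(1)
        return 0
    except OSError as e:
        return e.errno
    finally:
        s.close()

# First listener: let the OS pick a free port on loopback (illustrative).
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))
holder.listen(1)
addr = holder.getsockname()

# Second bind to the same address while the first socket still holds it.
result = bind_errno(addr)
print(result == errno.EADDRINUSE)

holder.close()
```

This only demonstrates what the error means. It does not explain which socket was holding the address in the case described here.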
> > 1. On 'machineA', if it was not the primary, why did it synchronize?
>
> SyncSource does not imply Primary.
> Primary does not necessarily mean best and up-to-date data.
>
> it syncs, because it knows it has the better (newer) data.
Aha. OK, that clears that up.
> > 2. Why didn't it move back from 'StandAlone' to 'Primary' when the
> > determination had been made that its peer was definitely already in
> > Secondary?
>
> because "StandAlone" is a connection state,
> Primary is a node role.
I think I understand now.
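A minimal Python sketch of that distinction, for illustration only: connection state and node role are two independent fields, and a status line combines them. The class and function names are hypothetical, and only the state and role names that appear in this thread are listed; this is not DRBD's actual data structure.

```python
from enum import Enum

class ConnState(Enum):
    # Connection states mentioned in this thread (not a complete list).
    STANDALONE = "StandAlone"
    WF_CONNECTION = "WFConnection"
    SYNC_SOURCE = "SyncSource"

class Role(Enum):
    # Node roles mentioned in this thread.
    PRIMARY = "Primary"
    SECONDARY = "Secondary"
    UNKNOWN = "Unknown"

def describe(cstate, local, peer):
    """Format as '<connection state> <local role>/<peer role>'."""
    return f"{cstate.value} {local.value}/{peer.value}"

print(describe(ConnState.STANDALONE, Role.PRIMARY, Role.UNKNOWN))
# prints: StandAlone Primary/Unknown
```

Because the two fields are separate, a question like "why didn't it go from StandAlone to Primary" mixes one axis with the other: a node can be StandAlone and Primary at the same time.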
> it has been "StandAlone Primary/Unknown".
> you "restart" it. I read this as
> /etc/init.d/drbd stop; /etc/init.d/drbd start
More or less, that is correct.
> stop makes it secondary (otherwise it could not unconfigure it),
> then unconfigures it.
>
> start configures it.
>
> it does not, however, promote to primary.
> that would be the job of a cluster manager.
> you do have a cluster manager, don't you?
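The stop/start sequence described above can be sketched as a toy model in Python. This is only an illustration of the described behaviour, not how the init script is implemented; the Resource class, its method names, and the guard in promote() are assumptions.

```python
class Resource:
    """Toy model of one DRBD resource on one node (illustrative only)."""

    def __init__(self):
        self.configured = True
        self.role = "Primary"

    def stop(self):
        # stop demotes first, since a Primary could not be unconfigured,
        # then unconfigures.
        self.role = "Secondary"
        self.configured = False

    def start(self):
        # start only configures; it does not promote.
        self.configured = True

    def promote(self):
        # Promotion is left to a cluster manager (or to the admin).
        # The configured check is an assumption of this sketch.
        if not self.configured:
            raise RuntimeError("cannot promote an unconfigured resource")
        self.role = "Primary"

r = Resource()
r.stop()
r.start()
print(r.role)  # prints: Secondary
```

After a restart the resource is configured but Secondary, and it stays that way until something explicitly promotes it.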
No, I don't have a traditional cluster set up here. I'm doing some
testing, and probably doing some things that aren't among the usual uses
of drbd (no heartbeat, for example). In this case, I've got what is
essentially an opportunistic network raid1 - when machineB comes up
(infrequently) it gets a copy of only what has changed and when up to
date it is a convenient mirror. If machineA's backing store ever died
(actually it did less than a week ago), I can replace it and bring it and
machineB up; machineB will sync back to machineA, and then I can reset
the drbd on machineA as primary, at which point I can use it again.
I know, it's probably a little weird.
All that really remains is to try to understand why drbd failed a bind
call at 3:29:01. I can guarantee that no cron job ran and the machine
was otherwise idle, and have otherwise scoured the logs for anything
that might be relevant and found, unfortunately, nothing - drbd is the
only thing logged for several minutes in either direction.
--
Jon Nelson <jnelson-drbd at jamponi.net>