[DRBD-user] 'Unable to bind sock' and strange error.

Thu Mar 22 23:14:17 CET 2007

On Thu, 22 Mar 2007, Lars Ellenberg wrote:

...

> > 0. What caused the transition from 'Primary' to 'Standalone' on
> >    'machineA'?
> 
> it has no network connection (B was down).
> so every few seconds it retries: it passively bind()s and listen()s,
> waiting for an active connect from the peer, then releases them again
> and actively connect()s to the peer itself.
> (this is waiting for connection, WFConnection).
> 
> for some reason, for one of these attempts, it gets a -98 (EADDRINUSE).
> so it quits trying to establish the connection.
> (this is no connection, no attempt to connect, so no peer: StandAlone).

Let me prefix the remainder of my response with a "thank-you": I was
pleasantly surprised by the quick response.

Regarding the explanation:

Aha!  That's what I thought, but I can't determine why it would
get EADDRINUSE at 3-something in the AM.
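
For reference, here is a minimal userspace sketch of how a bind() ends
up with EADDRINUSE (errno 98 on typical Linux systems, which is the
"-98" in the log). It is only an illustration: drbd does its socket
work inside the kernel, and the try_bind() helper and the loopback
address are assumptions made up for this example, not drbd code.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Hypothetical helper: try to bind() and listen() on host:port.

    Returns (socket, 0) on success or (None, errno) on failure.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        s.listen(1)
        return s, 0
    except OSError as e:
        s.close()
        return None, e.errno


# A first listener takes a free port chosen by the kernel (port 0).
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
first.listen(1)
port = first.getsockname()[1]

# A second bind() to the same address and port fails while the first
# socket still holds it.
sock, err = try_bind(port)
print(errno.errorcode.get(err, err))

first.close()
```

In other words, the error means that something still held the local
address and port at the moment of that one bind() attempt; the sketch
says nothing about what that something was in my case.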

> > 1. On 'machineA', if it was not the primary, why did it synchronize?
> 
> SyncSource does not imply Primary.
> Primary does not necessarily mean best and up-to-date data.
> 
> it syncs, because it knows it has the better (newer) data.

Aha. OK, that clears that up.

> > 2. Why didn't it move back from 'StandAlone' to 'Primary' when the
> >    determination had been made that its peer was definitely already in
> >    Secondary?
> 
> because "StandAlone" is a connection state,
> Primary is a node role.

I think I understand now.
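
To check my understanding, here is a small sketch that treats the
connection state and the node roles as independent fields. The input
format is just the "StandAlone Primary/Unknown" string quoted below;
the DrbdStatus type and parse_status() function are hypothetical names
for this example and are not part of any drbd tool.

```python
from typing import NamedTuple


class DrbdStatus(NamedTuple):
    """Hypothetical record: connection state and roles are separate."""
    connection: str   # e.g. "StandAlone", "WFConnection"
    local_role: str   # e.g. "Primary", "Secondary"
    peer_role: str    # e.g. "Secondary", "Unknown"


def parse_status(text):
    """Parse a string of the form '<connection> <local>/<peer>'."""
    connection, roles = text.split()
    local_role, peer_role = roles.split("/")
    return DrbdStatus(connection, local_role, peer_role)


status = parse_status("StandAlone Primary/Unknown")

# The node can be Primary while the connection state is StandAlone:
# one field does not replace or imply the other.
print(status.connection, status.local_role, status.peer_role)
```

So asking why it did not go from 'StandAlone' back to 'Primary' mixes
up two different fields, which is where my question 2 went wrong.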

> it has been "StandAlone Primary/Unknown".
> you "restart" it. I read this as
>   /etc/init.d/drbd stop; /etc/init.d/drbd start

More or less, that is correct.

> stop makes it secondary (otherwise it could not unconfigure it),
> then unconfigures it.
> 
> start configures it.
> 
> it does not, however, promote to primary.
> that would be the job of a cluster manager.
> you do have a cluster manager, don't you?

No, I don't have a traditional cluster set up here. I'm doing some
testing, and probably doing some things that fall outside the usual uses
of drbd (no heartbeat, for example).  In this case, I've got what is
essentially an opportunistic network raid1: when machineB comes up
(infrequently) it gets a copy of only what has changed, and once it is up
to date it is a convenient mirror. If machineA's backing store ever dies
(it actually did, less than a week ago), I can replace it and bring it and
machineB up; machineB will sync back to machineA, and then I can
reset the drbd on machineA to primary, at which point I can use it again.
I know, it's probably a little weird.

All that really remains is to try to understand why drbd failed a bind
call at 3:29:01. I can guarantee that no cron job ran and that the machine
was otherwise idle. I have scoured the logs for anything that might be
relevant and found, unfortunately, nothing: drbd is the only thing logged
for several minutes in either direction.

--
Jon Nelson <jnelson-drbd at jamponi.net>