[DRBD-user] 'Unable to bind sock' and strange error.

Lars Ellenberg lars.ellenberg at linbit.com
Thu Mar 22 22:22:19 CET 2007



On Thu, Mar 22, 2007 at 03:57:57PM -0500, Jon Nelson wrote:
> 
> I had something strange happen in my drbd testing environment.
> The machine 'machineA' is the primary and 'machineB' is the secondary.
> 'machineA' is up 24/7 and 'machineB' is on and off throughout the day.
> 
> However, the problem I encountered I can't really explain.  I am not
> aware of any process that goes off at 3:29 that would cause this. The
> machine did not produce this error any day prior to today, and has run
> with an un-changed configuration for a while.
> 
> First, the logs from the primary (with comments intermingled):
> 
> Mar 22 03:29:01 machineA kernel: drbd2: Unable to bind sock2 (-98)
> Mar 22 03:29:01 machineA kernel: drbd2: drbd2_receiver [17538]: cstate WFConnection --> Unconnected
> Mar 22 03:29:01 machineA kernel: drbd2: worker terminated
> Mar 22 03:29:01 machineA kernel: drbd2: drbd2_receiver [17538]: cstate Unconnected --> Unconnected
> Mar 22 03:29:01 machineA kernel: drbd2: Connection lost.
> Mar 22 03:29:01 machineA kernel: drbd2: Discarding network configuration.
> Mar 22 03:29:01 machineA kernel: drbd2: drbd2_receiver [17538]: cstate Unconnected --> StandAlone
> Mar 22 03:29:01 machineA kernel: drbd2: receiver terminated
> 
> Around 9:00 the secondary ('machineB') came up but nothing happened on the
> primary (this is unexpected, the primary should have had data to
> synchronize).
> 
> At 9:09:35 I restart drbd on the primary ('machineA'):
> 
> Mar 22 09:09:35 machineA kernel: drbd2: Primary/Unknown --> Secondary/Unknown
> Mar 22 09:09:35 machineA kernel: drbd2: drbdsetup [14092]: cstate StandAlone --> Unconnected
> Mar 22 09:09:35 machineA kernel: drbd2: drbdsetup [14092]: cstate Unconnected --> StandAlone
> Mar 22 09:09:35 machineA kernel: drbd2: drbdsetup [14092]: cstate StandAlone --> Unconfigured
> Mar 22 09:09:35 machineA kernel: drbd2: worker terminated
> Mar 22 09:09:37 machineA kernel: drbd2: resync bitmap: bits=7340032 words=229376
> Mar 22 09:09:37 machineA kernel: drbd2: size = 28 GB (29360128 KB)
> Mar 22 09:09:37 machineA kernel: drbd2: 243 MB marked out-of-sync by on disk bit-map.
> Mar 22 09:09:37 machineA kernel: drbd2: Found 6 transactions (324 active extents) in activity log.
> Mar 22 09:09:37 machineA kernel: drbd2: drbdsetup [14136]: cstate Unconfigured --> StandAlone
> Mar 22 09:09:37 machineA kernel: drbd2: drbdsetup [14142]: cstate StandAlone --> Unconnected
> Mar 22 09:09:37 machineA kernel: drbd2: drbd2_receiver [14143]: cstate Unconnected --> WFConnection
> Mar 22 09:09:37 machineA kernel: drbd2: drbd2_receiver [14143]: cstate WFConnection --> WFReportParams
> Mar 22 09:09:37 machineA kernel: drbd2: Handshake successful: DRBD Network Protocol version 74
> Mar 22 09:09:37 machineA kernel: drbd2: Connection established.
> Mar 22 09:09:37 machineA kernel: drbd2: I am(S): 1:00000007:00000001:0000001d:00000004:00
> Mar 22 09:09:37 machineA kernel: drbd2: Peer(S): 1:00000007:00000001:0000001b:00000004:00
> Mar 22 09:09:37 machineA kernel: drbd2: drbd2_receiver [14143]: cstate WFReportParams --> WFBitMapS
> Mar 22 09:09:37 machineA kernel: drbd2: Secondary/Unknown --> Secondary/Secondary
> Mar 22 09:09:37 machineA kernel: drbd2: drbd2_receiver [14143]: cstate WFBitMapS --> SyncSource
> Mar 22 09:09:37 machineA kernel: drbd2: Resync started as SyncSource (need to sync 249476 KB [62369 bits set]).
> Mar 22 09:09:49 machineA kernel: drbd2: Resync done (total 11 sec; paused 0 sec; 22676 K/sec)
> Mar 22 09:09:49 machineA kernel: drbd2: drbd2_worker [14137]: cstate SyncSource --> Connected
> 
>   and then I had to tell it that it was, in fact, the primary again:
> 
> Mar 22 09:10:30 machineA kernel: drbd2: Secondary/Secondary --> Primary/Secondary
> 
> Here are the logs from the secondary:
> 
> Mar 22 09:01:23 machineB kernel: drbd: initialised. Version: 0.7.22 (api:79/proto:74)
> Mar 22 09:01:23 machineB kernel: drbd: SVN Revision: 2554 build by lmb at dale, 2006-10-30 22:52:11
> Mar 22 09:01:23 machineB kernel: drbd: registered as block device major 147
> Mar 22 09:01:24 machineB kernel: drbd0: resync bitmap: bits=7340032 words=229376
> Mar 22 09:01:24 machineB kernel: drbd0: size = 28 GB (29360128 KB)
> Mar 22 09:01:24 machineB kernel: drbd0: 0 KB marked out-of-sync by on disk bit-map.
> Mar 22 09:01:24 machineB kernel: drbd0: No usable activity log found.
> Mar 22 09:01:24 machineB kernel: drbd0: drbdsetup [3564]: cstate Unconfigured --> StandAlone
> Mar 22 09:01:24 machineB kernel: drbd0: drbdsetup [3592]: cstate StandAlone --> Unconnected
> Mar 22 09:01:24 machineB kernel: drbd0: drbd0_receiver [3593]: cstate Unconnected --> WFConnection
> 
>    Here I manually restart the drbd on the primary ('machineA').
> 
> Mar 22 09:09:37 machineB kernel: drbd0: drbd0_receiver [3593]: cstate WFConnection --> WFReportParams
> Mar 22 09:09:37 machineB kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
> Mar 22 09:09:37 machineB kernel: drbd0: Connection established.
> Mar 22 09:09:37 machineB kernel: drbd0: I am(S): 1:00000007:00000001:0000001b:00000004:00
> Mar 22 09:09:37 machineB kernel: drbd0: Peer(S): 1:00000007:00000001:0000001d:00000004:00
> Mar 22 09:09:37 machineB kernel: drbd0: drbd0_receiver [3593]: cstate WFReportParams --> WFBitMapT
> Mar 22 09:09:37 machineB kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
> Mar 22 09:09:37 machineB kernel: drbd0: drbd0_receiver [3593]: cstate WFBitMapT --> SyncTarget
> Mar 22 09:09:37 machineB kernel: drbd0: Resync started as SyncTarget (need to sync 249476 KB [62369 bits set]).
> Mar 22 09:09:49 machineB kernel: drbd0: Resync done (total 11 sec; paused 0 sec; 22676 K/sec)
> Mar 22 09:09:49 machineB kernel: drbd0: drbd0_worker [3580]: cstate SyncTarget --> Connected
> Mar 22 09:10:30 machineB kernel: drbd0: Secondary/Secondary --> Secondary/Primary
> 
> Questions:
> 
> 0. What caused the transition from 'Primary' to 'Standalone' on
>    'machineA'?

it has no network connection (B was down).
so every few seconds it retries: it passively bind()s and listen()s,
waiting for an active connect from the peer, then releases the socket
again and actively connect()s to the peer itself.
(this is waiting for connection: WFConnection).

for some reason, on one of these attempts, bind() fails with -98 (EADDRINUSE).
so it gives up trying to establish the connection.
(no connection and no attempt to connect, so no peer: StandAlone).
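
to illustrate where an EADDRINUSE comes from, here is a minimal userspace
sketch in Python. it is not DRBD's kernel code; the helper name try_bind and
the loopback address are made up for this example. on Linux,
errno.EADDRINUSE is 98, which the kernel log prints negated as -98.

```python
import errno
import socket


def try_bind(host, port):
    """Hypothetical helper: bind and listen on (host, port).

    Returns (socket, 0) on success, or (None, errno) if bind()/listen() fails.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        s.listen(1)
        return s, 0
    except OSError as e:
        s.close()
        return None, e.errno


# First listener: port 0 lets the OS pick a free port.
first, err1 = try_bind("127.0.0.1", 0)
port = first.getsockname()[1]

# Second bind to the same address and port while the first is still open.
second, err2 = try_bind("127.0.0.1", port)

print("second bind failed with EADDRINUSE:", err2 == errno.EADDRINUSE)

first.close()
```

the second bind fails because something still holds that address and port.
what held the port in the case above is not visible from the logs.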

> 1. On 'machineA', if it was not the primary, why did it synchronize?

SyncSource does not imply Primary.
Primary does not necessarily mean best and up-to-date data.

it syncs, because it knows it has the better (newer) data.

> 2. Why didn't it move back from 'StandAlone' to 'Primary' when the
>    determination had been made that its peer was definitely already in
>    Secondary?

because "StandAlone" is a connection state,
Primary is a node role.

it has been "StandAlone Primary/Unknown".
you "restart" it. I read this as
  /etc/init.d/drbd stop; /etc/init.d/drbd start

stop makes it secondary (otherwise it could not unconfigure it),
then unconfigures it.

start configures it.

it does not, however, promote to primary.
that would be the job of a cluster manager.
you do have a cluster manager, don't you?
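
as a rough illustration only, the restart sequence can be modelled like
this in Python. the Resource class and its method names are hypothetical,
not any DRBD interface; the point is just that connection state and role are
separate fields, and that start never touches the role.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Toy model: connection state and node role are independent fields."""
    cstate: str = "Unconfigured"
    role: str = "Secondary"

    def stop(self):
        # stop demotes first (a Primary cannot be unconfigured), then unconfigures.
        self.role = "Secondary"
        self.cstate = "Unconfigured"

    def start(self):
        # start only configures the device and begins waiting for the peer.
        # It does not change the role.
        self.cstate = "WFConnection"

    def promote(self):
        # Promotion is a separate step, normally done by a cluster manager.
        self.role = "Primary"


r = Resource(cstate="StandAlone", role="Primary")
r.stop()
r.start()
print(r.cstate, r.role)
```

after the restart the model is Secondary until something calls promote(),
which matches the manual promotion at 09:10:30 in the logs above.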

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
__
please use the "List-Reply" function of your email client.


