On Thu, 22 Mar 2007, Lars Ellenberg wrote:
...
> > 0. What caused the transition from 'Primary' to 'Standalone' on
> > 'machineA'?
>
> it has no network connection (B was down).
> so it retries every some seconds to
> passively bind() and listen() waiting for an active connect of the peer,
> then to release them again, and actively connect() to the peer itself.
> (this is waiting for connection, WFConnection).
>
> for some reason, for one of these attempts, it gets a -98 (EADDRINUSE).
> so it quits trying to establish the connection.
> (this is no connection, no attempt to connect, so no peer: StandAlone).

Let me prefix the remainder of my response with a "thank-you"; I was
pleasantly surprised by the quick response.

Regarding the explanation: Aha! That's what I thought, but I can't
determine why it would get EADDRINUSE at 3-something in the AM.

> > 1. On 'machineA', if it was not the primary, why did it synchronize?
>
> SyncSource does not imply Primary.
> Primary does not necessarily mean best and up-to-date data.
>
> it syncs, because it knows it has the better (newer) data.

Aha. OK, that clears that up.

> > 2. Why didn't it move back from 'StandAlone' to 'Primary' when the
> > determination had been made that its peer was definitely already in
> > Secondary?
>
> because "StandAlone" is a connection state,
> Primary is a node role.

I think I understand now.

> it has been "StandAlone Primary/Unknown".
> you "restart" it. I read this as
> /etc/init.d/drbd stop; /etc/init.d/start

More or less, that is correct.

> stop makes it secondary (otherwise it could not unconfigure it),
> then unconfigures it.
>
> start configures it.
>
> it does not, however, promote to primary.
> that would be the job of a cluster manager.
> you do have a cluster manager, don't you?

No, I don't have a traditional cluster set up here. I'm doing some
testing and probably doing some things that aren't among the usual
uses of drbd (no heartbeat, for example).
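
For reference, here is a minimal userspace sketch of what the -98 in the
explanation above means. It is not DRBD code and says nothing about why
the bind failed in this case; it only shows that a second bind() to an
address/port already held by a listening socket fails with EADDRINUSE
(errno 98 on Linux). The names try_listen and demo are hypothetical,
chosen just for this illustration.

```python
import errno
import socket


def try_listen(port, host="127.0.0.1"):
    """Try to bind() and listen() on host:port.

    Returns (0, socket) on success, or (-errno, None) on failure,
    mimicking the kernel-style negative error code seen in the log.
    Hypothetical helper, for illustration only.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        s.listen(1)
        return 0, s
    except OSError as e:
        s.close()
        return -e.errno, None


def demo():
    # Port 0 asks the kernel to pick any free port.
    first_rc, holder = try_listen(0)
    port = holder.getsockname()[1]

    # A second bind to the same port, while the first socket is still
    # open, is refused with EADDRINUSE.
    second_rc, dup = try_listen(port)
    if dup is not None:
        dup.close()
    holder.close()
    return first_rc, second_rc


if __name__ == "__main__":
    first_rc, second_rc = demo()
    print("first bind:", first_rc)
    print("second bind:", second_rc, "EADDRINUSE is", errno.EADDRINUSE)
```

In this sketch the conflict is created on purpose; on machineA the open
question is what, if anything, was holding the address at that moment.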

In this case, I've got what is essentially an opportunistic network
raid1 - when machineB comes up (infrequently) it gets a copy of only
what has changed, and when up to date it is a convenient mirror. If
machineA's backing store ever died (actually it did less than a week
ago) I can replace it, bring it and machineB up, and machineB will sync
back to machineA, and then I can reset the drbd on machineA as primary,
at which point I can use it again. I know, it's probably a little weird.

All that really remains is to try to understand why drbd failed a bind
call at 3:29:01. I can guarantee that no cron job ran and the machine
was otherwise idle, and have otherwise scoured the logs for anything
that might be relevant and found, unfortunately, nothing - drbd is the
only thing logged for several minutes in either direction.

--
Jon Nelson <jnelson-drbd at jamponi.net>