[DRBD-user] receiver & asender dying after a stonith recovery

Dave Dykstra dwdha at drdykstra.us
Wed Mar 23 17:02:29 CET 2005



Thanks for the explanation.  I think it's something that I shouldn't
worry about happening on the production cluster because the scenario of
both network connections going down on the live server, necessitating
a stonith hit on it, seems pretty unlikely.  It seems much more likely
that the live server will fail catastrophically all at once and then we
wouldn't have this problem.  I just need to come up with some other test
scenario to force a stonith hit, anything that will halt the live server
without removing the power.

- Dave

On Wed, Mar 23, 2005 at 11:37:04AM +0100, Philipp Reisner wrote:
> On Tuesday, 22 March 2005 22:55, Dave Dykstra wrote:
> > I've been working on getting heartbeat's stonith to function properly on
> > my cluster that's using drbd.  I've got it to the point where I can unplug
> > the two network connections on the live server (one is a direct connect
> > between the two servers, which drbd uses, and the other is the main company
> > network) and stonith will temporarily remove power from the live server.
> > I always plug in the networks again as soon as the power comes back up.
> > The problem I'm having is that almost every time when that server comes
> > back up, drbd on the new live server does not re-establish communication
> > and the receiver and asender are not running.  If I then manually run
> > 'drbdadm adjust all' on the new live server everything comes back up.
> > Below is /var/adm/messages from one of the cases.  Time 15:19:53 is when I
> > ran 'drbdadm adjust'.  Can anybody explain what's going on?  Am I supposed
> > to be having heartbeat doing something more so that 'drbdadm adjust'
> > will run?
> >
> 
> I can. 
> 
> I think you have found the weak point in the design of the generation
> counters, which I became aware of in January.
> 
> Actually you have a double fault:
> 
>  1st Complete Network failure
>  2nd Power failure on the former primary.
> 
> You might have a look at 
> http://www.drbd.org/fileadmin/drbd/publications/drbd_paper_for_NLUUG_2001.pdf
> and other more recent papers, to see what happens.
> 
> I am in the process of coming up with a new scheme for identifying data
> generations for drbd-0.8. For drbd-0.7, things will stay as they are.
> 
> Item 16 of http://svn.drbd.org/drbd/trunk/ROADMAP is still wrong
> and unfinished, but it outlines the ideas for how to get this right in
> the future.
> 
> -Philipp


