[DRBD-user] "local disk flush failed with status -5" on LVM

Iustin Pop iustin at google.com
Tue May 13 10:58:29 CEST 2008

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Tue, May 13, 2008 at 10:50:22AM +0200, Lars Ellenberg wrote:
> On Sat, May 10, 2008 at 12:28:00PM +0200, Iustin Pop wrote:
> > Philipp Reisner wrote:
> > > Am Sonntag, 4. Mai 2008 02:19:12 schrieb Wolfgang Denk:
> > > > Hi,
> > > >
> > > > I'm trying to run DRBD on top of a LV, and get flooded with above
> > > > error messages. I know this has been discussed before, see threads
> > > > starting at
> > > > http://lists.linbit.com/pipermail/drbd-user/2008-February/008665.html
> > > > and
> > > > http://lists.linbit.com/pipermail/drbd-user/2008-February/008519.html
> > > >
> > > > When this was discussed in February, it sounded (at least to me) as if
> > > > a fix was on the way, see
> > > > http://lists.linbit.com/pipermail/drbd-user/2008-February/008692.html
> > > >
> > > > However, even top of tree from the git repo still shows the same
> > > > behaviour.
> > > >
> > > > Am I missing something, or is this usage mode so exotic that nobody
> > > > cares?
> > > >
> > > 
> > > Hi Wolfgang,
> > > 
> > > That is actually a kernel bug, I think in 2.6.24. It was fixed later;
> > > I do not know by heart in which release. I guess it is fixed in 2.6.25.
> > > 
> > > Starting with 8.0.12 we offer a workaround for this in DRBD (and in
> > > 8.2.6, once I finally find the time to finish it):
> > > 
> > >   Add no-disk-flushes and no-md-flushes to your disk config.
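> > > 
> > >   A sketch of what that looks like in drbd.conf (the resource name
> > >   and everything outside the disk section are placeholders; only the
> > >   two options matter):
> > > 
> > >     resource r0 {
> > >       disk {
> > >         no-disk-flushes;
> > >         no-md-flushes;
> > >       }
> > >       # on <host> sections etc. unchanged
> > >     }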
> > 
> > Because this happens not only with LVM, but with any I/O subsystem that
> > returns wrong error codes from flushes (e.g. broken SCSI drivers or
> > controllers, I think), would it be sane to disable barriers
> > automatically after a certain number of errors?
> > 
> > (Looking at the barrier flush code, I see that only drbd_receiver.c has
> > code for auto-disabling in case of EOPNOTSUPP; drbd_actlog.c and
> > drbd_bitmap.c don't. Maybe these should have it too.)
> 
> hm?
> I think we do have a retry-and-disable-barriers in those places too.

I must be wrong then; I'm looking at the drbd 8.0 git tree, and I see in
drbd_bitmap.c:

        if (rw == WRITE) {
                /* swap back endianness */
                bm_lel_to_cpu(b);
                /* flush bitmap to stable storage */
                if (!test_bit(MD_NO_BARRIER, &mdev->flags))
                        blkdev_issue_flush(mdev->bc->md_bdev, NULL);

(around line 745). This just issues the flush; there is no retry or
auto-disable in place (it uses the same blkdev_issue_flush as
drbd_receiver.c, but the return value is never checked).
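
The kind of check I mean would look roughly like this (a sketch only,
mirroring the EOPNOTSUPP handling that drbd_receiver.c already has;
MD_NO_BARRIER and the surrounding names are from the 8.0 tree):

        /* flush bitmap to stable storage, but stop issuing flushes
         * once the backing device reports it cannot do them */
        if (!test_bit(MD_NO_BARRIER, &mdev->flags)) {
                if (blkdev_issue_flush(mdev->bc->md_bdev, NULL)
                    == -EOPNOTSUPP)
                        set_bit(MD_NO_BARRIER, &mdev->flags);
        }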

What am I missing here? Wrong git tree?

> > The reason I propose this is that with many deployments on different
> > machines it would be better to leave it always enabled at startup and
> > allow it to auto-disable if it sees EOPNOTSUPP
> 
> that is the way we do it.
> 
> > or too many other errors.
> 
> and that is what we don't.

Would it make sense to do it if no blkdev_issue_flush call is ever successful?

> > And people can't always track the latest upstream kernel...
> 
> if they are stuck with a kernel where DRBD spits out too much
> noise due to barrier requests throwing IO errors,
> then they have to disable use of barriers in the drbd config.

OK, let me explain some more. If you have deployments on the order of
hundreds of machines, with various types of controllers, it would be
easier to keep barriers enabled in the config everywhere and rely on
auto-disabling if *no single flush is ever successful*.
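
Roughly like this (a sketch, not a patch; md_flush_worked,
md_flush_failures and the threshold are made up for illustration):

        if (!test_bit(MD_NO_BARRIER, &mdev->flags)) {
                if (blkdev_issue_flush(mdev->bc->md_bdev, NULL) == 0) {
                        /* hypothetical field: remember that flushes
                         * do work on this backing device */
                        mdev->md_flush_worked = 1;
                } else if (!mdev->md_flush_worked &&
                           ++mdev->md_flush_failures >= 16) {
                        /* no flush has *ever* succeeded here; give up
                         * on barriers instead of flooding the log */
                        set_bit(MD_NO_BARRIER, &mdev->flags);
                }
        }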

What do you think?

regards,
iustin


