[Drbd-dev] [DRBD-user] A question about drbd hang in conn_disconnect

David Pfarrer dpfarrer at gnubio.com
Thu Apr 16 15:55:39 CEST 2015


unsubscribe

-------------------------------------------------------------
David Pfarrer| Sr IT Support Tech | Bio-Rad Laboratories, Inc.
1 Kendall Square, Suite B14201
Cambridge, MA 02139
E-Mail: dpfarrer at gnubio.com
TEL: 617-500-1838
-------------------------------------------------------------

On Thu, Apr 16, 2015 at 9:46 AM, Lars Ellenberg <lars.ellenberg at linbit.com>
wrote:

> On Mon, Apr 13, 2015 at 05:45:25PM -0600, Fang Sun wrote:
> > Hi drbd developers,
> >
> > After some research and tests I think I have found the cause of this
> > problem and a possible fix in drbd.
> > Would you please check if my theory is correct?
> >
> >
> > Let me use 8.4.6 as the code base when I explain it.
> > When conn_disconnect hangs, it is stuck at drbd_receiver.c line 5178:
> > static int drbd_disconnected(struct drbd_peer_device *peer_device)
> > {
> >     ........
> >     wait_event(device->misc_wait, !test_bit(BITMAP_IO, &device->flags));
> > }
> >
> > The reason is that the device has the BITMAP_IO flag set.
> >
> >
> > The reason why the BITMAP_IO flag is set and not cleared is:
> > The disk state changes when the network is disconnected, and
> > after_state_ch() is called.
> >
> > At drbd_state.c line 1949, drbd_queue_bitmap_io() is called in
> > after_state_ch().
> >
> > I think the real cause is in drbd_queue_bitmap_io(), at drbd_main.c
> > line 3641:
> > void drbd_queue_bitmap_io(struct drbd_device *device,
> >                           int (*io_fn)(struct drbd_device *),
> >                           void (*done)(struct drbd_device *, int),
> >                           char *why, enum bm_flag flags)
> > {
> >     .........
> >     set_bit(BITMAP_IO, &device->flags);
> >     if (atomic_read(&device->ap_bio_cnt) == 0) {
> >         if (!test_and_set_bit(BITMAP_IO_QUEUED, &device->flags))
> >             drbd_queue_work(&first_peer_device(device)->connection->sender_work,
> >                             &device->bm_io_work.w);
> >     }
> >     ........
> > }
> >
> > The only code that clears BITMAP_IO is in device->bm_io_work.w
> > (w_bitmap_io). But when atomic_read(&device->ap_bio_cnt) != 0, the
> > BITMAP_IO flag is set while bm_io_work.w is not queued at that point,
> > so it is not called.
> > drbd_disconnected() then blocks.
> >
> > Should we move set_bit(BITMAP_IO, &device->flags) so that it sits
> > right before the drbd_queue_work() call?
>
> No.  That would be the wrong fix,
> and cause potential inconsistencies later.
>
> It may need to be fixed, but in a different way.
>
> Let me (reproduce locally ... and) think about that for a bit.
>
> Thanks,
>
> --
> : Lars Ellenberg
> : http://www.LINBIT.com | Your Way to High Availability
> : DRBD, Linux-HA  and  Pacemaker support and consulting
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> __
> please don't Cc me, but send to list   --   I'm subscribed
>