[DRBD-user] A question about drbd hang in conn_disconnect

Lars Ellenberg lars.ellenberg at linbit.com
Thu Apr 16 15:46:44 CEST 2015


On Mon, Apr 13, 2015 at 05:45:25PM -0600, Fang Sun wrote:
> Hi drbd developers,
> 
> After some research and testing, I think I have found the cause of this
> problem and a possible fix in drbd.
> Would you please check whether my theory is correct?
> 
> 
> I will use 8.4.6 as the code base for the explanation.
> When conn_disconnect hangs, it hangs at drbd_receiver.c line 5178:
> static int drbd_disconnected(struct drbd_peer_device *peer_device)
> {
>     ........
>     wait_event(device->misc_wait, !test_bit(BITMAP_IO, &device->flags));
> }
> 
> The reason is that the device has the BITMAP_IO flag set.
> 
> 
> BITMAP_IO is set and never cleared for the following reason:
> the disk state changes when the network is disconnected, and
> after_state_ch() is called.
> 
> At drbd_state.c line 1949, drbd_queue_bitmap_io() is called in
> after_state_ch().
> 
> I think the real cause is in drbd_queue_bitmap_io(), drbd_main.c line 3641:
> void drbd_queue_bitmap_io(struct drbd_device *device,
>                           int (*io_fn)(struct drbd_device *),
>                           void (*done)(struct drbd_device *, int),
>                           char *why, enum bm_flag flags)
> {
>     .........
>     set_bit(BITMAP_IO, &device->flags);
>     if (atomic_read(&device->ap_bio_cnt) == 0) {
>         if (!test_and_set_bit(BITMAP_IO_QUEUED, &device->flags))
>             drbd_queue_work(&first_peer_device(device)->connection->sender_work,
>                             &device->bm_io_work.w);
>     }
>     ........
> }
> 
> The only code that clears BITMAP_IO is device->bm_io_work.w (w_bitmap_io).
> But when atomic_read(&device->ap_bio_cnt) != 0, the BITMAP_IO flag is set
> while bm_io_work.w is not queued, so the flag is never cleared.
> drbd_disconnected() then blocks.
> 
> Should we move set_bit(BITMAP_IO, &device->flags) so that it runs just
> before drbd_queue_work(), only when the work is actually queued?

No.  That would be the wrong fix,
and cause potential inconsistencies later.

It may need to be fixed, but in a different way.

Let me (reproduce locally ... and) think about that for a bit.
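
To make the reported scenario concrete, here is a minimal user-space
sketch of the flag handling described in the quoted text. It is not DRBD
code: struct toy_device and all toy_* functions are hypothetical names
made up for illustration, and the sketch models only the lines quoted
above, nothing else in the driver.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few fields of struct drbd_device
 * that the quoted code touches. Not the real structure. */
struct toy_device {
    bool bitmap_io;        /* models the BITMAP_IO flag */
    bool bitmap_io_queued; /* models the BITMAP_IO_QUEUED flag */
    int ap_bio_cnt;        /* models device->ap_bio_cnt */
    bool work_pending;     /* models bm_io_work.w sitting on sender_work */
};

/* Models the quoted part of drbd_queue_bitmap_io(): the flag is always
 * set, but the work item is queued only if no application I/O is in
 * flight. */
static void toy_queue_bitmap_io(struct toy_device *d)
{
    d->bitmap_io = true;
    if (d->ap_bio_cnt == 0) {
        if (!d->bitmap_io_queued) {
            d->bitmap_io_queued = true;
            d->work_pending = true;
        }
    }
}

/* Models the worker running w_bitmap_io(), which according to the
 * report is the only place that clears BITMAP_IO. */
static void toy_run_worker(struct toy_device *d)
{
    if (d->work_pending) {
        d->work_pending = false;
        d->bitmap_io_queued = false;
        d->bitmap_io = false;
    }
}

/* Models the wait_event() condition in drbd_disconnected():
 * returns true if the caller would keep waiting. */
static bool toy_disconnect_would_block(const struct toy_device *d)
{
    return d->bitmap_io;
}

/* Runs the reported sequence for a given number of in-flight
 * application bios and reports whether the disconnect would block. */
static bool toy_scenario(int ap_bio_cnt)
{
    struct toy_device d = { false, false, ap_bio_cnt, false };

    toy_queue_bitmap_io(&d);
    toy_run_worker(&d);
    return toy_disconnect_would_block(&d);
}
```

In this model, toy_scenario(0) does not block, while toy_scenario(1)
does, which matches the behaviour described in the report. Any other
code path that might queue the work later is deliberately left out, so
the sketch only illustrates the theory and does not confirm it.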

Thanks,

-- 
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA  and  Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


