Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
+DRBD group, which seems to have many Linux disk gurus.

The problem: MD seems to refuse to submit read I/O while the page flusher is submitting write I/O.

I'm completely stumped: no matter how hard I tweak the deadline scheduler, it doesn't seem to make any difference at all, and the noop scheduler shows the same basic symptom. The only logical explanation I can see is that during a big write storm (generated by the page flusher), MD is not submitting any read I/O to the underlying devices, so the scheduler cannot prioritize reads and read latency is destroyed. I can't be the only one being bitten by this.

If anyone wants to reproduce what's happening, here's a good way (Ubuntu 10.04 LTS, kernel 2.6.32; I also tested 2.6.35 and 2.6.38, which have the same issue):

1. Set up the page-cache limits:

   echo $((64*1024*1024)) > /proc/sys/vm/dirty_bytes
   echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes

2. Set up a RAID10 with 4 disks. Run something that generates constant read I/O, then add a dd generating constant write I/O. Size doesn't matter, as long as some I/O is going on (a consolidated script is sketched below).

3. Watch /proc/meminfo until the dirty page count reaches 16M, and watch iostat -x -d 1 when the flusher flushes the requests. You will see a period during which NO read I/O completes at all on any of the disks; they are all doing write I/O. This defeats the whole point of using the page cache as a background write-back cache.

If anyone has a similar issue and knows how to deal with it, please tell me. Thanks in advance!
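Putting those three steps together, a minimal reproduction script might look like the following. This is a sketch only: the member disks sdb-sde, the mount point /mnt/test, and the array geometry are assumptions to adjust for your own setup.

    #!/bin/bash
    # Sketch: reproduce MD read starvation during a page-cache flush.

    # Step 1: shrink the dirty-page thresholds so the flusher kicks in early.
    echo $((64*1024*1024)) > /proc/sys/vm/dirty_bytes
    echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes

    # Step 2: a 4-disk RAID10 (device names and filesystem are placeholders).
    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde
    mkfs.ext4 /dev/md0
    mount /dev/md0 /mnt/test

    # Constant read load (direct I/O so the reads must hit the disks)...
    dd if=/dev/md0 of=/dev/null bs=8k iflag=direct &
    # ...plus constant buffered write load to dirty the page cache.
    dd if=/dev/zero of=/mnt/test/writefile bs=8k count=1000000 &

    # Step 3: watch Dirty: in /proc/meminfo approach 16M, then watch the
    # flush; reads on all member disks stall while the writes are serviced.
    iostat -x -d 1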
On Wed, Dec 7, 2011 at 10:31 PM, Yucong Sun (叶雨飞) <sunyucong at gmail.com> wrote:
> Sadly, the patch didn't help at all; see the following:
>
> Device:  rrqm/s  wrqm/s    r/s     w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00    0.00   0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
> sdb        0.00 2042.00   0.00  345.00      0.00  64112.00   185.83    93.13   93.36   2.12  73.00
> sdd        0.00 1704.00   7.00  156.00     56.00  12496.00    77.01    95.71  146.20   3.62  59.00
> sdc        0.00 1518.00  16.00  185.00    128.00   9936.00    50.07    98.20  157.41   3.13  63.00
> sde      222.00 1997.00 194.00  189.00  51568.00  16488.00   177.69    81.54   99.09   2.25  86.00
> md0        0.00    0.00  37.00 4096.00    296.00  32768.00     8.00     0.00    0.00   0.00   0.00
>
> Device:  rrqm/s  wrqm/s    r/s     w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00    0.00   0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
> sdb        0.00  150.00   0.00  194.00      0.00  33336.00   171.84    34.91  492.84   4.59  89.00
> sdd        0.00    0.00   0.00  138.00      0.00   3488.00    25.28    32.68  757.75   4.06  56.00
> sdc        0.00    0.00   3.00  127.00     24.00   4704.00    36.37    33.68  771.08   4.54  59.00
> sde      222.00    0.00  90.00   84.00  39936.00   1672.00   239.13    23.73  386.90   4.08  71.00
> md0        0.00    0.00   2.00    0.00     16.00      0.00     8.00     0.00    0.00   0.00   0.00
>
> Device:  rrqm/s  wrqm/s    r/s     w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00    0.00   0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
> sdb        0.00  235.00   0.00  188.00      0.00  54024.00   287.36     0.49    3.78   1.65  31.00
> sdd        0.00    0.00  27.00    0.00    216.00      0.00     8.00     0.15    5.56   5.56  15.00
> sdc        0.00    0.00  46.00    0.00    368.00      0.00     8.00     0.32    6.52   6.96  32.00
> sde      165.00    0.00 200.00    0.00  43480.00      0.00   217.40     7.63   38.15   2.00  40.00
> md0        0.00    0.00 101.00    0.00    808.00      0.00     8.00     0.00    0.00   0.00   0.00
>
> I poked around and found this when the big flush comes in:
>
> Every 1.0s: cat /sys/block/sdb/stat /sys/block/sdc/stat /sys/block/sdd/stat /sys/block/sde/stat /sys/block/md0/stat        Wed Dec  7 22:26:14 2011
>
> sdb:      32      10       336      270  2792623 5501730 783168880 254952160  284 4815060 255014270
> sdc: 2993481 2222268 499586400 94384090   493165 1842192  18671608 271311440  290 9942910 365758660
> sdd:  691727      19   5533896  1507300   501261 1838497  18706544 276987570  262 3254420 278552760
> sde: 1458797 1404948 281875858 49664210   483386 1841832  18588928 256627020  259 4997270 306348180
> md0: 2797538       0  22380058        0  4652939       0  37223512         0    0       0         0
>
> Every downstream disk shows a huge jump in in-flight I/O (the 9th field), where it is usually just 0 or 1 the whole time. The kernel documentation says this count does not include queued I/O, so I think the problem is that the I/O scheduler issued too many requests to the devices without throttling reads against writes, which basically saturated the disks so that no reads could be scheduled. Do you know why this would happen?
>
> Here's my relevant scheduler tweak:
>
> for disk in /sys/block/sd[bcde]
> do
>     echo "changing $disk scheduler"
>     echo "deadline" > $disk/queue/scheduler
>
>     echo "changing $disk nr_requests to 4096"
>     echo 4096 > $disk/queue/nr_requests
>
>     echo "setting read_ahead_kb to 0"
>     echo 0 > $disk/queue/read_ahead_kb
>
>     echo "tweaking deadline io"
>     echo 32 > $disk/queue/iosched/fifo_batch
>     echo 30 > $disk/queue/iosched/read_expire
>     echo 20000 > $disk/queue/iosched/write_expire
>     echo 256 > $disk/queue/iosched/writes_starved
> done
>
> echo 0 > /sys/block/md0/queue/read_ahead_kb
>
> My workload profile is 100% random 8K I/O.
>
> Come to think of it, this looks mostly like an I/O scheduling issue. Does nr_requests mean anything to MD? It's not possible to adjust it for md0 either; is that the reason MD can't accept more reads?
>
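An aside on the numbers above: the in-flight count being watched is field 9 of /sys/block/<dev>/stat, which per Documentation/block/stat.txt excludes requests still sitting in the queue but not yet issued to the driver. A minimal way to watch just that field on the member disks used in this thread (device names as above):

    # Print "path: in-flight" once per second for each RAID member.
    watch -n1 "awk '{print FILENAME \": \" \$9}' /sys/block/sd[b-e]/stat"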
> On Wed, Dec 7, 2011 at 4:10 PM, NeilBrown <neilb at suse.de> wrote:
>>
>> On Wed, 7 Dec 2011 15:37:30 -0800 Yucong Sun (叶雨飞) <sunyucong at gmail.com>
>> wrote:
>>
>> > Neil, I can't compile the latest MD against 2.6.32, and that commit can't
>> > be patched into 2.6.32 directly either; can you help me with this?
>> >
>>
>> This should do it.
>>
>> NeilBrown
>>
>> commit ef54b7cf955dc3b7d33248e8591b1a00b4fa998c
>> Author: NeilBrown <neilb at suse.de>
>> Date:   Tue Oct 11 16:50:01 2011 +1100
>>
>>     md: add proper write-congestion reporting to RAID1 and RAID10.
>>
>>     RAID1 and RAID10 handle write requests by queuing them for handling by
>>     a separate thread. This is because when a write-intent bitmap is
>>     active we might need to update the bitmap first, so it is good to
>>     queue a lot of writes, then do one big bitmap update for them all.
>>
>>     However writeback requires devices to appear to be congested after a
>>     while so it can make some guesstimate of throughput. The infinite
>>     queue defeats that (note that RAID5 already has a finite queue so
>>     it doesn't suffer from this problem).
>>
>>     So impose a limit on the number of pending write requests. By default
>>     it is 1024, which seems to be generally suitable. Make it configurable
>>     via a module parameter just in case someone finds a regression.
>>
>>     Signed-off-by: NeilBrown <neilb at suse.de>
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index e07ce2e..fe7ae3c 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -50,6 +50,11 @@
>>   */
>>  #define NR_RAID1_BIOS 256
>>
>> +/* When there are this many requests queued to be written by
>> + * the raid1 thread, we become 'congested' to provide back-pressure
>> + * for writeback.
>> + */
>> +static int max_queued_requests = 1024;
>>
>>  static void unplug_slaves(mddev_t *mddev);
>>
>> @@ -576,7 +581,11 @@ static int raid1_congested(void *data, int bits)
>>      conf_t *conf = mddev->private;
>>      int i, ret = 0;
>>
>> +    if ((bits & (1 << BDI_async_congested)) &&
>> +        conf->pending_count >= max_queued_requests)
>> +        return 1;
>> +
>>      if (mddev_congested(mddev, bits))
>>          return 1;
>>
>>      rcu_read_lock();
>> @@ -613,10 +619,12 @@ static int flush_pending_writes(conf_t *conf)
>>          struct bio *bio;
>>          bio = bio_list_get(&conf->pending_bio_list);
>>          blk_remove_plug(conf->mddev->queue);
>> +        conf->pending_count = 0;
>>          spin_unlock_irq(&conf->device_lock);
>>          /* flush any pending bitmap writes to
>>           * disk before proceeding w/ I/O */
>>          bitmap_unplug(conf->mddev->bitmap);
>> +        wake_up(&conf->wait_barrier);
>>
>>          while (bio) { /* submit pending writes */
>>              struct bio *next = bio->bi_next;
>> @@ -789,6 +797,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>      int cpu;
>>      bool do_barriers;
>>      mdk_rdev_t *blocked_rdev;
>> +    int cnt = 0;
>>
>>      /*
>>       * Register the new request and wait if the reconstruction
>> @@ -864,6 +873,11 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>      /*
>>       * WRITE:
>>       */
>> +    if (conf->pending_count >= max_queued_requests) {
>> +        md_wakeup_thread(mddev->thread);
>> +        wait_event(conf->wait_barrier,
>> +                   conf->pending_count < max_queued_requests);
>> +    }
>>      /* first select target devices under spinlock and
>>       * inc refcount on their rdev. Record them by setting
>>       * bios[x] to bio
>> @@ -970,6 +984,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>          atomic_inc(&r1_bio->remaining);
>>
>>          bio_list_add(&bl, mbio);
>> +        cnt++;
>>      }
>>      kfree(behind_pages); /* the behind pages are attached to the bios now */
>>
>> @@ -978,6 +993,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>      spin_lock_irqsave(&conf->device_lock, flags);
>>      bio_list_merge(&conf->pending_bio_list, &bl);
>>      bio_list_init(&bl);
>> +    conf->pending_count += cnt;
>>
>>      blk_plug_device(mddev->queue);
>>      spin_unlock_irqrestore(&conf->device_lock, flags);
>> @@ -2021,7 +2037,7 @@ static int run(mddev_t *mddev)
>>
>>      bio_list_init(&conf->pending_bio_list);
>>      bio_list_init(&conf->flushing_bio_list);
>> -
>> +    conf->pending_count = 0;
>>
>>      mddev->degraded = 0;
>>      for (i = 0; i < conf->raid_disks; i++) {
>> @@ -2317,3 +2333,5 @@ MODULE_LICENSE("GPL");
>>  MODULE_ALIAS("md-personality-3"); /* RAID1 */
>>  MODULE_ALIAS("md-raid1");
>>  MODULE_ALIAS("md-level-1");
>> +
>> +module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
>> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
>> index e87b84d..520288c 100644
>> --- a/drivers/md/raid1.h
>> +++ b/drivers/md/raid1.h
>> @@ -38,6 +38,7 @@ struct r1_private_data_s {
>>      /* queue of writes that have been unplugged */
>>      struct bio_list flushing_bio_list;
>>
>> +    int pending_count;
>>      /* for use when syncing mirrors: */
>>
>>      spinlock_t resync_lock;
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index c2cb7b8..4c7d9b5 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -59,6 +59,11 @@ static void unplug_slaves(mddev_t *mddev);
>>
>>  static void allow_barrier(conf_t *conf);
>>  static void lower_barrier(conf_t *conf);
>> +/* When there are this many requests queued to be written by
>> + * the raid10 thread, we become 'congested' to provide back-pressure
>> + * for writeback.
>> + */
>> +static int max_queued_requests = 1024;
>>
>>  static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
>>  {
>> @@ -631,6 +636,10 @@ static int raid10_congested(void *data, int bits)
>>      conf_t *conf = mddev->private;
>>      int i, ret = 0;
>>
>> +    if ((bits & (1 << BDI_async_congested)) &&
>> +        conf->pending_count >= max_queued_requests)
>> +        return 1;
>> +
>>      if (mddev_congested(mddev, bits))
>>          return 1;
>>      rcu_read_lock();
>> @@ -660,10 +669,12 @@ static int flush_pending_writes(conf_t *conf)
>>          struct bio *bio;
>>          bio = bio_list_get(&conf->pending_bio_list);
>>          blk_remove_plug(conf->mddev->queue);
>> +        conf->pending_count = 0;
>>          spin_unlock_irq(&conf->device_lock);
>>          /* flush any pending bitmap writes to disk
>>           * before proceeding w/ I/O */
>>          bitmap_unplug(conf->mddev->bitmap);
>> +        wake_up(&conf->wait_barrier);
>>
>>          while (bio) { /* submit pending writes */
>>              struct bio *next = bio->bi_next;
>> @@ -802,6 +813,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>      struct bio_list bl;
>>      unsigned long flags;
>>      mdk_rdev_t *blocked_rdev;
>> +    int cnt = 0;
>>
>>      if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
>>          bio_endio(bio, -EOPNOTSUPP);
>> @@ -894,6 +906,11 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>      /*
>>       * WRITE:
>>       */
>> +    if (conf->pending_count >= max_queued_requests) {
>> +        md_wakeup_thread(mddev->thread);
>> +        wait_event(conf->wait_barrier,
>> +                   conf->pending_count < max_queued_requests);
>> +    }
>>      /* first select target devices under rcu_lock and
>>       * inc refcount on their rdev. Record them by setting
>>       * bios[x] to bio
>> @@ -957,6 +974,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>
>>          atomic_inc(&r10_bio->remaining);
>>          bio_list_add(&bl, mbio);
>> +        cnt++;
>>      }
>>
>>      if (unlikely(!atomic_read(&r10_bio->remaining))) {
>> @@ -970,6 +988,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>      spin_lock_irqsave(&conf->device_lock, flags);
>>      bio_list_merge(&conf->pending_bio_list, &bl);
>>      blk_plug_device(mddev->queue);
>> +    conf->pending_count += cnt;
>>      spin_unlock_irqrestore(&conf->device_lock, flags);
>>
>>      /* In case raid10d snuck in to freeze_array */
>> @@ -2318,3 +2337,5 @@ MODULE_LICENSE("GPL");
>>  MODULE_ALIAS("md-personality-9"); /* RAID10 */
>>  MODULE_ALIAS("md-raid10");
>>  MODULE_ALIAS("md-level-10");
>> +
>> +module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
>> diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
>> index 59cd1ef..e6e1613 100644
>> --- a/drivers/md/raid10.h
>> +++ b/drivers/md/raid10.h
>> @@ -39,7 +39,7 @@ struct r10_private_data_s {
>>      struct list_head retry_list;
>>      /* queue pending writes and submit them on unplug */
>>      struct bio_list pending_bio_list;
>> -
>> +    int pending_count;
>>
>>      spinlock_t resync_lock;
>>      int nr_pending;
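Since the patch declares max_queued_requests with module_param(..., S_IRUGO|S_IWUSR), the limit can be inspected and tuned at runtime through sysfs without rebuilding. A sketch of how that would look, assuming raid1/raid10 are loaded as modules (the value 512 is just an example, not a recommendation from the thread):

    # Read the current write-queue limit (default 1024)...
    cat /sys/module/raid10/parameters/max_queued_requests

    # ...and lower it so writeback sees back-pressure sooner.
    echo 512 > /sys/module/raid10/parameters/max_queued_requests
    echo 512 > /sys/module/raid1/parameters/max_queued_requests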