[DRBD-user] Kernel hung on DRBD / MD RAID

Eduardo Diaz - Gmail ediazrod at gmail.com
Mon Feb 13 19:05:32 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Can you describe the configuration to us? Where is the software RAID1? If
you can, please send the mount points. What filesystem?
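
For example, the output of something like the following would help (assuming
the mdadm, LVM and DRBD userland tools are available on the node):

  cat /proc/mdstat   # RAID arrays and any running resync/check
  pvs; vgs; lvs      # LVM layout on top of the RAID
  cat /proc/drbd     # DRBD devices and their connection/disk state
  drbdadm dump       # effective DRBD configuration
  mount              # mount points and filesystems in use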

regards!

On Sun, Feb 12, 2012 at 9:54 AM, Andreas Bauer <ab at voltage.de> wrote:

> Hello,
>
> this morning I had a nasty situation on a Primary/Secondary DRBD cluster.
> First the setup:
>
> Kernel 3.1.0 / DRBD 8.4.11
>
> KVM virtual machines, running on top of
> DRBD (protocol A, sndbuf-size 10M, data-integrity-alg md5, external meta
> data, meta data on different physical disk), running on top of
> LVM2, running on top of
> Software RAID 1
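>
> A resource definition matching that stack looks roughly like the sketch
> below (the resource name, LV, metadata device and addresses are only
> placeholders; vm-master and the tuning options are the ones listed above):
>
>   resource vm1 {
>     net {
>       protocol           A;
>       sndbuf-size        10M;
>       data-integrity-alg md5;
>     }
>     on vm-master {
>       device    /dev/drbd0;
>       disk      /dev/vg0/vm1;      # LV on top of the RAID1
>       meta-disk /dev/sdc1[0];      # external metadata on a different physical disk
>       address   192.168.0.1:7788;
>     }
>     on vm-peer {
>       device    /dev/drbd0;
>       disk      /dev/vg0/vm1;
>       meta-disk /dev/sdc1[0];
>       address   192.168.0.2:7788;
>     }
>   }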
>
> Last night there was an mdadm verify run which was still ongoing this
> morning, when this happened:
>
> Feb 12 07:03:27 vm-master kernel: [2009644.546758] INFO: task kvm:21152
> blocked for more than 120 seconds.
> Feb 12 07:03:27 vm-master kernel: [2009644.546813] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 12 07:03:27 vm-master kernel: [2009644.546896] kvm             D
> ffff88067fc92f40     0 21152      1 0x00000000
> Feb 12 07:03:27 vm-master kernel: [2009644.546901]  ffff8806633a4fa0
> 0000000000000082 ffff880600000000 ffff8806668ec0c0
> Feb 12 07:03:27 vm-master kernel: [2009644.546904]  0000000000012f40
> ffff88017ae01fd8 ffff88017ae01fd8 ffff8806633a4fa0
> Feb 12 07:03:27 vm-master kernel: [2009644.546908]  ffff88017ae01fd8
> 0000000181070175 0000000000000046 ffff8806633225c0
> Feb 12 07:03:27 vm-master kernel: [2009644.546911] Call Trace:
> Feb 12 07:03:27 vm-master kernel: [2009644.546925]  [<ffffffffa00939d1>] ?
> wait_barrier+0x87/0xc0 [raid1]
> Feb 12 07:03:27 vm-master kernel: [2009644.546931]  [<ffffffff8103e3ca>] ?
> try_to_wake_up+0x192/0x192
> Feb 12 07:03:27 vm-master kernel: [2009644.546935]  [<ffffffffa0096831>] ?
> make_request+0x111/0x9c8 [raid1]
> Feb 12 07:03:27 vm-master kernel: [2009644.546941]  [<ffffffffa0102c76>] ?
> clone_bio+0x43/0xcb [dm_mod]
> Feb 12 07:03:27 vm-master kernel: [2009644.546946]  [<ffffffffa010386f>] ?
> __split_and_process_bio+0x4f4/0x506 [dm_mod]
> Feb 12 07:03:27 vm-master kernel: [2009644.546952]  [<ffffffffa00edd93>] ?
> md_make_request+0xce/0x1c3 [md_mod]
> Feb 12 07:03:27 vm-master kernel: [2009644.546956]  [<ffffffff81188919>] ?
> generic_make_request+0x270/0x2ea
> Feb 12 07:03:27 vm-master kernel: [2009644.546960]  [<ffffffff81188a66>] ?
> submit_bio+0xd3/0xf1
> Feb 12 07:03:27 vm-master kernel: [2009644.546964]  [<ffffffff81119181>] ?
> __bio_add_page.part.12+0x135/0x1ed
> Feb 12 07:03:27 vm-master kernel: [2009644.546967]  [<ffffffff8111bbc1>] ?
> dio_bio_submit+0x6c/0x8a
> Feb 12 07:03:27 vm-master kernel: [2009644.546970]  [<ffffffff8111becd>] ?
> dio_send_cur_page+0x6e/0x91
> Feb 12 07:03:27 vm-master kernel: [2009644.546972]  [<ffffffff8111bfa3>] ?
> submit_page_section+0xb3/0x11a
> Feb 12 07:03:27 vm-master kernel: [2009644.546975]  [<ffffffff8111c7d8>] ?
> __blockdev_direct_IO+0x68a/0x995
> Feb 12 07:03:27 vm-master kernel: [2009644.546978]  [<ffffffff8111a836>] ?
> blkdev_direct_IO+0x4e/0x53
> Feb 12 07:03:27 vm-master kernel: [2009644.546981]  [<ffffffff8111ab5b>] ?
> blkdev_get_block+0x5b/0x5b
> Feb 12 07:03:27 vm-master kernel: [2009644.546985]  [<ffffffff810b0fdb>] ?
> generic_file_direct_write+0xdc/0x146
> Feb 12 07:03:27 vm-master kernel: [2009644.546987]  [<ffffffff810b11d9>] ?
> __generic_file_aio_write+0x194/0x278
> Feb 12 07:03:27 vm-master kernel: [2009644.546992]  [<ffffffff810131f1>] ?
> paravirt_read_tsc+0x5/0x8
> Feb 12 07:03:27 vm-master kernel: [2009644.546995]  [<ffffffff810f4501>] ?
> rw_copy_check_uvector+0x48/0xf8
> Feb 12 07:03:27 vm-master kernel: [2009644.546998]  [<ffffffff8111ac12>] ?
> bd_may_claim+0x2e/0x2e
> Feb 12 07:03:27 vm-master kernel: [2009644.547000]  [<ffffffff8111ac31>] ?
> blkdev_aio_write+0x1f/0x61
> Feb 12 07:03:27 vm-master kernel: [2009644.547003]  [<ffffffff8111ac12>] ?
> bd_may_claim+0x2e/0x2e
> Feb 12 07:03:27 vm-master kernel: [2009644.547005]  [<ffffffff811249e8>] ?
> aio_rw_vect_retry+0x70/0x18e
> Feb 12 07:03:27 vm-master kernel: [2009644.547008]  [<ffffffff81124978>] ?
> lookup_ioctx+0x53/0x53
> Feb 12 07:03:27 vm-master kernel: [2009644.547010]  [<ffffffff811253fe>] ?
> aio_run_iocb+0x70/0x11b
> Feb 12 07:03:27 vm-master kernel: [2009644.547013]  [<ffffffff8112645b>] ?
> do_io_submit+0x442/0x4d7
> Feb 12 07:03:27 vm-master kernel: [2009644.547017]  [<ffffffff81332e12>] ?
> system_call_fastpath+0x16/0x1b
>
> Feb 12 07:03:27 vm-master kernel: [2009644.547265] INFO: task
> md1_resync:15384 blocked for more than 120 seconds.
> Feb 12 07:03:27 vm-master kernel: [2009644.547319] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 12 07:03:27 vm-master kernel: [2009644.547411] md1_resync      D
> ffff88067fc12f40     0 15384      2 0x00000000
> Feb 12 07:03:27 vm-master kernel: [2009644.547414]  ffff880537dc0ee0
> 0000000000000046 0000000000000000 ffffffff8160d020
> Feb 12 07:03:27 vm-master kernel: [2009644.547418]  0000000000012f40
> ffff8801212affd8 ffff8801212affd8 ffff880537dc0ee0
> Feb 12 07:03:27 vm-master kernel: [2009644.547421]  0000000000011210
> ffffffff81070175 0000000000000046 ffff8806633225c0
> Feb 12 07:03:27 vm-master kernel: [2009644.547424] Call Trace:
> Feb 12 07:03:27 vm-master kernel: [2009644.547429]  [<ffffffff81070175>] ?
> arch_local_irq_save+0x11/0x17
> Feb 12 07:03:27 vm-master kernel: [2009644.547433]  [<ffffffffa0093917>] ?
> raise_barrier+0x11a/0x14d [raid1]
> Feb 12 07:03:27 vm-master kernel: [2009644.547436]  [<ffffffff8103e3ca>] ?
> try_to_wake_up+0x192/0x192
> Feb 12 07:03:27 vm-master kernel: [2009644.547440]  [<ffffffffa0095fc7>] ?
> sync_request+0x192/0x70b [raid1]
> Feb 12 07:03:27 vm-master kernel: [2009644.547446]  [<ffffffffa00f147c>] ?
> md_do_sync+0x760/0xb64 [md_mod]
> Feb 12 07:03:27 vm-master kernel: [2009644.547450]  [<ffffffff8103493d>] ?
> set_task_rq+0x23/0x35
> Feb 12 07:03:27 vm-master kernel: [2009644.547454]  [<ffffffff8105ec6b>] ?
> add_wait_queue+0x3c/0x3c
> Feb 12 07:03:27 vm-master kernel: [2009644.547459]  [<ffffffffa00ee28a>] ?
> md_thread+0x101/0x11f [md_mod]
> Feb 12 07:03:27 vm-master kernel: [2009644.547464]  [<ffffffffa00ee189>] ?
> md_rdev_init+0xea/0xea [md_mod]
> Feb 12 07:03:27 vm-master kernel: [2009644.547467]  [<ffffffff8105e625>] ?
> kthread+0x76/0x7e
> Feb 12 07:03:27 vm-master kernel: [2009644.547470]  [<ffffffff81334f74>] ?
> kernel_thread_helper+0x4/0x10
> Feb 12 07:03:27 vm-master kernel: [2009644.547473]  [<ffffffff8105e5af>] ?
> kthread_worker_fn+0x139/0x139
> Feb 12 07:03:27 vm-master kernel: [2009644.547475]  [<ffffffff81334f70>] ?
> gs_change+0x13/0x13
>
> Also, at 06:50 one of the virtual machines started to generate heavy I/O
> load (a backup). This, together with the resync run, probably set the scene
> for this to happen.
>
> Afterwards all DRBD devices were "blocking", i.e. all virtual machines
> running on these devices were unresponsive and had to be shut down "cold".
>
> I also tried to disconnect and reconnect the DRBD devices to the peer, but
> to no avail. The server had to be restarted.
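>
> (Disconnect/reconnect means the usual per-resource commands, roughly:
>
>   drbdadm disconnect <resource>
>   drbdadm connect <resource>
>
> neither of which unblocked the devices.)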
>
> Are there any known problems with DRBD on top of an mdadm RAID? I noticed
> that, unlike regular access from userspace, DRBD access to the RAID does
> not seem to throttle the verify/rebuild run down to idle speed (1000 KB/s);
> it keeps going at nearly full speed.
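>
> For reference, the throttle for the userspace case is the usual md speed
> limits, e.g. (md1 is the array from the trace above; the echo is only an
> example of capping the check while the box is loaded):
>
>   cat /proc/sys/dev/raid/speed_limit_min    # default 1000 KB/s, the idle floor
>   cat /proc/sys/dev/raid/speed_limit_max    # default 200000 KB/s
>   cat /sys/block/md1/md/sync_speed          # current resync/check speed of md1
>   echo 50000 > /proc/sys/dev/raid/speed_limit_max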
>
> Thanks, Andreas
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>


More information about the drbd-user mailing list