[DRBD-user] Kernel hung on DRBD / MD RAID

Eduardo Diaz - Gmail ediazrod at gmail.com
Mon Feb 13 19:05:32 CET 2012


Can you describe your configuration to us? Where does the software RAID 1
sit in the stack? If you can, please send the mount points. What filesystem
are you using?
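For reference, the layout details asked about here can usually be collected in one go with something like the following (a minimal sketch: it assumes a root shell on the DRBD node and simply skips any subsystem that is not present):

```shell
#!/bin/sh
# Dump RAID and DRBD state if the corresponding proc files exist.
for f in /proc/mdstat /proc/drbd; do
  [ -r "$f" ] && { echo "== $f =="; cat "$f"; }
done
# Show the block-device / filesystem / LVM layout with whatever tools are installed.
for cmd in lsblk mount pvs vgs lvs; do
  command -v "$cmd" >/dev/null 2>&1 && { echo "== $cmd =="; "$cmd"; }
done
collected=yes
```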

regards!

On Sun, Feb 12, 2012 at 9:54 AM, Andreas Bauer <ab at voltage.de> wrote:

> Hello,
>
> this morning I had a nasty situation on a Primary/Secondary DRBD cluster.
> First the setup:
>
> Kernel 3.1.0 / DRBD 8.4.11
>
> KVM virtual machines, running on top of
> DRBD (protocol A, sndbuf-size 10M, data-integrity-alg md5, external meta
> data, meta data on different physical disk), running on top of
> LVM2, running on top of
> Software RAID 1
>
> Last night an mdadm verify run started; it was still ongoing this
> morning when this happened:
>
> Feb 12 07:03:27 vm-master kernel: [2009644.546758] INFO: task kvm:21152
> blocked for more than 120 seconds.
> Feb 12 07:03:27 vm-master kernel: [2009644.546813] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 12 07:03:27 vm-master kernel: [2009644.546896] kvm             D
> ffff88067fc92f40     0 21152      1 0x00000000
> Feb 12 07:03:27 vm-master kernel: [2009644.546901]  ffff8806633a4fa0
> 0000000000000082 ffff880600000000 ffff8806668ec0c0
> Feb 12 07:03:27 vm-master kernel: [2009644.546904]  0000000000012f40
> ffff88017ae01fd8 ffff88017ae01fd8 ffff8806633a4fa0
> Feb 12 07:03:27 vm-master kernel: [2009644.546908]  ffff88017ae01fd8
> 0000000181070175 0000000000000046 ffff8806633225c0
> Feb 12 07:03:27 vm-master kernel: [2009644.546911] Call Trace:
> Feb 12 07:03:27 vm-master kernel: [2009644.546925]  [<ffffffffa00939d1>] ?
> wait_barrier+0x87/0xc0 [raid1]
> Feb 12 07:03:27 vm-master kernel: [2009644.546931]  [<ffffffff8103e3ca>] ?
> try_to_wake_up+0x192/0x192
> Feb 12 07:03:27 vm-master kernel: [2009644.546935]  [<ffffffffa0096831>] ?
> make_request+0x111/0x9c8 [raid1]
> Feb 12 07:03:27 vm-master kernel: [2009644.546941]  [<ffffffffa0102c76>] ?
> clone_bio+0x43/0xcb [dm_mod]
> Feb 12 07:03:27 vm-master kernel: [2009644.546946]  [<ffffffffa010386f>] ?
> __split_and_process_bio+0x4f4/0x506 [dm_mod]
> Feb 12 07:03:27 vm-master kernel: [2009644.546952]  [<ffffffffa00edd93>] ?
> md_make_request+0xce/0x1c3 [md_mod]
> Feb 12 07:03:27 vm-master kernel: [2009644.546956]  [<ffffffff81188919>] ?
> generic_make_request+0x270/0x2ea
> Feb 12 07:03:27 vm-master kernel: [2009644.546960]  [<ffffffff81188a66>] ?
> submit_bio+0xd3/0xf1
> Feb 12 07:03:27 vm-master kernel: [2009644.546964]  [<ffffffff81119181>] ?
> __bio_add_page.part.12+0x135/0x1ed
> Feb 12 07:03:27 vm-master kernel: [2009644.546967]  [<ffffffff8111bbc1>] ?
> dio_bio_submit+0x6c/0x8a
> Feb 12 07:03:27 vm-master kernel: [2009644.546970]  [<ffffffff8111becd>] ?
> dio_send_cur_page+0x6e/0x91
> Feb 12 07:03:27 vm-master kernel: [2009644.546972]  [<ffffffff8111bfa3>] ?
> submit_page_section+0xb3/0x11a
> Feb 12 07:03:27 vm-master kernel: [2009644.546975]  [<ffffffff8111c7d8>] ?
> __blockdev_direct_IO+0x68a/0x995
> Feb 12 07:03:27 vm-master kernel: [2009644.546978]  [<ffffffff8111a836>] ?
> blkdev_direct_IO+0x4e/0x53
> Feb 12 07:03:27 vm-master kernel: [2009644.546981]  [<ffffffff8111ab5b>] ?
> blkdev_get_block+0x5b/0x5b
> Feb 12 07:03:27 vm-master kernel: [2009644.546985]  [<ffffffff810b0fdb>] ?
> generic_file_direct_write+0xdc/0x146
> Feb 12 07:03:27 vm-master kernel: [2009644.546987]  [<ffffffff810b11d9>] ?
> __generic_file_aio_write+0x194/0x278
> Feb 12 07:03:27 vm-master kernel: [2009644.546992]  [<ffffffff810131f1>] ?
> paravirt_read_tsc+0x5/0x8
> Feb 12 07:03:27 vm-master kernel: [2009644.546995]  [<ffffffff810f4501>] ?
> rw_copy_check_uvector+0x48/0xf8
> Feb 12 07:03:27 vm-master kernel: [2009644.546998]  [<ffffffff8111ac12>] ?
> bd_may_claim+0x2e/0x2e
> Feb 12 07:03:27 vm-master kernel: [2009644.547000]  [<ffffffff8111ac31>] ?
> blkdev_aio_write+0x1f/0x61
> Feb 12 07:03:27 vm-master kernel: [2009644.547003]  [<ffffffff8111ac12>] ?
> bd_may_claim+0x2e/0x2e
> Feb 12 07:03:27 vm-master kernel: [2009644.547005]  [<ffffffff811249e8>] ?
> aio_rw_vect_retry+0x70/0x18e
> Feb 12 07:03:27 vm-master kernel: [2009644.547008]  [<ffffffff81124978>] ?
> lookup_ioctx+0x53/0x53
> Feb 12 07:03:27 vm-master kernel: [2009644.547010]  [<ffffffff811253fe>] ?
> aio_run_iocb+0x70/0x11b
> Feb 12 07:03:27 vm-master kernel: [2009644.547013]  [<ffffffff8112645b>] ?
> do_io_submit+0x442/0x4d7
> Feb 12 07:03:27 vm-master kernel: [2009644.547017]  [<ffffffff81332e12>] ?
> system_call_fastpath+0x16/0x1b
>
> Feb 12 07:03:27 vm-master kernel: [2009644.547265] INFO: task
> md1_resync:15384 blocked for more than 120 seconds.
> Feb 12 07:03:27 vm-master kernel: [2009644.547319] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 12 07:03:27 vm-master kernel: [2009644.547411] md1_resync      D
> ffff88067fc12f40     0 15384      2 0x00000000
> Feb 12 07:03:27 vm-master kernel: [2009644.547414]  ffff880537dc0ee0
> 0000000000000046 0000000000000000 ffffffff8160d020
> Feb 12 07:03:27 vm-master kernel: [2009644.547418]  0000000000012f40
> ffff8801212affd8 ffff8801212affd8 ffff880537dc0ee0
> Feb 12 07:03:27 vm-master kernel: [2009644.547421]  0000000000011210
> ffffffff81070175 0000000000000046 ffff8806633225c0
> Feb 12 07:03:27 vm-master kernel: [2009644.547424] Call Trace:
> Feb 12 07:03:27 vm-master kernel: [2009644.547429]  [<ffffffff81070175>] ?
> arch_local_irq_save+0x11/0x17
> Feb 12 07:03:27 vm-master kernel: [2009644.547433]  [<ffffffffa0093917>] ?
> raise_barrier+0x11a/0x14d [raid1]
> Feb 12 07:03:27 vm-master kernel: [2009644.547436]  [<ffffffff8103e3ca>] ?
> try_to_wake_up+0x192/0x192
> Feb 12 07:03:27 vm-master kernel: [2009644.547440]  [<ffffffffa0095fc7>] ?
> sync_request+0x192/0x70b [raid1]
> Feb 12 07:03:27 vm-master kernel: [2009644.547446]  [<ffffffffa00f147c>] ?
> md_do_sync+0x760/0xb64 [md_mod]
> Feb 12 07:03:27 vm-master kernel: [2009644.547450]  [<ffffffff8103493d>] ?
> set_task_rq+0x23/0x35
> Feb 12 07:03:27 vm-master kernel: [2009644.547454]  [<ffffffff8105ec6b>] ?
> add_wait_queue+0x3c/0x3c
> Feb 12 07:03:27 vm-master kernel: [2009644.547459]  [<ffffffffa00ee28a>] ?
> md_thread+0x101/0x11f [md_mod]
> Feb 12 07:03:27 vm-master kernel: [2009644.547464]  [<ffffffffa00ee189>] ?
> md_rdev_init+0xea/0xea [md_mod]
> Feb 12 07:03:27 vm-master kernel: [2009644.547467]  [<ffffffff8105e625>] ?
> kthread+0x76/0x7e
> Feb 12 07:03:27 vm-master kernel: [2009644.547470]  [<ffffffff81334f74>] ?
> kernel_thread_helper+0x4/0x10
> Feb 12 07:03:27 vm-master kernel: [2009644.547473]  [<ffffffff8105e5af>] ?
> kthread_worker_fn+0x139/0x139
> Feb 12 07:03:27 vm-master kernel: [2009644.547475]  [<ffffffff81334f70>] ?
> gs_change+0x13/0x13
>
> Also, at 06:50 one of the virtual machines started to generate heavy I/O
> load (a backup job). This, together with the resync run, probably set the
> scene for the hang.
>
> Afterwards all DRBD devices were blocking, i.e. all virtual machines
> running on these devices were unresponsive and had to be shut down "cold".
>
> I also tried to disconnect and reconnect the DRBD devices to the peer, but
> to no avail. The server had to be restarted.
>
> Are there any known problems with DRBD on top of an mdadm RAID? I noticed
> that, unlike regular access from userspace, DRBD access to the RAID does
> not seem to throttle the verify/rebuild run down to idle speed (1000K/s);
> instead it keeps going at nearly full speed.
>
> Thanks, Andreas
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>
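On the throttling question: md exposes both system-wide and per-array resync speed limits, and the "idle speed" mentioned above is the system-wide minimum (speed_limit_min, 1000 KiB/s by default). Independent of why md's heuristic does not slow down for DRBD-originated I/O here, the resync can be capped manually while DRBD traffic is heavy. A sketch (the array name md1 is an assumption; substitute the md device DRBD actually sits on):

```shell
#!/bin/sh
# Hypothetical array name; replace with the md device under your DRBD/LVM stack.
MD=md1
# Show the resync throttle values (KiB/s): speed_limit_min is the "idle" floor,
# speed_limit_max is the ceiling; sync_speed_max is the per-array override.
for f in /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max \
         "/sys/block/$MD/md/sync_speed_max"; do
  [ -r "$f" ] && printf '%s = %s\n' "$f" "$(cat "$f")"
done
# To cap the resync of this one array during heavy I/O (requires root):
#   echo 10000 > /sys/block/$MD/md/sync_speed_max
# To hand control back to the system-wide limits afterwards:
#   echo system > /sys/block/$MD/md/sync_speed_max
true  # keep the exit status clean on systems where the sysfs paths are absent
```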
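To make the stack described at the top of Andreas' mail concrete, a drbd.conf resource along these lines would match it. This is an illustrative sketch only: the resource name, peer hostname, devices, and addresses are hypothetical, not taken from the mail; only the options in parentheses above (protocol A, sndbuf-size 10M, data-integrity-alg md5, external meta data) come from the report.

```
resource vm0 {
  protocol A;                      # asynchronous replication, as described
  net {
    sndbuf-size        10M;
    data-integrity-alg md5;
  }
  on vm-master {
    device    /dev/drbd0;
    disk      /dev/vg0/vm0;        # LV on top of the md RAID 1
    meta-disk /dev/sdc1[0];        # external meta data on a separate physical disk
    address   192.168.1.1:7788;
  }
  on vm-slave {                    # hypothetical peer hostname
    device    /dev/drbd0;
    disk      /dev/vg0/vm0;
    meta-disk /dev/sdc1[0];
    address   192.168.1.2:7788;
  }
}
```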

