Can you describe the configuration to us... where is the software RAID 1? If you can, send us the mount points. What filesystem?

Regards!

On Sun, Feb 12, 2012 at 9:54 AM, Andreas Bauer <ab@voltage.de> wrote:
Hello,

this morning I had a nasty situation on a Primary/Secondary DRBD cluster. First the setup:

Kernel 3.1.0 / DRBD 8.4.11

KVM virtual machines, running on top of
DRBD (protocol A, sndbuf-size 10M, data-integrity-alg md5, external meta data, meta data on a different physical disk), running on top of
LVM2, running on top of
Software RAID 1

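(A sketch of the relevant DRBD resource configuration; resource, host, device and address values below are placeholders, not the real ones:)

    # /etc/drbd.d/r0.res (hypothetical name)
    resource r0 {
      net {
        protocol           A;      # asynchronous replication
        sndbuf-size        10M;
        data-integrity-alg md5;
      }
      on vm-master {
        device    /dev/drbd0;
        disk      /dev/vg0/vm0;    # LV on top of the md RAID 1
        meta-disk /dev/sdc1[0];    # external meta data on a different physical disk
        address   192.168.0.1:7788;
      }
      on vm-peer {
        device    /dev/drbd0;
        disk      /dev/vg0/vm0;
        meta-disk /dev/sdc1[0];
        address   192.168.0.2:7788;
      }
    }
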
Last night an mdadm verify run started and was still ongoing this morning when this happened:

Feb 12 07:03:27 vm-master kernel: [2009644.546758] INFO: task kvm:21152 blocked for more than 120 seconds.
Feb 12 07:03:27 vm-master kernel: [2009644.546813] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 12 07:03:27 vm-master kernel: [2009644.546896] kvm D ffff88067fc92f40 0 21152 1 0x00000000
Feb 12 07:03:27 vm-master kernel: [2009644.546901] ffff8806633a4fa0 0000000000000082 ffff880600000000 ffff8806668ec0c0
Feb 12 07:03:27 vm-master kernel: [2009644.546904] 0000000000012f40 ffff88017ae01fd8 ffff88017ae01fd8 ffff8806633a4fa0
Feb 12 07:03:27 vm-master kernel: [2009644.546908] ffff88017ae01fd8 0000000181070175 0000000000000046 ffff8806633225c0
Feb 12 07:03:27 vm-master kernel: [2009644.546911] Call Trace:
Feb 12 07:03:27 vm-master kernel: [2009644.546925] [<ffffffffa00939d1>] ? wait_barrier+0x87/0xc0 [raid1]
Feb 12 07:03:27 vm-master kernel: [2009644.546931] [<ffffffff8103e3ca>] ? try_to_wake_up+0x192/0x192
Feb 12 07:03:27 vm-master kernel: [2009644.546935] [<ffffffffa0096831>] ? make_request+0x111/0x9c8 [raid1]
Feb 12 07:03:27 vm-master kernel: [2009644.546941] [<ffffffffa0102c76>] ? clone_bio+0x43/0xcb [dm_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.546946] [<ffffffffa010386f>] ? __split_and_process_bio+0x4f4/0x506 [dm_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.546952] [<ffffffffa00edd93>] ? md_make_request+0xce/0x1c3 [md_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.546956] [<ffffffff81188919>] ? generic_make_request+0x270/0x2ea
Feb 12 07:03:27 vm-master kernel: [2009644.546960] [<ffffffff81188a66>] ? submit_bio+0xd3/0xf1
Feb 12 07:03:27 vm-master kernel: [2009644.546964] [<ffffffff81119181>] ? __bio_add_page.part.12+0x135/0x1ed
Feb 12 07:03:27 vm-master kernel: [2009644.546967] [<ffffffff8111bbc1>] ? dio_bio_submit+0x6c/0x8a
Feb 12 07:03:27 vm-master kernel: [2009644.546970] [<ffffffff8111becd>] ? dio_send_cur_page+0x6e/0x91
Feb 12 07:03:27 vm-master kernel: [2009644.546972] [<ffffffff8111bfa3>] ? submit_page_section+0xb3/0x11a
Feb 12 07:03:27 vm-master kernel: [2009644.546975] [<ffffffff8111c7d8>] ? __blockdev_direct_IO+0x68a/0x995
Feb 12 07:03:27 vm-master kernel: [2009644.546978] [<ffffffff8111a836>] ? blkdev_direct_IO+0x4e/0x53
Feb 12 07:03:27 vm-master kernel: [2009644.546981] [<ffffffff8111ab5b>] ? blkdev_get_block+0x5b/0x5b
Feb 12 07:03:27 vm-master kernel: [2009644.546985] [<ffffffff810b0fdb>] ? generic_file_direct_write+0xdc/0x146
Feb 12 07:03:27 vm-master kernel: [2009644.546987] [<ffffffff810b11d9>] ? __generic_file_aio_write+0x194/0x278
Feb 12 07:03:27 vm-master kernel: [2009644.546992] [<ffffffff810131f1>] ? paravirt_read_tsc+0x5/0x8
Feb 12 07:03:27 vm-master kernel: [2009644.546995] [<ffffffff810f4501>] ? rw_copy_check_uvector+0x48/0xf8
Feb 12 07:03:27 vm-master kernel: [2009644.546998] [<ffffffff8111ac12>] ? bd_may_claim+0x2e/0x2e
Feb 12 07:03:27 vm-master kernel: [2009644.547000] [<ffffffff8111ac31>] ? blkdev_aio_write+0x1f/0x61
Feb 12 07:03:27 vm-master kernel: [2009644.547003] [<ffffffff8111ac12>] ? bd_may_claim+0x2e/0x2e
Feb 12 07:03:27 vm-master kernel: [2009644.547005] [<ffffffff811249e8>] ? aio_rw_vect_retry+0x70/0x18e
Feb 12 07:03:27 vm-master kernel: [2009644.547008] [<ffffffff81124978>] ? lookup_ioctx+0x53/0x53
Feb 12 07:03:27 vm-master kernel: [2009644.547010] [<ffffffff811253fe>] ? aio_run_iocb+0x70/0x11b
Feb 12 07:03:27 vm-master kernel: [2009644.547013] [<ffffffff8112645b>] ? do_io_submit+0x442/0x4d7
Feb 12 07:03:27 vm-master kernel: [2009644.547017] [<ffffffff81332e12>] ? system_call_fastpath+0x16/0x1b

Feb 12 07:03:27 vm-master kernel: [2009644.547265] INFO: task md1_resync:15384 blocked for more than 120 seconds.
Feb 12 07:03:27 vm-master kernel: [2009644.547319] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 12 07:03:27 vm-master kernel: [2009644.547411] md1_resync D ffff88067fc12f40 0 15384 2 0x00000000
Feb 12 07:03:27 vm-master kernel: [2009644.547414] ffff880537dc0ee0 0000000000000046 0000000000000000 ffffffff8160d020
Feb 12 07:03:27 vm-master kernel: [2009644.547418] 0000000000012f40 ffff8801212affd8 ffff8801212affd8 ffff880537dc0ee0
Feb 12 07:03:27 vm-master kernel: [2009644.547421] 0000000000011210 ffffffff81070175 0000000000000046 ffff8806633225c0
Feb 12 07:03:27 vm-master kernel: [2009644.547424] Call Trace:
Feb 12 07:03:27 vm-master kernel: [2009644.547429] [<ffffffff81070175>] ? arch_local_irq_save+0x11/0x17
Feb 12 07:03:27 vm-master kernel: [2009644.547433] [<ffffffffa0093917>] ? raise_barrier+0x11a/0x14d [raid1]
Feb 12 07:03:27 vm-master kernel: [2009644.547436] [<ffffffff8103e3ca>] ? try_to_wake_up+0x192/0x192
Feb 12 07:03:27 vm-master kernel: [2009644.547440] [<ffffffffa0095fc7>] ? sync_request+0x192/0x70b [raid1]
Feb 12 07:03:27 vm-master kernel: [2009644.547446] [<ffffffffa00f147c>] ? md_do_sync+0x760/0xb64 [md_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.547450] [<ffffffff8103493d>] ? set_task_rq+0x23/0x35
Feb 12 07:03:27 vm-master kernel: [2009644.547454] [<ffffffff8105ec6b>] ? add_wait_queue+0x3c/0x3c
Feb 12 07:03:27 vm-master kernel: [2009644.547459] [<ffffffffa00ee28a>] ? md_thread+0x101/0x11f [md_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.547464] [<ffffffffa00ee189>] ? md_rdev_init+0xea/0xea [md_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.547467] [<ffffffff8105e625>] ? kthread+0x76/0x7e
Feb 12 07:03:27 vm-master kernel: [2009644.547470] [<ffffffff81334f74>] ? kernel_thread_helper+0x4/0x10
Feb 12 07:03:27 vm-master kernel: [2009644.547473] [<ffffffff8105e5af>] ? kthread_worker_fn+0x139/0x139
Feb 12 07:03:27 vm-master kernel: [2009644.547475] [<ffffffff81334f70>] ? gs_change+0x13/0x13

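(For reference: the verify run is md's "check" sync action on md1; it is the kind of run one would start and monitor roughly like this:)

    echo check > /sys/block/md1/md/sync_action   # kick off a verify pass
    cat /sys/block/md1/md/sync_action            # reports "check" while it runs
    cat /proc/mdstat                             # shows the resync/check progress bar
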
Also, at 06:50 one of the virtual machines started to generate heavy I/O load (a backup job). So this, together with the resync run, probably set the scene for this to happen: in the traces above, the kvm task is stuck in raid1's wait_barrier() while md1_resync is stuck in raise_barrier().

Afterwards all DRBD devices were "blocking", i.e. all virtual machines running on these devices were unresponsive and had to be shut down "cold".

I also tried disconnecting the DRBD devices from the peer and reconnecting them, but to no avail. The server had to be restarted.

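(Roughly what I tried per device; the resource name r0 is a placeholder:)

    drbdadm disconnect r0    # tear down the replication link to the peer
    drbdadm connect r0       # re-establish it
    cat /proc/drbd           # check the state; the devices stayed blocked regardless
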
Are there any known problems with DRBD on top of an mdadm RAID? I noticed that, unlike regular access from userspace, DRBD's access to the RAID does not seem to throttle the verify/rebuild run down to idle speed (1000K/s); instead it keeps going at nearly full speed?!

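(The 1000K/s idle speed is md's dev.raid.speed_limit_min default; as a workaround, assuming the numbers fit the hardware, one could clamp the check run while the VMs are busy:)

    cat /proc/sys/dev/raid/speed_limit_min    # default 1000 (KB/s), used when the array sees other I/O
    cat /proc/sys/dev/raid/speed_limit_max    # default 200000 (KB/s), used when the array looks idle
    sysctl -w dev.raid.speed_limit_max=10000  # cap the verify/rebuild at ~10 MB/s
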
Thanks, Andreas
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user