Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,

this morning I had a nasty situation on a Primary/Secondary DRBD cluster.

First the setup:

Kernel 3.1.0 / DRBD 8.4.11

KVM virtual machines, running on top of DRBD (protocol A, sndbuf-size 10M, data-integrity-alg md5, external meta data on a different physical disk), running on top of LVM2, running on top of software RAID 1. (A rough sketch of the resource config is at the end of this mail.)

Last night an MDADM verify run was started, and it was still ongoing this morning when this happened:

Feb 12 07:03:27 vm-master kernel: [2009644.546758] INFO: task kvm:21152 blocked for more than 120 seconds.
Feb 12 07:03:27 vm-master kernel: [2009644.546813] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 12 07:03:27 vm-master kernel: [2009644.546896] kvm D ffff88067fc92f40 0 21152 1 0x00000000
Feb 12 07:03:27 vm-master kernel: [2009644.546901] ffff8806633a4fa0 0000000000000082 ffff880600000000 ffff8806668ec0c0
Feb 12 07:03:27 vm-master kernel: [2009644.546904] 0000000000012f40 ffff88017ae01fd8 ffff88017ae01fd8 ffff8806633a4fa0
Feb 12 07:03:27 vm-master kernel: [2009644.546908] ffff88017ae01fd8 0000000181070175 0000000000000046 ffff8806633225c0
Feb 12 07:03:27 vm-master kernel: [2009644.546911] Call Trace:
Feb 12 07:03:27 vm-master kernel: [2009644.546925] [<ffffffffa00939d1>] ? wait_barrier+0x87/0xc0 [raid1]
Feb 12 07:03:27 vm-master kernel: [2009644.546931] [<ffffffff8103e3ca>] ? try_to_wake_up+0x192/0x192
Feb 12 07:03:27 vm-master kernel: [2009644.546935] [<ffffffffa0096831>] ? make_request+0x111/0x9c8 [raid1]
Feb 12 07:03:27 vm-master kernel: [2009644.546941] [<ffffffffa0102c76>] ? clone_bio+0x43/0xcb [dm_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.546946] [<ffffffffa010386f>] ? __split_and_process_bio+0x4f4/0x506 [dm_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.546952] [<ffffffffa00edd93>] ? md_make_request+0xce/0x1c3 [md_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.546956] [<ffffffff81188919>] ? generic_make_request+0x270/0x2ea
Feb 12 07:03:27 vm-master kernel: [2009644.546960] [<ffffffff81188a66>] ? submit_bio+0xd3/0xf1
Feb 12 07:03:27 vm-master kernel: [2009644.546964] [<ffffffff81119181>] ? __bio_add_page.part.12+0x135/0x1ed
Feb 12 07:03:27 vm-master kernel: [2009644.546967] [<ffffffff8111bbc1>] ? dio_bio_submit+0x6c/0x8a
Feb 12 07:03:27 vm-master kernel: [2009644.546970] [<ffffffff8111becd>] ? dio_send_cur_page+0x6e/0x91
Feb 12 07:03:27 vm-master kernel: [2009644.546972] [<ffffffff8111bfa3>] ? submit_page_section+0xb3/0x11a
Feb 12 07:03:27 vm-master kernel: [2009644.546975] [<ffffffff8111c7d8>] ? __blockdev_direct_IO+0x68a/0x995
Feb 12 07:03:27 vm-master kernel: [2009644.546978] [<ffffffff8111a836>] ? blkdev_direct_IO+0x4e/0x53
Feb 12 07:03:27 vm-master kernel: [2009644.546981] [<ffffffff8111ab5b>] ? blkdev_get_block+0x5b/0x5b
Feb 12 07:03:27 vm-master kernel: [2009644.546985] [<ffffffff810b0fdb>] ? generic_file_direct_write+0xdc/0x146
Feb 12 07:03:27 vm-master kernel: [2009644.546987] [<ffffffff810b11d9>] ? __generic_file_aio_write+0x194/0x278
Feb 12 07:03:27 vm-master kernel: [2009644.546992] [<ffffffff810131f1>] ? paravirt_read_tsc+0x5/0x8
Feb 12 07:03:27 vm-master kernel: [2009644.546995] [<ffffffff810f4501>] ? rw_copy_check_uvector+0x48/0xf8
Feb 12 07:03:27 vm-master kernel: [2009644.546998] [<ffffffff8111ac12>] ? bd_may_claim+0x2e/0x2e
Feb 12 07:03:27 vm-master kernel: [2009644.547000] [<ffffffff8111ac31>] ? blkdev_aio_write+0x1f/0x61
Feb 12 07:03:27 vm-master kernel: [2009644.547003] [<ffffffff8111ac12>] ? bd_may_claim+0x2e/0x2e
Feb 12 07:03:27 vm-master kernel: [2009644.547005] [<ffffffff811249e8>] ? aio_rw_vect_retry+0x70/0x18e
Feb 12 07:03:27 vm-master kernel: [2009644.547008] [<ffffffff81124978>] ? lookup_ioctx+0x53/0x53
Feb 12 07:03:27 vm-master kernel: [2009644.547010] [<ffffffff811253fe>] ? aio_run_iocb+0x70/0x11b
Feb 12 07:03:27 vm-master kernel: [2009644.547013] [<ffffffff8112645b>] ? do_io_submit+0x442/0x4d7
Feb 12 07:03:27 vm-master kernel: [2009644.547017] [<ffffffff81332e12>] ? system_call_fastpath+0x16/0x1b
Feb 12 07:03:27 vm-master kernel: [2009644.547265] INFO: task md1_resync:15384 blocked for more than 120 seconds.
Feb 12 07:03:27 vm-master kernel: [2009644.547319] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 12 07:03:27 vm-master kernel: [2009644.547411] md1_resync D ffff88067fc12f40 0 15384 2 0x00000000
Feb 12 07:03:27 vm-master kernel: [2009644.547414] ffff880537dc0ee0 0000000000000046 0000000000000000 ffffffff8160d020
Feb 12 07:03:27 vm-master kernel: [2009644.547418] 0000000000012f40 ffff8801212affd8 ffff8801212affd8 ffff880537dc0ee0
Feb 12 07:03:27 vm-master kernel: [2009644.547421] 0000000000011210 ffffffff81070175 0000000000000046 ffff8806633225c0
Feb 12 07:03:27 vm-master kernel: [2009644.547424] Call Trace:
Feb 12 07:03:27 vm-master kernel: [2009644.547429] [<ffffffff81070175>] ? arch_local_irq_save+0x11/0x17
Feb 12 07:03:27 vm-master kernel: [2009644.547433] [<ffffffffa0093917>] ? raise_barrier+0x11a/0x14d [raid1]
Feb 12 07:03:27 vm-master kernel: [2009644.547436] [<ffffffff8103e3ca>] ? try_to_wake_up+0x192/0x192
Feb 12 07:03:27 vm-master kernel: [2009644.547440] [<ffffffffa0095fc7>] ? sync_request+0x192/0x70b [raid1]
Feb 12 07:03:27 vm-master kernel: [2009644.547446] [<ffffffffa00f147c>] ? md_do_sync+0x760/0xb64 [md_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.547450] [<ffffffff8103493d>] ? set_task_rq+0x23/0x35
Feb 12 07:03:27 vm-master kernel: [2009644.547454] [<ffffffff8105ec6b>] ? add_wait_queue+0x3c/0x3c
Feb 12 07:03:27 vm-master kernel: [2009644.547459] [<ffffffffa00ee28a>] ? md_thread+0x101/0x11f [md_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.547464] [<ffffffffa00ee189>] ? md_rdev_init+0xea/0xea [md_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.547467] [<ffffffff8105e625>] ? kthread+0x76/0x7e
Feb 12 07:03:27 vm-master kernel: [2009644.547470] [<ffffffff81334f74>] ? kernel_thread_helper+0x4/0x10
Feb 12 07:03:27 vm-master kernel: [2009644.547473] [<ffffffff8105e5af>] ? kthread_worker_fn+0x139/0x139
Feb 12 07:03:27 vm-master kernel: [2009644.547475] [<ffffffff81334f70>] ? gs_change+0x13/0x13

Also, at 06:50 one of the virtual machines started to generate heavy I/O load (a backup job). This, together with the resync run, probably set the scene for this to happen.

Afterwards all DRBD devices were "blocking", i.e. all virtual machines running on these devices were unresponsive and had to be shut down "cold". I also tried to disconnect and reconnect the DRBD devices to the peer, but to no avail. The server had to be restarted.

Are there any known problems with DRBD on top of an MDADM RAID? I noticed that, unlike regular access from userspace, DRBD access to the RAID does not seem to throttle the verify/rebuild run down to idle speed (1000K/s); it keeps going at nearly full speed?!

Thanks,
Andreas
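P.S.: For illustration, a minimal resource config matching the stack described above might look roughly like this. Resource name, hostnames, devices, addresses and the meta-data partition are placeholders, not the real values; only the options (protocol A, sndbuf-size 10M, data-integrity-alg md5, external meta data) are the ones mentioned above:

    resource vm0 {
        net {
            protocol           A;
            sndbuf-size        10M;
            data-integrity-alg md5;
        }

        on vm-master {
            device    /dev/drbd0;
            disk      /dev/vg0/vm0;      # LV on top of the MD RAID 1
            address   192.168.10.1:7788;
            meta-disk /dev/sdc1;         # external meta data on a separate physical disk
        }
        on vm-slave {
            device    /dev/drbd0;
            disk      /dev/vg0/vm0;
            address   192.168.10.2:7788;
            meta-disk /dev/sdc1;
        }
    }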
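P.P.S.: Regarding the throttling question, these are the standard md knobs I was comparing the observed resync speed against. md1 is taken from the md1_resync task in the log above, and the numbers are only examples:

    # system-wide floor/ceiling for md resync/check speed (KiB/s)
    cat /proc/sys/dev/raid/speed_limit_min
    cat /proc/sys/dev/raid/speed_limit_max

    # per-array limits and the current sync speed
    cat /sys/block/md1/md/sync_speed_min
    cat /sys/block/md1/md/sync_speed_max
    cat /sys/block/md1/md/sync_speed

    # temporarily cap the verify/rebuild (value in KiB/s) ...
    echo 5000 > /sys/block/md1/md/sync_speed_max
    # ... or abort the running check entirely
    echo idle > /sys/block/md1/md/sync_action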