[DRBD-user] Kernel hung on DRBD / MD RAID

Andreas Bauer ab at voltage.de
Sun Feb 12 09:54:56 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello,

This morning I had a nasty situation on a Primary/Secondary DRBD cluster. First, the setup:

Kernel 3.1.0 / DRBD 8.4.11

KVM virtual machines, running on top of
DRBD (protocol A, sndbuf-size 10M, data-integrity-alg md5, external meta data, meta data on different physical disk), running on top of
LVM2, running on top of
Software RAID 1
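
For reference, the DRBD resource definitions look roughly like the following; the resource name, devices, second hostname and addresses below are placeholders, not the real configuration:

    resource vm1 {
      net {
        protocol           A;
        sndbuf-size        10M;
        data-integrity-alg md5;
      }
      on vm-master {
        device    /dev/drbd0;
        disk      /dev/vg0/vm1;      # LV on top of the RAID 1 array
        meta-disk /dev/sdc1[0];      # external meta data on a different physical disk
        address   10.0.0.1:7788;
      }
      on vm-peer {
        device    /dev/drbd0;
        disk      /dev/vg0/vm1;
        meta-disk /dev/sdc1[0];
        address   10.0.0.2:7788;
      }
    }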

Last night an mdadm verify run was started, and it was still ongoing this morning when this happened:

Feb 12 07:03:27 vm-master kernel: [2009644.546758] INFO: task kvm:21152 blocked for more than 120 seconds.
Feb 12 07:03:27 vm-master kernel: [2009644.546813] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 12 07:03:27 vm-master kernel: [2009644.546896] kvm             D ffff88067fc92f40     0 21152      1 0x00000000
Feb 12 07:03:27 vm-master kernel: [2009644.546901]  ffff8806633a4fa0 0000000000000082 ffff880600000000 ffff8806668ec0c0
Feb 12 07:03:27 vm-master kernel: [2009644.546904]  0000000000012f40 ffff88017ae01fd8 ffff88017ae01fd8 ffff8806633a4fa0
Feb 12 07:03:27 vm-master kernel: [2009644.546908]  ffff88017ae01fd8 0000000181070175 0000000000000046 ffff8806633225c0
Feb 12 07:03:27 vm-master kernel: [2009644.546911] Call Trace:
Feb 12 07:03:27 vm-master kernel: [2009644.546925]  [<ffffffffa00939d1>] ? wait_barrier+0x87/0xc0 [raid1]
Feb 12 07:03:27 vm-master kernel: [2009644.546931]  [<ffffffff8103e3ca>] ? try_to_wake_up+0x192/0x192
Feb 12 07:03:27 vm-master kernel: [2009644.546935]  [<ffffffffa0096831>] ? make_request+0x111/0x9c8 [raid1]
Feb 12 07:03:27 vm-master kernel: [2009644.546941]  [<ffffffffa0102c76>] ? clone_bio+0x43/0xcb [dm_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.546946]  [<ffffffffa010386f>] ? __split_and_process_bio+0x4f4/0x506 [dm_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.546952]  [<ffffffffa00edd93>] ? md_make_request+0xce/0x1c3 [md_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.546956]  [<ffffffff81188919>] ? generic_make_request+0x270/0x2ea
Feb 12 07:03:27 vm-master kernel: [2009644.546960]  [<ffffffff81188a66>] ? submit_bio+0xd3/0xf1
Feb 12 07:03:27 vm-master kernel: [2009644.546964]  [<ffffffff81119181>] ? __bio_add_page.part.12+0x135/0x1ed
Feb 12 07:03:27 vm-master kernel: [2009644.546967]  [<ffffffff8111bbc1>] ? dio_bio_submit+0x6c/0x8a
Feb 12 07:03:27 vm-master kernel: [2009644.546970]  [<ffffffff8111becd>] ? dio_send_cur_page+0x6e/0x91
Feb 12 07:03:27 vm-master kernel: [2009644.546972]  [<ffffffff8111bfa3>] ? submit_page_section+0xb3/0x11a
Feb 12 07:03:27 vm-master kernel: [2009644.546975]  [<ffffffff8111c7d8>] ? __blockdev_direct_IO+0x68a/0x995
Feb 12 07:03:27 vm-master kernel: [2009644.546978]  [<ffffffff8111a836>] ? blkdev_direct_IO+0x4e/0x53
Feb 12 07:03:27 vm-master kernel: [2009644.546981]  [<ffffffff8111ab5b>] ? blkdev_get_block+0x5b/0x5b
Feb 12 07:03:27 vm-master kernel: [2009644.546985]  [<ffffffff810b0fdb>] ? generic_file_direct_write+0xdc/0x146
Feb 12 07:03:27 vm-master kernel: [2009644.546987]  [<ffffffff810b11d9>] ? __generic_file_aio_write+0x194/0x278
Feb 12 07:03:27 vm-master kernel: [2009644.546992]  [<ffffffff810131f1>] ? paravirt_read_tsc+0x5/0x8
Feb 12 07:03:27 vm-master kernel: [2009644.546995]  [<ffffffff810f4501>] ? rw_copy_check_uvector+0x48/0xf8
Feb 12 07:03:27 vm-master kernel: [2009644.546998]  [<ffffffff8111ac12>] ? bd_may_claim+0x2e/0x2e
Feb 12 07:03:27 vm-master kernel: [2009644.547000]  [<ffffffff8111ac31>] ? blkdev_aio_write+0x1f/0x61
Feb 12 07:03:27 vm-master kernel: [2009644.547003]  [<ffffffff8111ac12>] ? bd_may_claim+0x2e/0x2e
Feb 12 07:03:27 vm-master kernel: [2009644.547005]  [<ffffffff811249e8>] ? aio_rw_vect_retry+0x70/0x18e
Feb 12 07:03:27 vm-master kernel: [2009644.547008]  [<ffffffff81124978>] ? lookup_ioctx+0x53/0x53
Feb 12 07:03:27 vm-master kernel: [2009644.547010]  [<ffffffff811253fe>] ? aio_run_iocb+0x70/0x11b
Feb 12 07:03:27 vm-master kernel: [2009644.547013]  [<ffffffff8112645b>] ? do_io_submit+0x442/0x4d7
Feb 12 07:03:27 vm-master kernel: [2009644.547017]  [<ffffffff81332e12>] ? system_call_fastpath+0x16/0x1b

Feb 12 07:03:27 vm-master kernel: [2009644.547265] INFO: task md1_resync:15384 blocked for more than 120 seconds.
Feb 12 07:03:27 vm-master kernel: [2009644.547319] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 12 07:03:27 vm-master kernel: [2009644.547411] md1_resync      D ffff88067fc12f40     0 15384      2 0x00000000
Feb 12 07:03:27 vm-master kernel: [2009644.547414]  ffff880537dc0ee0 0000000000000046 0000000000000000 ffffffff8160d020
Feb 12 07:03:27 vm-master kernel: [2009644.547418]  0000000000012f40 ffff8801212affd8 ffff8801212affd8 ffff880537dc0ee0
Feb 12 07:03:27 vm-master kernel: [2009644.547421]  0000000000011210 ffffffff81070175 0000000000000046 ffff8806633225c0
Feb 12 07:03:27 vm-master kernel: [2009644.547424] Call Trace:
Feb 12 07:03:27 vm-master kernel: [2009644.547429]  [<ffffffff81070175>] ? arch_local_irq_save+0x11/0x17
Feb 12 07:03:27 vm-master kernel: [2009644.547433]  [<ffffffffa0093917>] ? raise_barrier+0x11a/0x14d [raid1]
Feb 12 07:03:27 vm-master kernel: [2009644.547436]  [<ffffffff8103e3ca>] ? try_to_wake_up+0x192/0x192 
Feb 12 07:03:27 vm-master kernel: [2009644.547440]  [<ffffffffa0095fc7>] ? sync_request+0x192/0x70b [raid1]
Feb 12 07:03:27 vm-master kernel: [2009644.547446]  [<ffffffffa00f147c>] ? md_do_sync+0x760/0xb64 [md_mod] 
Feb 12 07:03:27 vm-master kernel: [2009644.547450]  [<ffffffff8103493d>] ? set_task_rq+0x23/0x35
Feb 12 07:03:27 vm-master kernel: [2009644.547454]  [<ffffffff8105ec6b>] ? add_wait_queue+0x3c/0x3c  
Feb 12 07:03:27 vm-master kernel: [2009644.547459]  [<ffffffffa00ee28a>] ? md_thread+0x101/0x11f [md_mod]  
Feb 12 07:03:27 vm-master kernel: [2009644.547464]  [<ffffffffa00ee189>] ? md_rdev_init+0xea/0xea [md_mod]
Feb 12 07:03:27 vm-master kernel: [2009644.547467]  [<ffffffff8105e625>] ? kthread+0x76/0x7e
Feb 12 07:03:27 vm-master kernel: [2009644.547470]  [<ffffffff81334f74>] ? kernel_thread_helper+0x4/0x10
Feb 12 07:03:27 vm-master kernel: [2009644.547473]  [<ffffffff8105e5af>] ? kthread_worker_fn+0x139/0x139  
Feb 12 07:03:27 vm-master kernel: [2009644.547475]  [<ffffffff81334f70>] ? gs_change+0x13/0x13

Also, at 06:50 one of the virtual machines started to generate heavy I/O load (a backup job). This, together with the resync run, probably set the scene for what happened.

Afterwards all DRBD devices were "blocking", i.e. all virtual machines running on these devices were unresponsive and had to be shut down "cold".

I also tried disconnecting and reconnecting the DRBD devices to the peer, but to no avail. The server had to be restarted.
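
For the record, the disconnect/reconnect was roughly the following, per resource ("r0" standing in for each resource name):

    drbdadm disconnect r0
    drbdadm connect r0
    cat /proc/drbd              # check connection and disk state afterwards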

Are there any known problems with DRBD on top of an mdadm RAID? I noticed that, unlike regular access from userspace, DRBD access to the RAID does not seem to throttle the verify/rebuild run down to idle speed (1000K/s); instead the resync keeps going at nearly full speed.
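
To be explicit about which throttling I mean (md1 as in the log above; the sysctl defaults are 1000 / 200000 KB/s):

    cat /proc/sys/dev/raid/speed_limit_min   # speed the resync is throttled to under competing I/O
    cat /proc/sys/dev/raid/speed_limit_max   # resync speed when the array is otherwise idle
    cat /sys/block/md1/md/sync_speed         # current resync speed of md1
    echo 50000 > /sys/block/md1/md/sync_speed_max   # per-array cap; echo "system" to return to the global limit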

Thanks, Andreas


