[DRBD-user] Kernel hung on DRBD / MD RAID

Eduardo Diaz - Gmail ediazrod at gmail.com
Tue Feb 14 15:33:13 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Maybe it is an error in DRBD; you have a lot of syncs going on in DRBD :) From
what I read, it looks like you built the configuration described here:

http://www.drbd.org/users-guide-8.3/s-nested-lvm.html
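Roughly, that layering is built like this (just a sketch; the LV and resource
names "vm6-disk"/"vm6" are made-up examples, not taken from your setup, and
the resource must already be defined in drbd.conf):

  lvcreate --name vm6-disk --size 50G vg   # carve a backing LV out of the md-backed VG
  drbdadm create-md vm6                    # initialise DRBD metadata for that resource
  drbdadm up vm6                           # attach the backing LV and connect to the peer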

Did you make any configuration changes, or kernel changes?

The failure comes from KVM; did you do any upgrades recently?


On Mon, Feb 13, 2012 at 11:13 PM, Andreas Bauer <ab at voltage.de> wrote:

> > Can you describe the configuration to us? Where is the software RAID1? If
> > you can, send the mount points. What filesystem?
>
> The RAID:
>
> root at vm-master:~# cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 sda1[0] sdb1[1]
>      976759672 blocks super 1.2 [2/2] [UU]
>
> md1 : active raid1 sdc1[0] sdd1[12]
>      976759672 blocks super 1.2 [2/2] [UU]
>
> The LVM:
>
> root at vm-master:~# pvs
>  PV         VG   Fmt  Attr PSize   PFree
>  /dev/md0   vg   lvm2 a--  931,51g 385,30g
>  /dev/md1   vg   lvm2 a--  931,51g 450,25g
>
>
> On top of LVM sits the DRBD, and on top of DRBD is KVM (no filesystem):
>
>    <disk type='file' device='disk'>
>      <source file='/dev/drbd6'/>
>      <driver name="qemu" type="raw" io="native" cache="none" />
>      <target dev='vda' bus='virtio' />
>    </disk>
>
> DRBD:
>
> root at vm-master:~# cat /proc/drbd
> version: 8.3.11 (api:88/proto:86-96)
> srcversion: 21CA73FE6D7D9C67B0C6AB2
>
>  1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>    ns:521268 nr:347036 dw:348112 dr:1007984 al:3 bm:49 lo:0 pe:0 ua:0 ap:0
> ep:1 wo:f oos:0
>  2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>    ns:53248 nr:472 dw:472 dr:54480 al:0 bm:8 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f
> oos:0
>  3: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
>    ns:1933528 nr:0 dw:1413336 dr:744511 al:132 bm:73 lo:0 pe:0 ua:0 ap:0
> ep:1 wo:f oos:0
>  4: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
>    ns:0 nr:0 dw:0 dr:1104 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  5: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
>    ns:0 nr:0 dw:0 dr:1104 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  6: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>    ns:520192 nr:19824040 dw:19824040 dr:520192 al:0 bm:82 lo:0 pe:0 ua:0
> ap:0 ep:1 wo:f oos:0
>  7: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>    ns:520192 nr:10933108 dw:10933108 dr:520192 al:0 bm:76 lo:0 pe:0 ua:0
> ap:0 ep:1 wo:f oos:0
>  8: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>    ns:282624 nr:1518364 dw:1518364 dr:282624 al:0 bm:36 lo:0 pe:0 ua:0
> ap:0 ep:1 wo:f oos:0
>  9: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>    ns:0 nr:2262112 dw:2262112 dr:0 al:0 bm:82 lo:0 pe:0 ua:0 ap:0 ep:1
> wo:f oos:0
>
> 20: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>    ns:0 nr:89151580 dw:243215196 dr:1987664 al:692 bm:14963 lo:0 pe:0 ua:0
> ap:0 ep:1 wo:f oos:0
> 21: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
>    ns:999612 nr:88 dw:479508 dr:2799008 al:330 bm:78 lo:0 pe:0 ua:0 ap:0
> ep:1 wo:f oos:0
> 22: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>    ns:520192 nr:4488702 dw:4488702 dr:520192 al:0 bm:101 lo:0 pe:0 ua:0
> ap:0 ep:1 wo:f oos:0
> 23: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>    ns:24576 nr:123638 dw:123638 dr:24576 al:0 bm:4 lo:0 pe:0 ua:0 ap:0
> ep:1 wo:f oos:0
>
>
> Small correction to the quoted text below: it is of course DRBD 8.3.11 (not 8.4.11).
>
> Again, the error occurred when a RAID verify and an early morning backup in
> one of the virtual machines happened at the same time. As I understand it,
> the error itself just says that a task was blocked for more than 2 minutes.
> The strange thing is that this situation did not recover by itself, but
> completely blocked *ALL* virtual machines.
>
> That means, shortly afterwards all virtual machines were completely
> unresponsive (network, console, ...), and DRBD was unresponsive as well.
> For example "fdisk /dev/drbd20" would block.
>
> On shutdown there was supposedly an error that the drbd module could not
> be unloaded (I was only connected remotely, so I cannot tell for sure).
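
If it hangs like that again, it may be worth dumping the blocked tasks before
rebooting; this is just a suggestion, assuming magic SysRq is available on
your kernel:

  echo 1 > /proc/sys/kernel/sysrq    # enable magic SysRq if it is not already enabled
  echo w > /proc/sysrq-trigger       # log backtraces of all uninterruptible (D-state) tasks
  dmesg | tail -n 200                # the traces end up in the kernel log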
>
> BTW:
>
> root at vm-master:/sys/class/block# for i in * ;
> > do
> >    echo -n $i scheduler: ; cat $i/queue/scheduler
> > done
> dm-0 scheduler:none
> dm-1 scheduler:none
> dm-11 scheduler:none
> dm-14 scheduler:none
> dm-15 scheduler:none
> dm-16 scheduler:none
> dm-17 scheduler:none
> dm-18 scheduler:none
> dm-19 scheduler:none
> dm-2 scheduler:none
> dm-20 scheduler:none
> dm-21 scheduler:none
> dm-22 scheduler:none
> dm-23 scheduler:none
> dm-24 scheduler:none
> dm-25 scheduler:none
> dm-26 scheduler:none
> dm-27 scheduler:none
> dm-28 scheduler:none
> dm-29 scheduler:none
> dm-3 scheduler:none
> dm-30 scheduler:none
> dm-31 scheduler:none
> dm-4 scheduler:none
> dm-5 scheduler:none
> dm-6 scheduler:none
> dm-7 scheduler:none
> dm-8 scheduler:none
> dm-9 scheduler:none
> drbd1 scheduler:none
> drbd2 scheduler:none
> drbd20 scheduler:none
> drbd21 scheduler:none
> drbd22 scheduler:none
> drbd23 scheduler:none
> drbd3 scheduler:none
> drbd4 scheduler:none
> drbd5 scheduler:none
> drbd6 scheduler:none
> drbd7 scheduler:none
> drbd8 scheduler:none
> drbd9 scheduler:none
> loop0 scheduler:none
> loop1 scheduler:none
> loop2 scheduler:none
> loop3 scheduler:none
> loop4 scheduler:none
> loop5 scheduler:none
> loop6 scheduler:none
> loop7 scheduler:none
> md0 scheduler:none
> md1 scheduler:none
> sda scheduler:noop [deadline] cfq
> sdb scheduler:noop [deadline] cfq
> sdc scheduler:noop [deadline] cfq
> sdd scheduler:noop [deadline] cfq
> sr0 scheduler:noop deadline [cfq]
>
> Any ideas what happened here and how to avoid that in the future?
>
> Thanks!
>
> > Kernel 3.1.0 / DRBD 8.4.11
> >
> > KVM virtual machines, running on top of
> > DRBD (protocol A, sndbuf-size 10M, data-integrity-alg md5, external meta
> data,
> > meta data on different physical disk), running on top of
> > LVM2, running on top of
> > Software RAID 1
> >
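One way to double-check which of those options are actually in effect on the
running resources (sketch only; the grep pattern below is just an example) is
to dump the parsed configuration and the loaded module version:

  drbdadm dump all | grep -E 'protocol|sndbuf-size|data-integrity-alg|meta-disk'
  head -n 2 /proc/drbd               # confirms the running DRBD version (8.3.11 here)
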
> > Last night there was an MDADM verify run, which was still ongoing this
> > morning when this happened:
> >
> > Feb 12 07:03:27 vm-master kernel: [2009644.546758] INFO: task kvm:21152
> blocked
> > for more than 120 seconds.
> > Feb 12 07:03:27 vm-master kernel: [2009644.546813] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Feb 12 07:03:27 vm-master kernel: [2009644.546896] kvm             D
> > ffff88067fc92f40     0 21152      1 0x00000000
> > Feb 12 07:03:27 vm-master kernel: [2009644.546901]  ffff8806633a4fa0
> > 0000000000000082 ffff880600000000 ffff8806668ec0c0
> > Feb 12 07:03:27 vm-master kernel: [2009644.546904]  0000000000012f40
> > ffff88017ae01fd8 ffff88017ae01fd8 ffff8806633a4fa0
> > Feb 12 07:03:27 vm-master kernel: [2009644.546908]  ffff88017ae01fd8
> > 0000000181070175 0000000000000046 ffff8806633225c0
> > Feb 12 07:03:27 vm-master kernel: [2009644.546911] Call Trace:
> > Feb 12 07:03:27 vm-master kernel: [2009644.546925]  [<ffffffffa00939d1>]
> ?
> > wait_barrier+0x87/0xc0 [raid1]
> > Feb 12 07:03:27 vm-master kernel: [2009644.546931]  [<ffffffff8103e3ca>]
> ?
> > try_to_wake_up+0x192/0x192
> > Feb 12 07:03:27 vm-master kernel: [2009644.546935]  [<ffffffffa0096831>]
> ?
> > make_request+0x111/0x9c8 [raid1]
> > Feb 12 07:03:27 vm-master kernel: [2009644.546941]  [<ffffffffa0102c76>]
> ?
> > clone_bio+0x43/0xcb [dm_mod]
> > Feb 12 07:03:27 vm-master kernel: [2009644.546946]  [<ffffffffa010386f>]
> ?
> > __split_and_process_bio+0x4f4/0x506 [dm_mod]
> > Feb 12 07:03:27 vm-master kernel: [2009644.546952]  [<ffffffffa00edd93>]
> ?
> > md_make_request+0xce/0x1c3 [md_mod]
> > Feb 12 07:03:27 vm-master kernel: [2009644.546956]  [<ffffffff81188919>]
> ?
> > generic_make_request+0x270/0x2ea
> > Feb 12 07:03:27 vm-master kernel: [2009644.546960]  [<ffffffff81188a66>]
> ?
> > submit_bio+0xd3/0xf1
> > Feb 12 07:03:27 vm-master kernel: [2009644.546964]  [<ffffffff81119181>]
> ?
> > __bio_add_page.part.12+0x135/0x1ed
> > Feb 12 07:03:27 vm-master kernel: [2009644.546967]  [<ffffffff8111bbc1>]
> ?
> > dio_bio_submit+0x6c/0x8a
> > Feb 12 07:03:27 vm-master kernel: [2009644.546970]  [<ffffffff8111becd>]
> ?
> > dio_send_cur_page+0x6e/0x91
> > Feb 12 07:03:27 vm-master kernel: [2009644.546972]  [<ffffffff8111bfa3>]
> ?
> > submit_page_section+0xb3/0x11a
> > Feb 12 07:03:27 vm-master kernel: [2009644.546975]  [<ffffffff8111c7d8>]
> ?
> > __blockdev_direct_IO+0x68a/0x995
> > Feb 12 07:03:27 vm-master kernel: [2009644.546978]  [<ffffffff8111a836>]
> ?
> > blkdev_direct_IO+0x4e/0x53
> > Feb 12 07:03:27 vm-master kernel: [2009644.546981]  [<ffffffff8111ab5b>]
> ?
> > blkdev_get_block+0x5b/0x5b
> > Feb 12 07:03:27 vm-master kernel: [2009644.546985]  [<ffffffff810b0fdb>]
> ?
> > generic_file_direct_write+0xdc/0x146
> > Feb 12 07:03:27 vm-master kernel: [2009644.546987]  [<ffffffff810b11d9>]
> ?
> > __generic_file_aio_write+0x194/0x278
> > Feb 12 07:03:27 vm-master kernel: [2009644.546992]  [<ffffffff810131f1>]
> ?
> > paravirt_read_tsc+0x5/0x8
> > Feb 12 07:03:27 vm-master kernel: [2009644.546995]  [<ffffffff810f4501>]
> ?
> > rw_copy_check_uvector+0x48/0xf8
> > Feb 12 07:03:27 vm-master kernel: [2009644.546998]  [<ffffffff8111ac12>]
> ?
> > bd_may_claim+0x2e/0x2e
> > Feb 12 07:03:27 vm-master kernel: [2009644.547000]  [<ffffffff8111ac31>]
> ?
> > blkdev_aio_write+0x1f/0x61
> > Feb 12 07:03:27 vm-master kernel: [2009644.547003]  [<ffffffff8111ac12>]
> ?
> > bd_may_claim+0x2e/0x2e
> > Feb 12 07:03:27 vm-master kernel: [2009644.547005]  [<ffffffff811249e8>]
> ?
> > aio_rw_vect_retry+0x70/0x18e
> > Feb 12 07:03:27 vm-master kernel: [2009644.547008]  [<ffffffff81124978>]
> ?
> > lookup_ioctx+0x53/0x53
> > Feb 12 07:03:27 vm-master kernel: [2009644.547010]  [<ffffffff811253fe>]
> ?
> > aio_run_iocb+0x70/0x11b
> > Feb 12 07:03:27 vm-master kernel: [2009644.547013]  [<ffffffff8112645b>]
> ?
> > do_io_submit+0x442/0x4d7
> > Feb 12 07:03:27 vm-master kernel: [2009644.547017]  [<ffffffff81332e12>]
> ?
> > system_call_fastpath+0x16/0x1b
> >
> > Feb 12 07:03:27 vm-master kernel: [2009644.547265] INFO: task
> md1_resync:15384
> > blocked for more than 120 seconds.
> > Feb 12 07:03:27 vm-master kernel: [2009644.547319] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Feb 12 07:03:27 vm-master kernel: [2009644.547411] md1_resync      D
> > ffff88067fc12f40     0 15384      2 0x00000000
> > Feb 12 07:03:27 vm-master kernel: [2009644.547414]  ffff880537dc0ee0
> > 0000000000000046 0000000000000000 ffffffff8160d020
> > Feb 12 07:03:27 vm-master kernel: [2009644.547418]  0000000000012f40
> > ffff8801212affd8 ffff8801212affd8 ffff880537dc0ee0
> > Feb 12 07:03:27 vm-master kernel: [2009644.547421]  0000000000011210
> > ffffffff81070175 0000000000000046 ffff8806633225c0
> > Feb 12 07:03:27 vm-master kernel: [2009644.547424] Call Trace:
> > Feb 12 07:03:27 vm-master kernel: [2009644.547429]  [<ffffffff81070175>]
> ?
> > arch_local_irq_save+0x11/0x17
> > Feb 12 07:03:27 vm-master kernel: [2009644.547433]  [<ffffffffa0093917>]
> ?
> > raise_barrier+0x11a/0x14d [raid1]
> > Feb 12 07:03:27 vm-master kernel: [2009644.547436]  [<ffffffff8103e3ca>]
> ?
> > try_to_wake_up+0x192/0x192
> > Feb 12 07:03:27 vm-master kernel: [2009644.547440]  [<ffffffffa0095fc7>]
> ?
> > sync_request+0x192/0x70b [raid1]
> > Feb 12 07:03:27 vm-master kernel: [2009644.547446]  [<ffffffffa00f147c>]
> ?
> > md_do_sync+0x760/0xb64 [md_mod]
> > Feb 12 07:03:27 vm-master kernel: [2009644.547450]  [<ffffffff8103493d>]
> ?
> > set_task_rq+0x23/0x35
> > Feb 12 07:03:27 vm-master kernel: [2009644.547454]  [<ffffffff8105ec6b>]
> ?
> > add_wait_queue+0x3c/0x3c
> > Feb 12 07:03:27 vm-master kernel: [2009644.547459]  [<ffffffffa00ee28a>]
> ?
> > md_thread+0x101/0x11f [md_mod]
> > Feb 12 07:03:27 vm-master kernel: [2009644.547464]  [<ffffffffa00ee189>]
> ?
> > md_rdev_init+0xea/0xea [md_mod]
> > Feb 12 07:03:27 vm-master kernel: [2009644.547467]  [<ffffffff8105e625>]
> ?
> > kthread+0x76/0x7e
> > Feb 12 07:03:27 vm-master kernel: [2009644.547470]  [<ffffffff81334f74>]
> ?
> > kernel_thread_helper+0x4/0x10
> > Feb 12 07:03:27 vm-master kernel: [2009644.547473]  [<ffffffff8105e5af>]
> ?
> > kthread_worker_fn+0x139/0x139
> > Feb 12 07:03:27 vm-master kernel: [2009644.547475]  [<ffffffff81334f70>]
> ?
> > gs_change+0x13/0x13
> >
> > Also, at 06:50 one of the virtual machines started to generate heavy I/O
> > load (a backup). So this, together with the resync run, probably set the
> > scene for this to happen.
> >
> > Afterwards all DRBD devices were "blocking", i.e. all virtual machines
> > running on these devices were unresponsive and had to be shut down "cold".
> >
> > I also tried to disconnect & reconnect the DRBD devices to the peer, but
> > to no avail. The server had to be restarted.
> >
> > Are there any known problems with DRBD on top of an MDADM RAID? I noticed
> > that, unlike regular access from userspace, DRBD access to the RAID does
> > not seem to throttle the verify/rebuild run down to idle speed (1000 K/s)
> > but rather keeps going at nearly full speed?!
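
On the throttling point: as far as I know md only drops the resync/verify to
speed_limit_min when it detects other I/O on the array, and both limits can be
capped. A sketch (the 5000 KB/s value is just an example, not a
recommendation):

  cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max  # current limits in KB/s
  echo 5000 > /proc/sys/dev/raid/speed_limit_max     # cap verify/rebuild speed system-wide
  echo 5000 > /sys/block/md1/md/sync_speed_max       # or cap only this array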
> >
> > Thanks, Andreas
> > _______________________________________________
> > drbd-user mailing list
> > drbd-user at lists.linbit.com
> > http://lists.linbit.com/mailman/listinfo/drbd-user
> >
> >
> >
>