Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Maybe it is an error in DRBD; you have a lot of DRBD resources syncing :) As I read it, you have set up the configuration described at http://www.drbd.org/users-guide-8.3/s-nested-lvm.html. Did you make any configuration changes, or kernel changes? The failure comes from KVM; did you do any upgrade recently?

On Mon, Feb 13, 2012 at 11:13 PM, Andreas Bauer <ab at voltage.de> wrote:

> > Can you describe to us the configuration.. where is the Software RAID1? if you
> > can send the mount points.. What filesystem?
>
> The RAID:
>
> root at vm-master:~# cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 sda1[0] sdb1[1]
>       976759672 blocks super 1.2 [2/2] [UU]
>
> md1 : active raid1 sdc1[0] sdd1[12]
>       976759672 blocks super 1.2 [2/2] [UU]
>
> The LVM:
>
> root at vm-master:~# pvs
>   PV         VG   Fmt  Attr PSize   PFree
>   /dev/md0   vg   lvm2 a--  931,51g 385,30g
>   /dev/md1   vg   lvm2 a--  931,51g 450,25g
>
> On top of LVM sits the DRBD, and on top of DRBD is KVM (no filesystem):
>
> <disk type='file' device='disk'>
>   <source file='/dev/drbd6'/>
>   <driver name="qemu" type="raw" io="native" cache="none" />
>   <target dev='vda' bus='virtio' />
> </disk>
>
> DRBD:
>
> root at vm-master:~# cat /proc/drbd
> version: 8.3.11 (api:88/proto:86-96)
> srcversion: 21CA73FE6D7D9C67B0C6AB2
>
>  1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>     ns:521268 nr:347036 dw:348112 dr:1007984 al:3 bm:49 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>     ns:53248 nr:472 dw:472 dr:54480 al:0 bm:8 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  3: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
>     ns:1933528 nr:0 dw:1413336 dr:744511 al:132 bm:73 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  4: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
>     ns:0 nr:0 dw:0 dr:1104 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  5: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
>     ns:0 nr:0 dw:0 dr:1104 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  6: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>     ns:520192 nr:19824040 dw:19824040 dr:520192 al:0 bm:82 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  7: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>     ns:520192 nr:10933108 dw:10933108 dr:520192 al:0 bm:76 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  8: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>     ns:282624 nr:1518364 dw:1518364 dr:282624 al:0 bm:36 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  9: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>     ns:0 nr:2262112 dw:2262112 dr:0 al:0 bm:82 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> 20: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>     ns:0 nr:89151580 dw:243215196 dr:1987664 al:692 bm:14963 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 21: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
>     ns:999612 nr:88 dw:479508 dr:2799008 al:330 bm:78 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 22: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>     ns:520192 nr:4488702 dw:4488702 dr:520192 al:0 bm:101 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 23: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
>     ns:24576 nr:123638 dw:123638 dr:24576 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> Small error below, it is of course DRBD 8.3.11 (not 8.4.11).
>
> Again, the error occurred when a RAID verify and an early morning backup in
> one of the virtual machines occurred at the same time.
> As I understand, the error itself just says that a task was blocked for more
> than 2 minutes. The strange thing is that this situation did not recover by
> itself, but completely blocked *ALL* virtual machines.
>
> That means, shortly afterwards all virtual machines were completely
> unresponsive (network, console, ...), and DRBD was unresponsive as well.
> For example "fdisk /dev/drbd20" would block.
>
> On shutdown there supposedly was an error that the drbd module could not
> be unloaded (I was only remote so cannot tell for sure).
>
> BTW:
>
> root at vm-master:/sys/class/block# for i in * ;
> > do
> > echo -n $i scheduler: ; cat $i/queue/scheduler
> > done
> dm-0 scheduler:none
> dm-1 scheduler:none
> dm-11 scheduler:none
> dm-14 scheduler:none
> dm-15 scheduler:none
> dm-16 scheduler:none
> dm-17 scheduler:none
> dm-18 scheduler:none
> dm-19 scheduler:none
> dm-2 scheduler:none
> dm-20 scheduler:none
> dm-21 scheduler:none
> dm-22 scheduler:none
> dm-23 scheduler:none
> dm-24 scheduler:none
> dm-25 scheduler:none
> dm-26 scheduler:none
> dm-27 scheduler:none
> dm-28 scheduler:none
> dm-29 scheduler:none
> dm-3 scheduler:none
> dm-30 scheduler:none
> dm-31 scheduler:none
> dm-4 scheduler:none
> dm-5 scheduler:none
> dm-6 scheduler:none
> dm-7 scheduler:none
> dm-8 scheduler:none
> dm-9 scheduler:none
> drbd1 scheduler:none
> drbd2 scheduler:none
> drbd20 scheduler:none
> drbd21 scheduler:none
> drbd22 scheduler:none
> drbd23 scheduler:none
> drbd3 scheduler:none
> drbd4 scheduler:none
> drbd5 scheduler:none
> drbd6 scheduler:none
> drbd7 scheduler:none
> drbd8 scheduler:none
> drbd9 scheduler:none
> loop0 scheduler:none
> loop1 scheduler:none
> loop2 scheduler:none
> loop3 scheduler:none
> loop4 scheduler:none
> loop5 scheduler:none
> loop6 scheduler:none
> loop7 scheduler:none
> md0 scheduler:none
> md1 scheduler:none
> sda scheduler:noop [deadline] cfq
> sdb scheduler:noop [deadline] cfq
> sdc scheduler:noop [deadline] cfq
> sdd scheduler:noop [deadline] cfq
> sr0 scheduler:noop deadline [cfq]
>
> Any ideas what happened here and how to avoid that in the future?
>
> Thanks!
>
> > Kernel 3.1.0 / DRBD 8.4.11
> >
> > KVM virtual machines, running on top of
> > DRBD (protocol A, sndbuf-size 10M, data-integrity-alg md5, external meta data,
> > meta data on different physical disk), running on top of
> > LVM2, running on top of
> > Software RAID 1
> >
> > This night there was a MDADM verify run which was still ongoing this morning,
> > when this happened:
> >
> > Feb 12 07:03:27 vm-master kernel: [2009644.546758] INFO: task kvm:21152 blocked for more than 120 seconds.
> > Feb 12 07:03:27 vm-master kernel: [2009644.546813] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Feb 12 07:03:27 vm-master kernel: [2009644.546896] kvm             D ffff88067fc92f40     0 21152      1 0x00000000
> > Feb 12 07:03:27 vm-master kernel: [2009644.546901]  ffff8806633a4fa0 0000000000000082 ffff880600000000 ffff8806668ec0c0
> > Feb 12 07:03:27 vm-master kernel: [2009644.546904]  0000000000012f40 ffff88017ae01fd8 ffff88017ae01fd8 ffff8806633a4fa0
> > Feb 12 07:03:27 vm-master kernel: [2009644.546908]  ffff88017ae01fd8 0000000181070175 0000000000000046 ffff8806633225c0
> > Feb 12 07:03:27 vm-master kernel: [2009644.546911] Call Trace:
> > Feb 12 07:03:27 vm-master kernel: [2009644.546925]  [<ffffffffa00939d1>] ? wait_barrier+0x87/0xc0 [raid1]
> > Feb 12 07:03:27 vm-master kernel: [2009644.546931]  [<ffffffff8103e3ca>] ? try_to_wake_up+0x192/0x192
> > Feb 12 07:03:27 vm-master kernel: [2009644.546935]  [<ffffffffa0096831>] ? make_request+0x111/0x9c8 [raid1]
> > Feb 12 07:03:27 vm-master kernel: [2009644.546941]  [<ffffffffa0102c76>] ? clone_bio+0x43/0xcb [dm_mod]
> > Feb 12 07:03:27 vm-master kernel: [2009644.546946]  [<ffffffffa010386f>] ? __split_and_process_bio+0x4f4/0x506 [dm_mod]
> > Feb 12 07:03:27 vm-master kernel: [2009644.546952]  [<ffffffffa00edd93>] ? md_make_request+0xce/0x1c3 [md_mod]
> > Feb 12 07:03:27 vm-master kernel: [2009644.546956]  [<ffffffff81188919>] ? generic_make_request+0x270/0x2ea
> > Feb 12 07:03:27 vm-master kernel: [2009644.546960]  [<ffffffff81188a66>] ? submit_bio+0xd3/0xf1
> > Feb 12 07:03:27 vm-master kernel: [2009644.546964]  [<ffffffff81119181>] ? __bio_add_page.part.12+0x135/0x1ed
> > Feb 12 07:03:27 vm-master kernel: [2009644.546967]  [<ffffffff8111bbc1>] ? dio_bio_submit+0x6c/0x8a
> > Feb 12 07:03:27 vm-master kernel: [2009644.546970]  [<ffffffff8111becd>] ? dio_send_cur_page+0x6e/0x91
> > Feb 12 07:03:27 vm-master kernel: [2009644.546972]  [<ffffffff8111bfa3>] ? submit_page_section+0xb3/0x11a
> > Feb 12 07:03:27 vm-master kernel: [2009644.546975]  [<ffffffff8111c7d8>] ? __blockdev_direct_IO+0x68a/0x995
> > Feb 12 07:03:27 vm-master kernel: [2009644.546978]  [<ffffffff8111a836>] ? blkdev_direct_IO+0x4e/0x53
> > Feb 12 07:03:27 vm-master kernel: [2009644.546981]  [<ffffffff8111ab5b>] ? blkdev_get_block+0x5b/0x5b
> > Feb 12 07:03:27 vm-master kernel: [2009644.546985]  [<ffffffff810b0fdb>] ? generic_file_direct_write+0xdc/0x146
> > Feb 12 07:03:27 vm-master kernel: [2009644.546987]  [<ffffffff810b11d9>] ? __generic_file_aio_write+0x194/0x278
> > Feb 12 07:03:27 vm-master kernel: [2009644.546992]  [<ffffffff810131f1>] ? paravirt_read_tsc+0x5/0x8
> > Feb 12 07:03:27 vm-master kernel: [2009644.546995]  [<ffffffff810f4501>] ? rw_copy_check_uvector+0x48/0xf8
> > Feb 12 07:03:27 vm-master kernel: [2009644.546998]  [<ffffffff8111ac12>] ? bd_may_claim+0x2e/0x2e
> > Feb 12 07:03:27 vm-master kernel: [2009644.547000]  [<ffffffff8111ac31>] ? blkdev_aio_write+0x1f/0x61
> > Feb 12 07:03:27 vm-master kernel: [2009644.547003]  [<ffffffff8111ac12>] ? bd_may_claim+0x2e/0x2e
> > Feb 12 07:03:27 vm-master kernel: [2009644.547005]  [<ffffffff811249e8>] ? aio_rw_vect_retry+0x70/0x18e
> > Feb 12 07:03:27 vm-master kernel: [2009644.547008]  [<ffffffff81124978>] ? lookup_ioctx+0x53/0x53
> > Feb 12 07:03:27 vm-master kernel: [2009644.547010]  [<ffffffff811253fe>] ? aio_run_iocb+0x70/0x11b
> > Feb 12 07:03:27 vm-master kernel: [2009644.547013]  [<ffffffff8112645b>] ? do_io_submit+0x442/0x4d7
> > Feb 12 07:03:27 vm-master kernel: [2009644.547017]  [<ffffffff81332e12>] ? system_call_fastpath+0x16/0x1b
> >
> > Feb 12 07:03:27 vm-master kernel: [2009644.547265] INFO: task md1_resync:15384 blocked for more than 120 seconds.
> > Feb 12 07:03:27 vm-master kernel: [2009644.547319] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Feb 12 07:03:27 vm-master kernel: [2009644.547411] md1_resync      D ffff88067fc12f40     0 15384      2 0x00000000
> > Feb 12 07:03:27 vm-master kernel: [2009644.547414]  ffff880537dc0ee0 0000000000000046 0000000000000000 ffffffff8160d020
> > Feb 12 07:03:27 vm-master kernel: [2009644.547418]  0000000000012f40 ffff8801212affd8 ffff8801212affd8 ffff880537dc0ee0
> > Feb 12 07:03:27 vm-master kernel: [2009644.547421]  0000000000011210 ffffffff81070175 0000000000000046 ffff8806633225c0
> > Feb 12 07:03:27 vm-master kernel: [2009644.547424] Call Trace:
> > Feb 12 07:03:27 vm-master kernel: [2009644.547429]  [<ffffffff81070175>] ? arch_local_irq_save+0x11/0x17
> > Feb 12 07:03:27 vm-master kernel: [2009644.547433]  [<ffffffffa0093917>] ? raise_barrier+0x11a/0x14d [raid1]
> > Feb 12 07:03:27 vm-master kernel: [2009644.547436]  [<ffffffff8103e3ca>] ? try_to_wake_up+0x192/0x192
> > Feb 12 07:03:27 vm-master kernel: [2009644.547440]  [<ffffffffa0095fc7>] ? sync_request+0x192/0x70b [raid1]
> > Feb 12 07:03:27 vm-master kernel: [2009644.547446]  [<ffffffffa00f147c>] ? md_do_sync+0x760/0xb64 [md_mod]
> > Feb 12 07:03:27 vm-master kernel: [2009644.547450]  [<ffffffff8103493d>] ? set_task_rq+0x23/0x35
> > Feb 12 07:03:27 vm-master kernel: [2009644.547454]  [<ffffffff8105ec6b>] ? add_wait_queue+0x3c/0x3c
> > Feb 12 07:03:27 vm-master kernel: [2009644.547459]  [<ffffffffa00ee28a>] ? md_thread+0x101/0x11f [md_mod]
> > Feb 12 07:03:27 vm-master kernel: [2009644.547464]  [<ffffffffa00ee189>] ? md_rdev_init+0xea/0xea [md_mod]
> > Feb 12 07:03:27 vm-master kernel: [2009644.547467]  [<ffffffff8105e625>] ? kthread+0x76/0x7e
> > Feb 12 07:03:27 vm-master kernel: [2009644.547470]  [<ffffffff81334f74>] ? kernel_thread_helper+0x4/0x10
> > Feb 12 07:03:27 vm-master kernel: [2009644.547473]  [<ffffffff8105e5af>] ? kthread_worker_fn+0x139/0x139
> > Feb 12 07:03:27 vm-master kernel: [2009644.547475]  [<ffffffff81334f70>] ? gs_change+0x13/0x13
> >
> > Also, at 06:50 one of the virtual machines did start to generate heavy I/O load
> > (backup). So this, probably together with the resync run, set the scene for this
> > to happen.
> >
> > Afterwards all DRBD devices were "blocking", i.e. all virtual machines running
> > on these devices were unresponsive and had to be shut down "cold".
> >
> > I also tried to disconnect & reconnect the DRBD devices to the peer, but to no
> > avail. The server had to be restarted.
> >
> > Are there any known problems with DRBD on top of a MDADM raid? I noticed that
> > unlike for regular access from userspace, DRBD access to the RAID seems to not
> > throttle the verify/rebuild run down to idle speed (1000K/s) but rather keep
> > going at nearly full speed?!
> >
> > Thanks, Andreas
> > _______________________________________________
> > drbd-user mailing list
> > drbd-user at lists.linbit.com
> > http://lists.linbit.com/mailman/listinfo/drbd-user
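For readers trying to picture the stack described above (software RAID1 under LVM, DRBD on the LVs, KVM on the raw DRBD devices), a rough DRBD 8.3 resource definition matching that layout might look like the sketch below. The resource name, LV paths, peer host name, addresses and port are made-up placeholders; only protocol A, the 10M send buffer, the md5 data-integrity algorithm and the external metadata are taken from the post itself.

    resource vm6 {
      protocol A;
      net {
        sndbuf-size 10M;
        data-integrity-alg md5;
      }
      on vm-master {
        device    /dev/drbd6;
        disk      /dev/vg/vm6;           # LV in the vg built on md0/md1 (placeholder name)
        address   10.0.0.1:7796;         # placeholder replication address/port
        meta-disk /dev/vg/vm6_meta[0];   # external metadata on another LV/disk (placeholder)
      }
      on vm-peer {
        device    /dev/drbd6;
        disk      /dev/vg/vm6;
        address   10.0.0.2:7796;
        meta-disk /dev/vg/vm6_meta[0];
      }
    }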
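On the question above about avoiding this in the future, and the observation that the verify run does not throttle down while DRBD is generating I/O: one knob that is commonly used is md's check/resync speed limit, which can be capped globally or per array. This is only a sketch with arbitrary example values (10000/1000 KB/s), not a figure from the thread; scheduling the periodic check outside the backup window is another obvious mitigation.

    # Cap check/resync bandwidth for all md arrays (KB/s); example values only.
    sysctl -w dev.raid.speed_limit_max=10000
    sysctl -w dev.raid.speed_limit_min=1000

    # Or limit just one array, e.g. md1:
    echo 10000 > /sys/block/md1/md/sync_speed_max
    echo 1000  > /sys/block/md1/md/sync_speed_min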