[DRBD-user] DRBD 8.4.1 fails to bring up stacked resource on debian 2.6.32-5-amd64

Lars Ellenberg lars.ellenberg at linbit.com
Tue Jan 31 17:14:04 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Mon, Jan 30, 2012 at 02:56:57PM -0600, Ronald Wells wrote:
> Hello, I am having some troubles setting up a stacked resource on
> drbd 8.4.1 with debian.  If I do this exact same process with 8.4.0
> it works.

Only that 8.4.0 is much more seriously broken than 8.4.1 :-/

> I'm testing it on a vm under vsphere 5.0.  I configured the vm as follows:
> os: debian 2.6 x 64
> cpu x 2
> mem 1gb
> hard disk1: 512mb
> hard disk2: 512gb
> network1 on vmnetwork (connected to the real network)
> network2 on vmprivatenetwork (not connected to any physical network,
> just for traffic between vms)
> 
> I installed debian with simple one partition setup on disk1, didn't
> install any additional packages, hostname drbd.
> 
> created single partition on disk2 using all available disk space.
> did the following to install drbd 8.4.1:
> 
> aptitude install make gcc flex linux-headers-$(uname -r) -y
> wget http://oss.linbit.com/drbd/8.4/drbd-8.4.1.tar.gz
> tar -zxvf drbd-8.4.1.tar.gz
> cd drbd-8.4.1
> ./configure --with-km --sysconfdir=/etc --localstatedir=/var
> make
> make install
> <<reboot>>
> 
> here is my resource definition:
> #meta.res
> 
> resource meta_lower {
>  disk /dev/sdb1;
>  device /dev/drbd0;
>  meta-disk internal;
> on drbd {
>     address 10.50.158.1:7788;
>  }
>  on storage2 {
>     address 10.50.158.2:7788;
>  }
> }
> 
> resource meta {
>  protocol A;
>  device /dev/drbd10;
>  meta-disk internal;
>  stacked-on-top-of meta_lower {
>     address 10.50.158.101:7788;
>  }
>  on openfiler3 {
>     disk /dev/sdb1;
>     address 10.50.250.4:7788;
>  }
> }
> 
> at this point i don't have any other vms created so we're just
> dealing with the one system.
> 
> next i issue the following commands to bring up the resource for the
> first time:
> 
> drbdadm create-md meta_lower
> service drbd start
> drbdadm primary --force meta_lower
> drbdadm --stacked create-md meta
> drbdadm --stacked up meta
> 
> everything works ok until the last command then i see this on the console:
> BUG: soft lockup - CPU#0 stuck for 61s!  [drbdsetup:1229]
> 
> eventually this is the result shown in the command line:
> root@drbd:~# drbdadm --stacked up meta
> Command 'drbdsetup attach 10 /dev/drbd0 /dev/drbd0 internal' did not
> terminate within 121 seconds
> root@drbd:~#
> root@drbd:~# cat /proc/drbd
> version: 8.4.1 (api:1/proto:86-100)
> GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by
> root@drbd, 2012-01-27 16:18:25
> 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----s
>     ns:0 nr:0 dw:16428 dr:548 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1
> wo:b oos:536851748
> 
> 10: cs:StandAlone ro:Secondary/Unknown ds:Attaching/DUnknown   r-----
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:2 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
> 
> 
> and this in the /var/log/kern.log file:
> 
> Jan 27 16:21:38 drbd kernel: [  126.053260] d-con meta: Starting worker thread (from drbdsetup [1229])
> Jan 27 16:21:38 drbd kernel: [  126.053391] block drbd10: disk( Diskless -> Attaching )
> Jan 27 16:21:38 drbd kernel: [  126.059363] d-con meta: Method to ensure write ordering: barrier

I suggest you retry and configure
	"no-disk-barriers; no-disk-flushes; no-md-flushes;"
for at least the stacked resource;
that may help to work around this issue.
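
For illustration, in 8.4 configuration syntax that would be a disk
section in the stacked resource, roughly like this (untested sketch
based on your posted meta resource; check drbd.conf(5) of your build
for the exact keyword spelling):

resource meta {
  protocol A;
  device /dev/drbd10;
  meta-disk internal;
  disk {
    disk-barrier no;    # 8.3 style: no-disk-barrier
    disk-flushes no;    # 8.3 style: no-disk-flushes
    md-flushes no;      # 8.3 style: no-md-flushes
  }
  stacked-on-top-of meta_lower {
    address 10.50.158.101:7788;
  }
  on openfiler3 {
    disk /dev/sdb1;
    address 10.50.250.4:7788;
  }
}

Then retry the "drbdadm --stacked up meta" step.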

> Jan 27 16:21:38 drbd kernel: [  126.059370] block drbd10: max BIO size = 4096
> Jan 27 16:21:38 drbd kernel: [  126.059375] block drbd10: drbd_bm_resize called with capacity == 1073670656
> Jan 27 16:21:38 drbd kernel: [  126.068945] block drbd10: resync bitmap: bits=134208832 words=2097013 pages=4096
> Jan 27 16:21:38 drbd kernel: [  126.068950] block drbd10: size = 512 GB (536835328 KB)
> Jan 27 16:21:38 drbd kernel: [  126.068981] block drbd10: Writing the whole bitmap, size changed
> Jan 27 16:21:38 drbd kernel: [  126.305549] block drbd10: bitmap WRITE of 4096 pages took 60 jiffies
> Jan 27 16:21:43 drbd kernel: [  131.052868] block drbd10: md_sync_timer expired! Worker calls drbd_md_sync().
> Jan 27 16:22:43 drbd kernel: [  191.128234] BUG: soft lockup - CPU#0 stuck for 61s! [drbdsetup:1229]
> Jan 27 16:22:43 drbd kernel: [  191.128400] Modules linked in: drbd
> crc32c libcrc32c loop snd_pcm snd_timer snd soundcore snd_page_alloc
> parport_pc parport evdev psmouse serio_raw pcspkr i2c_piix4 shpchp
> pci_hotplug i2c_core ac container processor button ext3 jbd mbcache
> sg sd_mod crc_t10dif sr_mod cdrom ata_generic mptspi ata_piix
> mptscsih floppy mptbase scsi_transport_spi e1000 libata thermal
> thermal_sys scsi_mod [last unloaded: scsi_wait_scan]
> Jan 27 16:22:43 drbd kernel: [  191.128449] CPU 0:
> Jan 27 16:22:43 drbd kernel: [  191.128469] Pid: 1229, comm: drbdsetup Not tainted 2.6.32-5-amd64 #1 VMware Virtual Platform
> Jan 27 16:22:43 drbd kernel: [  191.128471] RIP: 0010:[<ffffffff81180efd>]  [<ffffffff81180efd>] bio_end_empty_barrier+0x12/0x24
> Jan 27 16:22:43 drbd kernel: [  191.128496] RSP: 0018:ffff88003a3037e0  EFLAGS: 00000282
> Jan 27 16:22:43 drbd kernel: [  191.128498] RAX: ffff880039361258 RBX: ffff88003d5c3800 RCX: 0000000000000000
> Jan 27 16:22:43 drbd kernel: [  191.128499] RDX: 0000000000000000 RSI: 00000000ffffffa1 RDI: ffff880039361240
> Jan 27 16:22:43 drbd kernel: [  191.128501] RBP: ffffffff8101166e R08: ffff88003a303960 R09: ffffffff813aec89
> Jan 27 16:22:43 drbd kernel: [  191.128503] R10: 0000000000000000 R11: ffffffff81180eeb R12: ffff88003a303960
> Jan 27 16:22:43 drbd kernel: [  191.128504] R13: ffffffff813aec89 R14: 0000000000000000 R15: ffffffff81180eeb
> Jan 27 16:22:43 drbd kernel: [  191.128541] FS: 00007fd9e7726700(0000) GS:ffff880001800000(0000) knlGS:0000000000000000
> Jan 27 16:22:43 drbd kernel: [  191.128543] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Jan 27 16:22:43 drbd kernel: [  191.128545] CR2: 00007f588962c577 CR3: 000000003a34e000 CR4: 00000000000006f0
> Jan 27 16:22:43 drbd kernel: [  191.128564] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Jan 27 16:22:43 drbd kernel: [  191.128579] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Jan 27 16:22:43 drbd kernel: [  191.128580] Call Trace:

> Jan 27 16:22:43 drbd kernel: [  191.128602]  [<ffffffffa02882fc>] ? drbd_make_request+0x2b/0x14b [drbd]
> Jan 27 16:22:43 drbd kernel: [  191.128605]  [<ffffffff81180eeb>] ? bio_end_empty_barrier+0x0/0x24

This does not compute.

Why would bio_end_empty_barrier() call drbd_make_request?
Or, why would bio_end_empty_barrier() call anything at all?
That's an atomic bio completion function that sets at most two bits
and signals one completion.

It cannot possibly wait for anything (except for the spinlock inside
the completion, and contention on that would be reported as a spinlock
deadlock, not as a "soft lockup").
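
For reference, that completion callback in 2.6.32 is essentially this
(paraphrased from memory from the kernel's block/blk-barrier.c):

/* bi_end_io of the empty barrier bio used by blkdev_issue_flush():
 * record EOPNOTSUPP / !uptodate in the bio flags, wake up the waiter.
 * Nothing in here sleeps, loops, or takes I/O locks. */
static void bio_end_empty_barrier(struct bio *bio, int err)
{
	if (err) {
		if (err == -EOPNOTSUPP)
			set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
		clear_bit(BIO_UPTODATE, &bio->bi_flags);
	}
	complete(bio->bi_private);
}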

The call path actually should be
 bm_rw -> blkdev_issue_flush -> submit_bio -> generic_make_request ->
drbd_make_request -> bio_endio(, -EOPNOTSUPP), which ends up in
bio_end_empty_barrier(), which, as indicated above, should be atomic.

With no-md-flushes etc. configured as suggested above, DRBD would not
even try to call blkdev_issue_flush, skipping this path completely.
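
Abbreviated, the 2.6.32 blkdev_issue_flush() does roughly this (again
paraphrased from memory, error handling trimmed):

int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
{
	DECLARE_COMPLETION_ONSTACK(wait);
	struct bio *bio = bio_alloc(GFP_KERNEL, 0);
	int ret = 0;

	/* submit an empty barrier bio against the backing device ... */
	bio->bi_end_io  = bio_end_empty_barrier;
	bio->bi_private = &wait;
	bio->bi_bdev    = bdev;
	submit_bio(WRITE_BARRIER, bio);

	/* ... and wait for bio_end_empty_barrier() to complete it.
	 * If drbd_make_request() fails the barrier with -EOPNOTSUPP,
	 * that completion fires immediately, so this should return
	 * right away instead of spinning for a minute. */
	wait_for_completion(&wait);

	if (bio_flagged(bio, BIO_EOPNOTSUPP))
		ret = -EOPNOTSUPP;
	else if (!bio_flagged(bio, BIO_UPTODATE))
		ret = -EIO;

	bio_put(bio);
	return ret;
}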

> Jan 27 16:22:43 drbd kernel: [  191.128610]  [<ffffffff8117e24b>] ? generic_make_request+0x299/0x2f9
> Jan 27 16:22:43 drbd kernel: [  191.128612]  [<ffffffff8117e381>] ? submit_bio+0xd6/0xf2
> Jan 27 16:22:43 drbd kernel: [  191.128615]  [<ffffffff81180c84>] ? blkdev_issue_flush+0x78/0xc1
> Jan 27 16:22:43 drbd kernel: [  191.128622]  [<ffffffffa027710a>] ? bm_rw+0x2f2/0x3f8 [drbd]
> Jan 27 16:22:43 drbd kernel: [  191.128626]  [<ffffffffa0277232>] ? drbd_bm_write+0x0/0xe [drbd]
> Jan 27 16:22:43 drbd kernel: [  191.128632]  [<ffffffffa028be8f>] ? drbd_bitmap_io+0x86/0xab [drbd]
> Jan 27 16:22:43 drbd kernel: [  191.128637]  [<ffffffffa0295f84>] ? drbd_determine_dev_size+0x2e0/0x367 [drbd]
> Jan 27 16:22:43 drbd kernel: [  191.128640]  [<ffffffff81181b15>] ? blk_queue_stack_limits+0x6e/0x85
> Jan 27 16:22:43 drbd kernel: [  191.128644]  [<ffffffffa0296bd9>] ? drbd_adm_attach+0x8b0/0xca2 [drbd]
> Jan 27 16:22:43 drbd kernel: [  191.128656]  [<ffffffff8103fa2a>] ? __wake_up+0x30/0x44
> Jan 27 16:22:43 drbd kernel: [  191.128664]  [<ffffffff8119cb45>] ? nla_parse+0x4b/0xb2
> Jan 27 16:22:43 drbd kernel: [  191.128673]  [<ffffffff8126bca5>] ? genl_rcv_msg+0x1d9/0x201
> Jan 27 16:22:43 drbd kernel: [  191.128676]  [<ffffffff8126bacc>] ? genl_rcv_msg+0x0/0x201
> Jan 27 16:22:43 drbd kernel: [  191.128678]  [<ffffffff8126ad20>] ? netlink_rcv_skb+0x34/0x7c
> Jan 27 16:22:43 drbd kernel: [  191.128680]  [<ffffffff8126babf>] ? genl_rcv+0x1f/0x2c
> Jan 27 16:22:43 drbd kernel: [  191.128682]  [<ffffffff8126ab14>] ? netlink_unicast+0xe2/0x148
> Jan 27 16:22:43 drbd kernel: [  191.128688]  [<ffffffff81248999>] ? __alloc_skb+0x69/0x15a
> Jan 27 16:22:43 drbd kernel: [  191.128690]  [<ffffffff8126b240>] ? netlink_sendmsg+0x242/0x255
> Jan 27 16:22:43 drbd kernel: [  191.128695]  [<ffffffff81240b7c>] ? sock_aio_write+0xb1/0xbc
> Jan 27 16:22:43 drbd kernel: [  191.128707]  [<ffffffff810b41cb>] ? find_get_page+0x1a/0x77
> Jan 27 16:22:43 drbd kernel: [  191.128712]  [<ffffffff810cad1a>] ? __do_fault+0x38c/0x3c3
> Jan 27 16:22:43 drbd kernel: [  191.128719]  [<ffffffff810eebf2>] ? do_sync_write+0xce/0x113
> Jan 27 16:22:43 drbd kernel: [  191.128727]  [<ffffffff81064f92>] ? autoremove_wake_function+0x0/0x2e
> Jan 27 16:22:43 drbd kernel: [  191.128729]  [<ffffffff810ef557>] ? vfs_write+0xbc/0x102
> Jan 27 16:22:43 drbd kernel: [  191.128731]  [<ffffffff810ef659>] ? sys_write+0x45/0x6e
> Jan 27 16:22:43 drbd kernel: [  191.128737]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


