[DRBD-user] Re: drbd 0.7.20 lockup

Bradley Baetz bradley.baetz at optusnet.com.au
Fri Jul 14 15:22:11 CEST 2006



On Fri, Jul 14, 2006 at 10:46:27PM +1000, Bradley Baetz wrote:
> [please cc me on replies; I'm not subscribed to the list]

> If I do reproduce it, I'll see if I can get a tcpdump too.

...and I managed it. To reproduce:

1. Reboot the secondary.
2. Wait for it to come up and sync.
3. Run hb_standby on the master.

postgres/kjournald get stuck in D state, heartbeat's sanity checks kick
in, and it tries reboot -f, which also gets stuck in D state.

I also have:

[root@dbtools01 ~]# cat /proc/drbd
version: 0.7.20 (api:79/proto:74)
SVN Revision: 2260 build by root@build03, 2006-07-14 10:39:06
 0: cs:Connected st:Primary/Secondary ld:Consistent
    ns:760504 nr:452 dw:49876 dr:742518 al:11 bm:311 lo:0 pe:3 ua:0 ap:3
[root@dbtools02 ~]# cat /proc/drbd
version: 0.7.20 (api:79/proto:74)
SVN Revision: 2260 build by root@build03, 2006-07-14 10:39:06
 0: cs:Connected st:Secondary/Primary ld:Consistent
    ns:0 nr:208 dw:208 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0

I then rebooted the master with reboot -n -f, and when it came back up:

[root@dbtools01 ~]# cat /proc/drbd
version: 0.7.20 (api:79/proto:74)
SVN Revision: 2260 build by root@build03, 2006-07-14 10:39:06
 0: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:68892 nr:0 dw:0 dr:72000 al:0 bm:223 lo:1 pe:7 ua:777 ap:0
        [=>..................] sync'ed:  8.4% (811772/880640)K
        stalled
[root@dbtools02 ~]# cat /proc/drbd
version: 0.7.20 (api:79/proto:74)
SVN Revision: 2260 build by root@build03, 2006-07-14 10:39:06
 0: cs:SyncTarget st:Secondary/Secondary ld:Inconsistent
    ns:0 nr:69076 dw:69076 dr:0 al:0 bm:14 lo:0 pe:819 ua:0 ap:0
        [=>..................] sync'ed:  8.4% (811772/880640)K
        stalled
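
In case it helps anyone watch for this state automatically, here is a
minimal Python sketch that parses /proc/drbd text in the 0.7 layout shown
above. The function names and the looks_stuck() heuristic are my own
assumptions for illustration, not anything DRBD ships. As far as I
understand the counters, pe, ua and ap count requests that are still
pending, unacknowledged, or outstanding from the application side.

```python
import re

# Matches the per-device status line, e.g.
# " 0: cs:SyncTarget st:Secondary/Secondary ld:Inconsistent"
STATUS_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)\s+st:(\S+)\s+ld:(\S+)")
# Matches two-letter counters such as "pe:819"
COUNTER_RE = re.compile(r"([a-z]{2}):(\d+)")


def parse_proc_drbd(text):
    """Return {minor: {field: value}} for each device in /proc/drbd text."""
    devices = {}
    current = None
    for line in text.splitlines():
        status = STATUS_RE.match(line)
        if status:
            minor, cs, st, ld = status.groups()
            current = {"cs": cs, "st": st, "ld": ld, "stalled": False}
            devices[int(minor)] = current
        elif current is None:
            continue  # version/build header lines before the first device
        elif line.lstrip().startswith("ns:"):
            # Counter line: ns, nr, dw, dr, al, bm, lo, pe, ua, ap
            for name, value in COUNTER_RE.findall(line):
                current[name] = int(value)
        elif line.strip() == "stalled":
            current["stalled"] = True
    return devices


def looks_stuck(device):
    """Hypothetical heuristic: resync reported as stalled while requests are outstanding."""
    outstanding = device.get("pe", 0) + device.get("ua", 0) + device.get("ap", 0)
    return device["stalled"] and outstanding > 0


# Sample copied from the dbtools02 output above.
SAMPLE = """\
version: 0.7.20 (api:79/proto:74)
 0: cs:SyncTarget st:Secondary/Secondary ld:Inconsistent
    ns:0 nr:69076 dw:69076 dr:0 al:0 bm:14 lo:0 pe:819 ua:0 ap:0
        [=>..................] sync'ed:  8.4% (811772/880640)K
        stalled
"""

if __name__ == "__main__":
    dev = parse_proc_drbd(SAMPLE)[0]
    print(dev["cs"], dev["pe"], looks_stuck(dev))
```

Feeding it the contents of /proc/drbd from a cron job or a heartbeat
monitor script would be one way to use it; a single "stalled" sample may
well be transient, so in practice it probably makes sense to require
several consecutive hits before acting on it.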

tcpdump shows the same data being sent from the secondary to the primary:

83 74 02 67 00 0d 00 00

which is being acked appropriately, and then resent again and again and
again and....

I don't suppose there's an ethereal plugin for the DRBD protocol? :)
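
In the absence of a dissector, a rough Python sketch for splitting those
eight bytes into fields follows. It assumes the packet header is a 32-bit
magic, a 16-bit command and a 16-bit length, all in network byte order;
that layout is my assumption and should be checked against the DRBD
source before relying on it. I have not tried to map the command number
to a name.

```python
import struct


def decode_header(payload_hex):
    """Split 8 header bytes into (magic, command, length).

    Assumed layout: u32 magic, u16 command, u16 length, big-endian.
    """
    raw = bytes.fromhex(payload_hex)  # fromhex() skips the spaces
    if len(raw) < 8:
        raise ValueError("need at least 8 bytes")
    magic, command, length = struct.unpack(">IHH", raw[:8])
    return {"magic": magic, "command": command, "length": length}


if __name__ == "__main__":
    hdr = decode_header("83 74 02 67 00 0d 00 00")
    print(hex(hdr["magic"]), hex(hdr["command"]), hdr["length"])
```

Under that assumption the repeated packet has a command field of 0x000d
and a zero-length payload.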

mount on the primary is stuck in D state:

Jul 14 23:16:15 dbtools01 kernel: mount         D 00000000  2184  2907   2887                     (NOTLB)
Jul 14 23:16:15 dbtools01 kernel: f5d82ac8 00000082 c16ab2e0 00000000 00000000 00000001 00001000 00000001
Jul 14 23:16:15 dbtools01 kernel:        00000001 00000001 f7fcef30 c1816de0 00000001 00029d50 ba9003e8 00000061
Jul 14 23:16:15 dbtools01 kernel:        f7e110b0 f619c130 f619c29c f593365c 00000001 f5933284 00000246 f593328c
Jul 14 23:16:15 dbtools01 kernel: Call Trace:
Jul 14 23:16:15 dbtools01 kernel:  [<c02cfbb1>] __down+0x81/0xdb
Jul 14 23:16:15 dbtools01 kernel:  [<c011e71b>] default_wake_function+0x0/0xc
Jul 14 23:16:15 dbtools01 kernel:  [<c02cfd28>] __down_failed+0x8/0xc
Jul 14 23:16:15 dbtools01 kernel:  [<f8af934c>] .text.lock.drbd_main+0x41/0x18c [drbd]
Jul 14 23:16:15 dbtools01 kernel:  [<f8af263c>] drbd_make_request_common+0x499/0x744 [drbd]
Jul 14 23:16:15 dbtools01 kernel:  [<f885e89c>] __split_bio+0xfd/0x103 [dm_mod]
Jul 14 23:16:15 dbtools01 kernel:  [<f8af2aa7>] drbd_make_request_26+0x1c0/0x1c9 [drbd]
Jul 14 23:16:15 dbtools01 kernel:  [<c022431c>] generic_make_request+0x18e/0x19e
Jul 14 23:16:15 dbtools01 kernel:  [<c0120291>] autoremove_wake_function+0x0/0x2d
Jul 14 23:16:15 dbtools01 kernel:  [<c02243f6>] submit_bio+0xca/0xd2
Jul 14 23:16:15 dbtools01 kernel:  [<c015e7c9>] bio_alloc+0x100/0x168
Jul 14 23:16:15 dbtools01 kernel:  [<c015e180>] submit_bh+0x141/0x166
Jul 14 23:16:15 dbtools01 kernel:  [<c015cc59>] __block_write_full_page+0x1f0/0x2ea
Jul 14 23:16:15 dbtools01 kernel:  [<c0160822>] blkdev_get_block+0x0/0x46
Jul 14 23:16:15 dbtools01 kernel:  [<c015dfc8>] block_write_full_page+0xc5/0xce
Jul 14 23:16:15 dbtools01 kernel:  [<c0160822>] blkdev_get_block+0x0/0x46
Jul 14 23:16:15 dbtools01 kernel:  [<c0177bfa>] mpage_writepages+0x1c2/0x314
Jul 14 23:16:15 dbtools01 kernel:  [<c0160915>] blkdev_writepage+0x0/0xc
Jul 14 23:16:15 dbtools01 kernel:  [<c0144a4d>] do_writepages+0x19/0x27
Jul 14 23:16:15 dbtools01 kernel:  [<c013f8f7>] __filemap_fdatawrite_range+0x7a/0x85
Jul 14 23:16:15 dbtools01 kernel:  [<c013f911>] filemap_fdatawrite+0xf/0x13
Jul 14 23:16:15 dbtools01 kernel:  [<c015b5d9>] sync_blockdev+0x18/0x32
Jul 14 23:16:15 dbtools01 kernel:  [<f88714b6>] journal_recover+0xa2/0xab [jbd]
Jul 14 23:16:15 dbtools01 kernel:  [<f887410e>] journal_load+0x3c/0x6b [jbd]
Jul 14 23:16:15 dbtools01 kernel:  [<f88f760e>] ext3_load_journal+0x124/0x160 [ext3]
Jul 14 23:16:15 dbtools01 kernel:  [<f88f6f4e>] ext3_fill_super+0x70c/0x9a2 [ext3]
Jul 14 23:16:15 dbtools01 kernel:  [<c016025f>] get_sb_bdev+0xe3/0x120
Jul 14 23:16:15 dbtools01 kernel:  [<c02d0ca2>] __cond_resched+0x14/0x39
Jul 14 23:16:15 dbtools01 kernel:  [<f88f7f42>] ext3_get_sb+0xe/0x11 [ext3]
Jul 14 23:16:15 dbtools01 kernel:  [<f88f6842>] ext3_fill_super+0x0/0x9a2 [ext3]
Jul 14 23:16:15 dbtools01 kernel:  [<c016042b>] do_kern_mount+0x8a/0x147
Jul 14 23:16:15 dbtools01 kernel:  [<c01732a3>] do_new_mount+0x61/0x90
Jul 14 23:16:15 dbtools01 kernel:  [<c01738f0>] do_mount+0x178/0x190
Jul 14 23:16:15 dbtools01 kernel:  [<c0173c47>] sys_mount+0x91/0x108
Jul 14 23:16:15 dbtools01 kernel:  [<c02d268f>] syscall_call+0x7/0xb
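
To make traces like this easier to compare between runs, here is a small
Python sketch that pulls the symbol and module out of each frame. The
regular expression is written against the 2.6-style lines above, and the
function name is just for illustration.

```python
import re

# One frame looks like:
#   [<f8af263c>] drbd_make_request_common+0x499/0x744 [drbd]
# The trailing "[module]" is absent for symbols built into the kernel.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\S+?)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?:\s+\[(\w+)\])?"
)


def parse_frames(log_text):
    """Return a list of frames, in the order they appear in the log."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        addr, symbol, offset, size, module = match.groups()
        frames.append({
            "addr": int(addr, 16),
            "symbol": symbol,
            "offset": int(offset, 16),
            "size": int(size, 16),
            "module": module or "kernel",
        })
    return frames


# First few frames copied from the trace above.
SAMPLE_TRACE = """\
Jul 14 23:16:15 dbtools01 kernel:  [<c02cfbb1>] __down+0x81/0xdb
Jul 14 23:16:15 dbtools01 kernel:  [<f8af934c>] .text.lock.drbd_main+0x41/0x18c [drbd]
Jul 14 23:16:15 dbtools01 kernel:  [<f8af263c>] drbd_make_request_common+0x499/0x744 [drbd]
Jul 14 23:16:15 dbtools01 kernel:  [<f885e89c>] __split_bio+0xfd/0x103 [dm_mod]
"""

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE_TRACE):
        print(frame["module"], frame["symbol"])
```

Filtering on module == "drbd" then shows that the innermost DRBD frames
are the lock stub in drbd_main and drbd_make_request_common, which is
where mount is sleeping.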

It's a bit odd that it's doing this. Is there perhaps some bio that DRBD
isn't handling properly, so that when the node comes back up and replays
the ext3 journal, it hits the same bio again?

Bradley


