Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
So no ideas concerning this, then? I've now seen the same thing happen on another resource. Actually, it doesn't need to be a snapshot: removing any logical volume causes the oops. It doesn't happen for every resource, though. I wonder if it's something to do with the frequency of other I/O? Both affected resources have intermittent spikes in I/O (from databases), but on average are not under heavy load.

I've tried destroying the resource completely, re-creating both sides from scratch, and creating a new LV on the resource and copying the data back onto it, but the same thing is happening again (an oops on the remote node when I create and then remove an LV). What can I do to debug this further?

Paul

On 16 June 2015 at 11:51, Paul Gideon Dann <pdgiddie at gmail.com> wrote:
> This is an interesting (though frustrating) issue that I've run into with
> DRBD+LVM, and having finally exhausted everything I can think of or find
> myself, I'm hoping the mailing list might be able to offer some help!
>
> My setup involves DRBD resources that are backed by LVM LVs, and are then
> formatted as PVs themselves, each forming its own VG:
>
> System VG -> Backing LV -> DRBD -> Resource VG -> Resource LVs
>
> The problem I'm having happens only for one DRBD resource, and not for any
> of the others. This is what I do:
>
> I create a snapshot of the Resource LV (meaning that the snapshot will
> also be replicated via DRBD), and everything is fine. However, when I
> *remove* the snapshot, the *secondary* peer oopses immediately:
>
> ====================
> [  738.167953] BUG: unable to handle kernel NULL pointer dereference at (null)
> [  738.167984] IP: [<ffffffffc09176fc>] drbd_endio_write_sec_final+0x9c/0x490 [drbd]
> [  738.168004] PGD 0
> [  738.168010] Oops: 0002 [#1] SMP
> [  738.168028] Modules linked in: dm_snapshot dm_bufio vhost_net vhost macvtap macvlan ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables 8021q garp mrp drbd lru_cache libcrc32c bridge stp llc adt7475 hwmon_vid nouveau mxm_wmi wmi video ttm drm_kms_helper
> [  738.168192] CPU: 5 PID: 1963 Comm: drbd_r_vm-sql-s Not tainted 3.16.0-39-generic #53~14.04.1-Ubuntu
> [  738.168199] Hardware name: Intel S5000XVN/S5000XVN, BIOS S5000.86B.10.00.0084.101720071530 10/17/2007
> [  738.168206] task: ffff8808292632f0 ti: ffff880824b60000 task.ti: ffff880824b60000
> [  738.168212] RIP: 0010:[<ffffffffc09176fc>]  [<ffffffffc09176fc>] drbd_endio_write_sec_final+0x9c/0x490 [drbd]
> [  738.168225] RSP: 0018:ffff880824b63ca0  EFLAGS: 00010093
> [  738.168230] RAX: 0000000000000000 RBX: ffff88081647de80 RCX: 000000000000b028
> [  738.168236] RDX: ffff88081647da00 RSI: 0000000000000202 RDI: ffff880829cc26d0
> [  738.168242] RBP: ffff880824b63d18 R08: 0000000000000246 R09: 0000000000000002
> [  738.168247] R10: 0000000000000246 R11: 0000000000000005 R12: ffff88082f8ffae0
> [  738.168253] R13: ffff8804b5f46428 R14: ffff880829fd9800 R15: ffff880829fd9bb0
> [  738.168259] FS:  0000000000000000(0000) GS:ffff88085fd40000(0000) knlGS:0000000000000000
> [  738.168265] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  738.168270] CR2: 0000000000000000 CR3: 000000082a097000 CR4: 00000000000027e0
> [  738.168276] Stack:
> [  738.168279]  ffff880824b63ca8 ffff880800000000 0000000000060006 ffff88081647deb8
> [  738.168290]  0000000000000000 0000000000000000 0000000007600800 0000000000400000
> [  738.168300]  0000000000000000 0000000000000000 0000000007600800 0000000000000000
> [  738.168310] Call Trace:
> [  738.168321]  [<ffffffffc0927696>] drbd_submit_peer_request+0x86/0x360 [drbd]
> [  738.168333]  [<ffffffffc09282d1>] receive_Data+0x3a1/0xfa0 [drbd]
> [  738.168342]  [<ffffffffc091c73a>] ? drbd_recv+0x2a/0x1c0 [drbd]
> [  738.168353]  [<ffffffffc092a255>] drbd_receiver+0x115/0x250 [drbd]
> [  738.168364]  [<ffffffffc09345a0>] ? drbd_destroy_connection+0xc0/0xc0 [drbd]
> [  738.168375]  [<ffffffffc09345eb>] drbd_thread_setup+0x4b/0x130 [drbd]
> [  738.168385]  [<ffffffffc09345a0>] ? drbd_destroy_connection+0xc0/0xc0 [drbd]
> [  738.168395]  [<ffffffff81091522>] kthread+0xd2/0xf0
> [  738.168402]  [<ffffffff81091450>] ? kthread_create_on_node+0x1c0/0x1c0
> [  738.168410]  [<ffffffff8176dd98>] ret_from_fork+0x58/0x90
> [  738.168416]  [<ffffffff81091450>] ? kthread_create_on_node+0x1c0/0x1c0
> [  738.168422] Code: 48 8d b8 d0 00 00 00 e8 73 62 e5 c0 8b 53 58 49 89 c2 c1 ea 09 41 01 96 54 02 00 00 49 83 fd ff 48 8b 13 48 8b 43 08 48 89 42 08 <48> 89 10 49 8b 86 c8 03 00 00 49 8d 96 c0 03 00 00 49 89 9e c8
> [  738.168513] RIP  [<ffffffffc09176fc>] drbd_endio_write_sec_final+0x9c/0x490 [drbd]
> [  738.168524]  RSP <ffff880824b63ca0>
> [  738.168528] CR2: 0000000000000000
> ====================
>
> At that point, the IO stack seems to be completely frozen up: the drbd
> kernel threads are stuck in D state, and the system becomes completely
> unresponsive.
>
> The system is Ubuntu Trusty 14.04.
> Kernel is 3.16.0-39-generic.
> drbd-utils is 2:8.4.4-1ubuntu1.
>
> DRBD config for the resource is:
>
> ====================
> resource vm-sql-server {
>     device    /dev/drbd5;
>     meta-disk internal;
>     net {
>         protocol A;
>     }
>     on mars {
>         disk    /dev/mars/drbd-backend-sql-server;
>         address 192.168.254.101:7794;
>     }
>     on venus {
>         disk    /dev/venus/drbd-backend-sql-server;
>         address 192.168.254.102:7794;
>     }
> }
> ====================
>
> My LVM filter looks like this:
> filter = [ "a|^/dev/sd.[0-9]+|" "a|^/dev/md[0-9]+|" "a|^/dev/drbd/by-res/.*|" "r|.*|" ]
>
> I've tried switching the protocol to C, and I've tried completely
> resyncing the secondary. I'm out of ideas. Any help would be greatly
> appreciated!
>
> Cheers,
> Paul
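
For anyone trying to reproduce the stacked layout described in the quoted message, here is a minimal sketch of how such a "System VG -> Backing LV -> DRBD -> Resource VG -> Resource LV" stack is typically assembled. The backing LV name, resource name, and node VG names ("mars"/"venus") are taken from the config above; the resource VG name "vg_sql" and all sizes are assumptions for illustration only, not the poster's actual values.

====================
# On each node: carve a backing LV for DRBD out of the system VG
# (VG is "mars" on one node and "venus" on the other; the size is an assumed value)
lvcreate -L 100G -n drbd-backend-sql-server mars

# Initialise DRBD metadata on the backing device and bring the resource up
drbdadm create-md vm-sql-server
drbdadm up vm-sql-server

# On the chosen primary only: promote, then stack LVM on top of /dev/drbd5
drbdadm primary --force vm-sql-server   # --force only for the very first promotion
pvcreate /dev/drbd5                     # the DRBD device itself becomes a PV...
vgcreate vg_sql /dev/drbd5              # ...forming its own Resource VG ("vg_sql" is an assumed name)
lvcreate -L 50G -n data vg_sql          # a Resource LV whose contents are replicated via DRBD
====================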
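The snapshot create/remove cycle that triggers the oops on the secondary would then look roughly like this (again with assumed LV and snapshot names). As the reply notes, a plain non-snapshot LV create/remove reportedly reproduces it as well.

====================
# On the primary: snapshot a Resource LV inside the replicated VG
# (the snapshot's copy-on-write activity is replicated to the peer like any other write)
lvcreate -s -L 5G -n data-snap /dev/vg_sql/data

# ... use the snapshot, e.g. for a backup ...

# Removing the snapshot is the step after which the *secondary* node oopses
# in drbd_endio_write_sec_final()
lvremove -y /dev/vg_sql/data-snap

# Per the reply above, creating and removing an ordinary LV shows the same behaviour:
lvcreate -L 1G -n scratch vg_sql
lvremove -y /dev/vg_sql/scratch
====================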