[DRBD-user] Kernel Oops on peer when removing LVM snapshot

Paul Gideon Dann pdgiddie at gmail.com
Tue Jun 23 14:15:39 CEST 2015

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


OK, I think I've finally discovered the root trigger for this. The problem
seems to be the TRIM that lvremove sends down through DRBD. When removing
the LV, lvremove issues a TRIM (discard) on the underlying storage, which in
this case is the DRBD device. DRBD (which supports TRIM) sends the command
over the wire, and this apparently triggers a bug at the other end.
(Incidentally, the local disk is an SSD; the bug has been triggered with
both SSD and rotational disks on the remote side.)
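
If it helps anyone trying to reproduce this, a quick way to confirm that the
DRBD device is advertising discard support at all (drbd5 is the device from
my config further down; adjust for your own resource):

====================
# Non-zero DISC-GRAN / DISC-MAX means the device accepts discards
lsblk --discard /dev/drbd5

# Equivalent check via sysfs
cat /sys/block/drbd5/queue/discard_max_bytes
====================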

The workaround is to set "issue_discards = 0" in /etc/lvm/lvm.conf. After
initial testing, it seems to have fixed the issue completely, and I'm not
fussed about the loss of automatic TRIMs.
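
For clarity, the setting lives in the devices section of lvm.conf; this is
all I changed:

====================
# /etc/lvm/lvm.conf
devices {
    # Don't issue discards to a PV when an LV is removed or reduced;
    # this stops lvremove sending a TRIM down through DRBD.
    issue_discards = 0
}
====================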

Hope this helps someone else in the same situation!

Is there a process for bug reporting?

Paul


On 22 June 2015 at 10:06, Paul Gideon Dann <pdgiddie at gmail.com> wrote:

> So no ideas concerning this, then? I've seen the same thing happen on
> another resource, now. Actually, it doesn't need to be a snapshot: removing
> any logical volume causes the oops. It doesn't happen for every resource,
> though. I wonder if it's something to do with the frequency of other I/O?
> They both have intermittent spikes in I/O (from databases), but on average
> are not under heavy load. I've tried destroying the resource completely,
> re-creating both sides from scratch, creating a new LV on the resource and
> copying the data back onto it, but the same thing is happening again (oops
> on remote when I create and remove an LV).
>
> What can I do to debug this further?
>
> Paul
>
> On 16 June 2015 at 11:51, Paul Gideon Dann <pdgiddie at gmail.com> wrote:
>
>> This is an interesting (though frustrating) issue that I've run into with
>> DRBD+LVM, and having finally exhausted everything I can think of or find
>> myself, I'm hoping the mailing list might be able to offer some help!
>>
>> My setup involves DRBD resources that are backed by LVM LVs; each DRBD
>> device is then itself used as a PV, forming its own VG.
>>
>> System VG -> Backing LV -> DRBD -> Resource VG -> Resource LVs
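>>
>> (In case the layering isn't clear, this is roughly how one of these stacks
>> is put together; the names below are illustrative, not my real volume
>> names:)
>>
>> ====================
>> # Backing LV in the system VG
>> lvcreate -L 100G -n drbd-backend-example system_vg
>>
>> # DRBD resource "example" on top of it (metadata/initial sync omitted)
>> drbdadm create-md example
>> drbdadm up example
>>
>> # The DRBD device itself becomes a PV with its own VG
>> pvcreate /dev/drbd/by-res/example
>> vgcreate example_vg /dev/drbd/by-res/example
>> lvcreate -L 50G -n data example_vg
>> ====================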
>>
>> The problem I'm having happens only for one DRBD resource, and not for
>> any of the others. This is what I do:
>>
>> I create a snapshot of the Resource LV (meaning that the snapshot will
>> also be replicated via DRBD), and everything is fine. However, when I
>> *remove* the snapshot, the *secondary* peer oopses immediately:
>>
>> ====================
>> [  738.167953] BUG: unable to handle kernel NULL pointer dereference
>> at           (null)
>> [  738.167984] IP: [<ffffffffc09176fc>]
>> drbd_endio_write_sec_final+0x9c/0x490 [drbd]
>> [  738.168004] PGD 0
>> [  738.168010] Oops: 0002 [#1] SMP
>> [  738.168028] Modules linked in: dm_snapshot dm_bufio vhost_net vhost
>> macvtap macvlan ip6table_filter ip6_tables iptable_filter ip_tables
>> ebtable_nat ebtables x_tables 8021q garp mrp drbd lru_cache libcrc32c
>> bridge stp llc adt7475 hwmon_vid nouveau mxm_wmi wmi video ttm
>> drm_kms_helper
>> [  738.168192] CPU: 5 PID: 1963 Comm: drbd_r_vm-sql-s Not tainted
>> 3.16.0-39-generic #53~14.04.1-Ubuntu
>> [  738.168199] Hardware name: Intel S5000XVN/S5000XVN, BIOS
>> S5000.86B.10.00.0084.101720071530 10/17/2007
>> [  738.168206] task: ffff8808292632f0 ti: ffff880824b60000 task.ti:
>> ffff880824b60000
>> [  738.168212] RIP: 0010:[<ffffffffc09176fc>]  [<ffffffffc09176fc>]
>> drbd_endio_write_sec_final+0x9c/0x490 [drbd]
>> [  738.168225] RSP: 0018:ffff880824b63ca0  EFLAGS: 00010093
>> [  738.168230] RAX: 0000000000000000 RBX: ffff88081647de80 RCX:
>> 000000000000b028
>> [  738.168236] RDX: ffff88081647da00 RSI: 0000000000000202 RDI:
>> ffff880829cc26d0
>> [  738.168242] RBP: ffff880824b63d18 R08: 0000000000000246 R09:
>> 0000000000000002
>> [  738.168247] R10: 0000000000000246 R11: 0000000000000005 R12:
>> ffff88082f8ffae0
>> [  738.168253] R13: ffff8804b5f46428 R14: ffff880829fd9800 R15:
>> ffff880829fd9bb0
>> [  738.168259] FS:  0000000000000000(0000) GS:ffff88085fd40000(0000)
>> knlGS:0000000000000000
>> [  738.168265] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [  738.168270] CR2: 0000000000000000 CR3: 000000082a097000 CR4:
>> 00000000000027e0
>> [  738.168276] Stack:
>> [  738.168279]  ffff880824b63ca8 ffff880800000000 0000000000060006
>> ffff88081647deb8
>> [  738.168290]  0000000000000000 0000000000000000 0000000007600800
>> 0000000000400000
>> [  738.168300]  0000000000000000 0000000000000000 0000000007600800
>> 0000000000000000
>> [  738.168310] Call Trace:
>> [  738.168321]  [<ffffffffc0927696>] drbd_submit_peer_request+0x86/0x360
>> [drbd]
>> [  738.168333]  [<ffffffffc09282d1>] receive_Data+0x3a1/0xfa0 [drbd]
>> [  738.168342]  [<ffffffffc091c73a>] ? drbd_recv+0x2a/0x1c0 [drbd]
>> [  738.168353]  [<ffffffffc092a255>] drbd_receiver+0x115/0x250 [drbd]
>> [  738.168364]  [<ffffffffc09345a0>] ? drbd_destroy_connection+0xc0/0xc0
>> [drbd]
>> [  738.168375]  [<ffffffffc09345eb>] drbd_thread_setup+0x4b/0x130 [drbd]
>> [  738.168385]  [<ffffffffc09345a0>] ? drbd_destroy_connection+0xc0/0xc0
>> [drbd]
>> [  738.168395]  [<ffffffff81091522>] kthread+0xd2/0xf0
>> [  738.168402]  [<ffffffff81091450>] ? kthread_create_on_node+0x1c0/0x1c0
>> [  738.168410]  [<ffffffff8176dd98>] ret_from_fork+0x58/0x90
>> [  738.168416]  [<ffffffff81091450>] ? kthread_create_on_node+0x1c0/0x1c0
>> [  738.168422] Code: 48 8d b8 d0 00 00 00 e8 73 62 e5 c0 8b 53 58 49 89
>> c2 c1 ea 09 41 01 96 54 02 00 00 49 83 fd ff 48 8b 13 48 8b 43 08 48 89 42
>> 08 <48> 89 10 49 8b 86 c8 03 00 00 49 8d 96 c0 03 00 00 49 89 9e c8
>> [  738.168513] RIP  [<ffffffffc09176fc>]
>> drbd_endio_write_sec_final+0x9c/0x490 [drbd]
>> [  738.168524]  RSP <ffff880824b63ca0>
>> [  738.168528] CR2: 0000000000000000
>> ====================
>>
>> At that point, the IO stack seems to be completely frozen up: the drbd
>> kernel threads are stuck in D state, and the system becomes completely
>> unresponsive.
>>
>> The system is Ubuntu Trusty 14.04.
>> Kernel is 3.16.0-39-generic
>> drbd-utils is 2:8.4.4-1ubuntu1
>>
>> DRBD config for the resource is:
>>
>> ====================
>>  resource vm-sql-server {
>>  device /dev/drbd5;
>>  meta-disk internal;
>>  net {
>>    protocol A;
>>  }
>>  on mars {
>>    disk /dev/mars/drbd-backend-sql-server;
>>    address 192.168.254.101:7794;
>>  }
>>  on venus {
>>    disk /dev/venus/drbd-backend-sql-server;
>>    address 192.168.254.102:7794;
>>  }
>> }
>> ====================
>>
>> My LVM filter looks like this:
>> filter = [ "a|^/dev/sd.[0-9]+|" "a|^/dev/md[0-9]+|"
>> "a|^/dev/drbd/by-res/.*|" "r|.*|" ]
>>
>> I've tried switching the protocol to C, and I've tried completely
>> resyncing the secondary. I'm out of ideas. Any help would be greatly
>> appreciated!
>>
>> Cheers,
>> Paul
>>
>
>