<div dir="ltr"><div><div>So, no ideas concerning this, then? I've now seen the same thing happen on another resource. Actually, it doesn't need to be a snapshot: removing any logical volume causes the oops. It doesn't happen for every resource, though. I wonder if it's something to do with the frequency of other I/O? Both affected resources have intermittent spikes in I/O (from databases), but on average are not under heavy load. I've tried destroying the resource completely, re-creating both sides from scratch, creating a new LV on the resource and copying the data back onto it, but the same thing is happening again (the remote peer oopses when I create and then remove an LV).<br><br></div>What can I do to debug this further?<br><br></div>Paul<br></div><div class="gmail_extra"><br><div class="gmail_quote">On 16 June 2015 at 11:51, Paul Gideon Dann <span dir="ltr">&lt;<a href="mailto:pdgiddie@gmail.com" target="_blank">pdgiddie@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>This is an interesting (though frustrating) issue that I've run into with DRBD+LVM, and having finally exhausted everything I can think of or find myself, I'm hoping the mailing list might be able to offer some help!<br><br></div><div>My setup involves DRBD resources that are backed by LVM LVs, and then formatted as PVs themselves, each forming its own VG.<br><br></div><div>System VG -&gt; Backing LV -&gt; DRBD -&gt; Resource VG -&gt; Resource LVs<br><br></div><div>The problem I'm having happens only for one DRBD resource, and not for any of the others. This is what I do:<br><br></div><div>I create a snapshot of the Resource LV (meaning that the snapshot will also be replicated via DRBD), and everything is fine. 
However, when I *remove* the snapshot, the *secondary* peer oopses immediately:<br><br>====================<br>[ 738.167953] BUG: unable to handle kernel NULL pointer dereference at (null)<br>[ 738.167984] IP: [&lt;ffffffffc09176fc&gt;] drbd_endio_write_sec_final+0x9c/0x490 [drbd]<br>[ 738.168004] PGD 0 <br>[ 738.168010] Oops: 0002 [#1] SMP <br>[ 738.168028] Modules linked in: dm_snapshot dm_bufio vhost_net vhost macvtap macvlan ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables 8021q garp mrp drbd lru_cache libcrc32c bridge stp llc adt7475 hwmon_vid nouveau mxm_wmi wmi video ttm drm_kms_helper <br>[ 738.168192] CPU: 5 PID: 1963 Comm: drbd_r_vm-sql-s Not tainted 3.16.0-39-generic #53~14.04.1-Ubuntu<br>[ 738.168199] Hardware name: Intel S5000XVN/S5000XVN, BIOS S5000.86B.10.00.0084.101720071530 10/17/2007<br>[ 738.168206] task: ffff8808292632f0 ti: ffff880824b60000 task.ti: ffff880824b60000<br>[ 738.168212] RIP: 0010:[&lt;ffffffffc09176fc&gt;] [&lt;ffffffffc09176fc&gt;] drbd_endio_write_sec_final+0x9c/0x490 [drbd]<br>[ 738.168225] RSP: 0018:ffff880824b63ca0 EFLAGS: 00010093<br>[ 738.168230] RAX: 0000000000000000 RBX: ffff88081647de80 RCX: 000000000000b028<br>[ 738.168236] RDX: ffff88081647da00 RSI: 0000000000000202 RDI: ffff880829cc26d0<br>[ 738.168242] RBP: ffff880824b63d18 R08: 0000000000000246 R09: 0000000000000002<br>[ 738.168247] R10: 0000000000000246 R11: 0000000000000005 R12: ffff88082f8ffae0<br>[ 738.168253] R13: ffff8804b5f46428 R14: ffff880829fd9800 R15: ffff880829fd9bb0<br>[ 738.168259] FS: 0000000000000000(0000) GS:ffff88085fd40000(0000) knlGS:0000000000000000<br>[ 738.168265] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b<br>[ 738.168270] CR2: 0000000000000000 CR3: 000000082a097000 CR4: 00000000000027e0<br>[ 738.168276] Stack:<br>[ 738.168279] ffff880824b63ca8 ffff880800000000 0000000000060006 ffff88081647deb8<br>[ 738.168290] 0000000000000000 0000000000000000 0000000007600800 0000000000400000<br>[ 738.168300] 0000000000000000 
0000000000000000 0000000007600800 0000000000000000<br>[ 738.168310] Call Trace:<br>[ 738.168321] [&lt;ffffffffc0927696&gt;] drbd_submit_peer_request+0x86/0x360 [drbd]<br>[ 738.168333] [&lt;ffffffffc09282d1&gt;] receive_Data+0x3a1/0xfa0 [drbd]<br>[ 738.168342] [&lt;ffffffffc091c73a&gt;] ? drbd_recv+0x2a/0x1c0 [drbd]<br>[ 738.168353] [&lt;ffffffffc092a255&gt;] drbd_receiver+0x115/0x250 [drbd]<br>[ 738.168364] [&lt;ffffffffc09345a0&gt;] ? drbd_destroy_connection+0xc0/0xc0 [drbd]<br>[ 738.168375] [&lt;ffffffffc09345eb&gt;] drbd_thread_setup+0x4b/0x130 [drbd]<br>[ 738.168385] [&lt;ffffffffc09345a0&gt;] ? drbd_destroy_connection+0xc0/0xc0 [drbd]<br>[ 738.168395] [&lt;ffffffff81091522&gt;] kthread+0xd2/0xf0<br>[ 738.168402] [&lt;ffffffff81091450&gt;] ? kthread_create_on_node+0x1c0/0x1c0<br>[ 738.168410] [&lt;ffffffff8176dd98&gt;] ret_from_fork+0x58/0x90<br>[ 738.168416] [&lt;ffffffff81091450&gt;] ? kthread_create_on_node+0x1c0/0x1c0<br>[ 738.168422] Code: 48 8d b8 d0 00 00 00 e8 73 62 e5 c0 8b 53 58 49 89 c2 c1 ea 09 41 01 96 54 02 00 00 49 83 fd ff 48 8b 13 48 8b 43 08 48 89 42 08 &lt;48&gt; 89 10 49 8b 86 c8 03 00 00 49 8d 96 c0 03 00 00 49 89 9e c8 <br>[ 738.168513] RIP [&lt;ffffffffc09176fc&gt;] drbd_endio_write_sec_final+0x9c/0x490 [drbd]<br>[ 738.168524] RSP &lt;ffff880824b63ca0&gt;<br>[ 738.168528] CR2: 0000000000000000<br>====================<br><br></div><div>At that point, the IO stack seems to be completely frozen up: the drbd kernel threads are stuck in D state, and the system becomes completely unresponsive.<br><br></div><div>The system is Ubuntu Trusty 14.04.<br></div><div>Kernel is 3.16.0-39-generic<br></div><div>drbd-utils is 2:8.4.4-1ubuntu1<br><br></div><div>DRBD config for the resource is:<br><br>====================<br>
<div>
<span style="font-family:monospace"><span style="color:rgb(0,0,0);background-color:rgb(255,255,255)">resource vm-sql-server {
</span><br> device /dev/drbd5;
<br> meta-disk internal;
<br> net {
<br> protocol A;
<br> }
<br> on mars {
<br> disk /dev/mars/drbd-backend-sql-server;
<br> address 192.168.254.101:7794;
<br> }
<br> on venus {
<br> disk /dev/venus/drbd-backend-sql-server;
<br> address 192.168.254.102:7794;
<br> }
<br>}</span><br>====================</div>
</div><div><br></div><div>My LVM filter looks like this:<br>filter = [ "a|^/dev/sd.[0-9]+|" "a|^/dev/md[0-9]+|" "a|^/dev/drbd/by-res/.*|" "r|.*|" ]<br></div><div><br></div><div>I've tried switching the protocol to C, and I've tried completely resyncing the secondary. I'm out of ideas. Any help would be greatly appreciated!<br><br></div><div>Cheers,<br></div><div>Paul<br></div></div>
</blockquote></div><br></div>