<div dir="ltr">To follow up, one of our other engineers may have discovered why drbd won't auto-promote in this use case. It turns out that zfs 0.7.12 opens the device by passing the flag FMODE_EXCL into blkdev_get_by_path(), which drbd isn't detecting during auto-promote. He plans to backport a change from zfs 0.8 that uses the flag FMODE_WRITE, which should trigger the code path that auto-promotes the resource in drbd.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Nov 15, 2019 at 2:57 PM Doug Cahill <<a href="mailto:handruin@gmail.com">handruin@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Nov 15, 2019 at 4:34 AM Robert Altnoeder <<a href="mailto:robert.altnoeder@linbit.com" target="_blank">robert.altnoeder@linbit.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Could you try a few things, so we can get a better picture of what's<br>
happening there:<br>
- Can you get a hash of the data on the backend device that DRBD is<br>
writing to, from before and after one of those dubious Secondary-mode<br>
writes, to verify whether or not any data is actually changed?<br></blockquote><div><br></div><div>Looks like the sha256sum changes on both primary and secondary local backing devices to drbd when I write to my zpool that has the drbd log device.</div><div><br></div><div>node0 before sha256sum</div><div>[root@dccdx0 ~]# dd if=/dev/sda1 bs=8M iflag=direct | sha256sum<br>17179869184 bytes (17 GB) copied, 69.4921 s, 247 MB/s<br>730d0f908c64a42ffc168211350b87a72c72ed56de2feef4be0904342acf20ac -<br></div><div><br></div><div>node1 before sha256sum</div><div>[root@dccdx1 ~]# dd if=/dev/sda1 bs=8M iflag=direct | sha256sum<br>17179869184 bytes (17 GB) copied, 70.4586 s, 244 MB/s<br>adbab9ee2a96ed476fe649cd10dc17994767190ae350a7be146c40427e272a73 -<br></div><div><br></div><div>Write test:</div><div>[root@dccdx0 ~]# dd if=/dev/urandom of=/dev/zvol/act_per_pool000/test_drbd bs=4k count=100000 oflag=sync,direct<br>409600000 bytes (410 MB) copied, 44.8633 s, 9.1 MB/s<br></div><div><br></div><div>node0 after sha256sum<br></div><div><div>[root@dccdx0 ~]# dd if=/dev/sda1 bs=8M iflag=direct | sha256sum<br></div><div>17179869184 bytes (17 GB) copied, 71.6324 s, 240 MB/s<br>e8c02e50daf281973b04ea1b76e6cdb8760a789245ade987ba5410deba68067d -<br></div><div></div></div><div><br></div><div>node1 after sha256sum<br></div><div>[root@dccdx1 ~]# dd if=/dev/sda1 bs=8M iflag=direct | sha256sum<br>2048+0 records in<br>2048+0 records out<br>17179869184 bytes (17 GB) copied, 68.326 s, 251 MB/s<br><div>da8b90e8f57c20e4ea47a498157cb2865249d8b8cefc36aedb49a4467572924f -</div><div> <br></div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
- Can you switch the other peer into the Primary role manually (so that<br>
the node where the problem occurs should refuse to become a Primary) and<br>
see what happens when ZFS tries to write to that log?<br></blockquote><div><br></div><div>Attempt 1 with zpool imported:</div><div>This is the secondary side where the zpool is not imported.</div><div>[root@dccdx1 ~]# drbdadm primary r0<br>r0: State change failed: (-10) State change was refused by peer node<br>additional info from kernel:<br>Declined by peer dccdx0 (id: 1), see the kernel log there<br>Command 'drbdsetup primary r0' terminated with exit code 11<br></div><div><br></div><div>Info logged in /var/log/messages:</div><div>dccdx1: Preparing remote state change 1839699090<br></div><div>dccdx0 kernel: [171910.178046] drbd r0: State change failed: Peer may not become primary while device is opened read-only<br>dccdx0 kernel: [171910.195954] drbd r0 dccdx1: Aborting remote state change 1839699090<br></div><div><br></div><div>Attempt 2 with zpool exported:</div><div>Export the pool on the primary node.</div><div>On the secondary node I promote the drbd resource to primary:</div><div>[root@dccdx1 ~]# drbdadm primary r0<br>[root@dccdx1 ~]# drbdadm status<br>r0 role:Primary<br> disk:UpToDate<br> dccdx0 role:Secondary<br> peer-disk:UpToDate<br></div><div><br></div><div>Import the zpool on the original primary node, where the drbd device is now secondary:</div><div>[root@dccdx0 ~]# zpool import -f -o cachefile=none -d /dev/drbd/by-disk/disk/by-path -d /dev/disk/by-path -d /dev/mapper act_per_pool000<br>The devices below are missing, use '-m' to import the pool anyway:<br>         pci-0000:18:00.0-scsi-0:0:2:0-part1 [log]<br>cannot import 'act_per_pool000': one or more devices is currently unavailable<br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
I tried to reproduce the problem from user space (with auto-promote off<br>
and trying to read/write from/to a Secondary), where it does not seem to<br>
happen (everything is normal, cannot even read from a Secondary).<br>
However, I expect those ZFS operations to be done by some code in the<br>
kernel itself, and something may not be playing by the rules there -<br>
maybe ZFS is causing some I/O without doing a proper open/close cycle,<br>
or we are missing something in DRBD for some I/O case that's supposed to<br>
be valid.<br></blockquote><div><br></div><div>Another dev is looking into using SystemTap so we can trace the kernel calls that open block devices and see why they aren't being flagged as opened for writing. We are equally puzzled about whether, and how, the zfs vdisk kernel call is being captured so that drbd can detect it and auto-promote.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
br,<br>
Robert<br>
<br>
On 11/14/19 10:28 PM, Doug Cahill wrote:<br>
> I spent some more time looking into this with another developer and I<br>
> can see while running "drbdsetup events2 r0" that there is a quick<br>
> blip when I add the drbd r0 resource to my pool as the log device:<br>
><br>
> change resource name:r0 role:Primary<br>
> change resource name:r0 role:Secondary<br>
><br>
> However, if I export and/or import the pool, the event never registers<br>
> again. When I write to a vdisk on this pool I can see the nr:11766480<br>
> dw:11766452 counts increase on the peer which leads me to believe<br>
> blocks are being written, yet the state never changes.<br>
><br>
> I also tried to run dd to the "peer" side drbd device while the<br>
> "active" side was writing data and found a message stating the peer<br>
> may not become primary while the device is opened read-only in my<br>
> syslog, which doesn't make sense. The device is being written to, so<br>
> how is the block device state being tricked into thinking it is read-only?<br>
><br>
> =========in the log from the node I'm writing to the drbd resource<br>
> drbd r0 dccdx0: Preparing remote state change 892694821<br>
> drbd r0: State change failed: Peer may not become primary while device is opened read-only<br>
> kernel: [92771.927574] drbd r0 dccdx0: Aborting remote state change 892694821<br>
><br>
> On Thu, Nov 14, 2019 at 10:39 AM Doug Cahill <<a href="mailto:handruin@gmail.com" target="_blank">handruin@gmail.com</a>> wrote:<br>
><br>
> On Thu, Nov 14, 2019 at 4:52 AM Roland Kammerer <<a href="mailto:roland.kammerer@linbit.com" target="_blank">roland.kammerer@linbit.com</a>> wrote:<br>
><br>
> On Wed, Nov 13, 2019 at 03:08:37PM -0500, Doug Cahill wrote:<br>
> > I'm configuring a two node setup with drbd 9.0.20-1 on CentOS 7<br>
> > (3.10.0-957.1.3.el7.x86_64) with a single resource backed by an SSD. I've<br>
> > explicitly enabled auto-promote in my resource configuration to use this<br>
> > feature.<br>
> ><br>
> > The drbd device is being used in a single-primary configuration as a zpool<br>
> > SLOG device. The zpool is only ever imported on one node at a time and the<br>
> > import is successful during cluster failover events between nodes. I<br>
> > confirmed through zdb that the zpool includes the configured drbd device<br>
> > path.<br>
> ><br>
> > My concern is that the drbdadm status output shows the Role of the drbd<br>
> > resource as "Secondary" on both sides. The documentation says that the<br>
> > drbd resource will be auto-promoted to primary when it is opened for<br>
> > writing.<br>
><br>
> But also demoted when closed (don't know if this happens in your<br>
> scenario).<br>
><br>
> > drbdadm status<br>
> > r0 role:Secondary<br>
> > disk:UpToDate<br>
> > dccdx0 role:Secondary<br>
> > peer-disk:UpToDate<br>
><br>
> Maybe it is closed and demoted again and you look at it at the wrong<br>
> points in time? Better look into the syslog for role changes, or monitor<br>
> with "drbdsetup events2 r0". Do you see switches to Primary there?<br>
><br>
><br>
> I checked the drbdadm status while my dd write session was in<br>
> progress and I see no change from Secondary to Primary. I also<br>
> checked the stats under /sys/class and it looks the same.<br>
><br>
> cat /sys/kernel/debug/drbd/resources/r0/connections/dccdx0/0/proc_drbd<br>
> 0: cs:Established ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----<br>
> ns:3330728 nr:0 dw:20103080 dr:26292 al:131 bm:0 lo:0 pe:[0;0] ua:0 ap:[0;0] ep:1 wo:1 oos:0<br>
> resync: used:0/61 hits:64 misses:4 starving:0 locked:0 changed:2<br>
> act_log: used:0/1237 hits:28951 misses:536 starving:0 locked:0 changed:132<br>
> blocked on activity log: 0/0/0<br>
><br>
><br>
> Best, rck<br>
> _______________________________________________<br>
> Star us on GITHUB: <a href="https://github.com/LINBIT" rel="noreferrer" target="_blank">https://github.com/LINBIT</a><br>
> drbd-user mailing list<br>
> <a href="mailto:drbd-user@lists.linbit.com" target="_blank">drbd-user@lists.linbit.com</a><br>
> <a href="https://lists.linbit.com/mailman/listinfo/drbd-user" rel="noreferrer" target="_blank">https://lists.linbit.com/mailman/listinfo/drbd-user</a><br>
><br>
><br>
<br>
<br>
-- <br>
Robert ALTNOEDER - Software Developer<br>
+43-1-817-82-92 x72<br>
<a href="mailto:robert.altnoeder@linbit.com" target="_blank">robert.altnoeder@linbit.com</a><br>
<br>
</blockquote></div></div>
</div>
</blockquote></div>
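<div dir="ltr"><br></div>
<div dir="ltr">To make the hypothesis at the top of this message concrete, here is a toy Python model of it. This is only a sketch and not drbd or zfs source code: the ToyResource class and its methods are hypothetical, the flag values are the ones I believe the kernel's include/linux/fs.h defines, and the rule "promote only when FMODE_WRITE is set, otherwise count the opener as read-only" is our assumption about the behaviour, inferred from the kernel log message quoted below in the thread.</div>
<pre>
```python
# Toy model only: NOT drbd or zfs source code.
# Flag values assumed to match include/linux/fs.h.
FMODE_READ = 0x1
FMODE_WRITE = 0x2
FMODE_EXCL = 0x80


class ToyResource:
    """Hypothetical stand-in for a drbd resource with auto-promote enabled."""

    def __init__(self):
        self.role = "Secondary"
        self.open_rw = 0
        self.open_ro = 0

    def open(self, mode):
        # Assumption: promotion is attempted only when FMODE_WRITE is set.
        if mode & FMODE_WRITE:
            self.role = "Primary"
            self.open_rw += 1
        else:
            # Any other open, exclusive or not, is counted as read-only.
            self.open_ro += 1
        return self.role

    def peer_may_promote(self):
        # Models "Peer may not become primary while device is opened
        # read-only": a peer is refused while a read-only opener exists.
        return self.role == "Secondary" and self.open_ro == 0


# Assumed zfs 0.7.12 style open: exclusive, but without FMODE_WRITE.
zfs_0_7 = ToyResource()
print(zfs_0_7.open(FMODE_READ | FMODE_EXCL))
print(zfs_0_7.peer_may_promote())

# Assumed open after the planned backport: FMODE_WRITE is included.
backported = ToyResource()
print(backported.open(FMODE_READ | FMODE_WRITE | FMODE_EXCL))
```
</pre>
<div dir="ltr">In this model the first open leaves the resource Secondary and blocks the peer from promoting, which is the combination we observed; the second open promotes it. Whether the real drbd open path behaves this way is exactly what we still need to confirm.</div>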