[Drbd-dev] Possible obscure bug

Digimer lists at alteeve.ca
Tue Nov 10 17:41:13 CET 2020


Hi all,

  DRBD v8.4.11.

  We had a case where a client was upgrading from 1Gbps to 10Gbps NICs
backing DRBD. The DRBD interface was a mode=1 (active/passive) bond of
two interfaces. We needed to switch one of the bonded interfaces to a
new physical NIC port. This is a process I've done countless times
before, but this was the first time I did it while DRBD was up on one
node.

  We disconnected the peer that wasn't needed, so DRBD was running alone
on the active node. We ran 'ifdown sn_link1' (the interface we were
about to change, which had been the active link). We confirmed via
/proc/net/bonding/sn_bond1 (the bond) that sn_link1 was out of the bond
and that the bond was now using sn_link2 (the interface that was not
being changed). We also confirmed that the ethX device that was going to
be moved over was down.
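
  For reference, the sequence on the active node was roughly the
following (I'm hedging a bit on the exact invocations; the bond and
interface names are as above):

    ifdown sn_link1
    cat /proc/net/bonding/sn_bond1   # confirmed sn_link1 gone, sn_link2 active
    ip link show                     # confirmed the new ethX port was down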

  We updated /etc/udev/rules.d/70-persistent-net.rules to swap the MAC,
then ran 'start_udev' to rename the interface. Normally this returns in
a couple of seconds, but in this instance it took a couple of minutes.
When it did return, sn_link1 no longer existed, and the bond proc file
also showed that the interface hadn't come back up.

  OK, so all this so far wasn't a big deal. The real concern was that
all of the VMs on LVs backed by DRBD acted like they had lost their hard
drives. I was tailing syslog and there were no entries from DRBD at all.

  I had to stop DRBD and reboot the node to recover. On reboot, the
interface (sn_link1) came up properly on the new NIC and DRBD started
normally.

  I fully understand "well don't do that" as a "fix", and I certainly
will not try this again.

  I'm writing this, though, as I think it might be an indication of a
deeper issue that could bite others in the future. It seems like DRBD
"held open" the interface underneath the bond, despite the NIC being
downed. That DRBD completely stopped allowing disk access, without any
log message to indicate it had hit a problem, makes me think this is not
a known failure condition.
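
  If anyone wants to dig into this, next time I'd try to capture the
DRBD state before rebooting rather than after. Something like this
(resource name 'r0' is just a placeholder here):

    cat /proc/drbd        # connection state, any 'susp' flags
    drbdadm cstate r0
    drbdadm dstate r0
    dmesg | tail -n 50    # blocked-task / network warnings, if any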

  One more interesting tidbit;

  There's a single DRBD resource on this system, acting as a PV for a
clustered VG. Each VM gets an LV, and there's one additional LV for a
GFS2 partition. The GFS2 partition was not in use when the interface
rename was requested, and oddly enough, I _could_ write to it
afterwards. So it seems like only the parts that were in use hung, while
the parts that were not in use stayed accessible.

  Writing this out, I also wonder if this might be an LVM issue rather
than a DRBD issue?
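
  If it were LVM/device-mapper rather than DRBD, I'd expect the stuck
LVs to show up as suspended. A check along these lines (not something I
thought to run at the time) might tell the two apart:

    dmsetup info           # look for devices with 'State: SUSPENDED'
    lvs -o +devices        # which LVs sit on the DRBD-backed PV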

  Any insight/feedback would be much appreciated. It was quite the
pantaloon soiling event, and I'd like to understand just what happened.

digimer

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

