[DRBD-user] Concurrent local write detected!

Thu Mar 2 22:20:03 CET 2017

On Thu, Mar 02, 2017 at 02:56:58PM -0500, Digimer wrote:
> >> [root@aae-a01n02 ~]# cat /proc/drbd
> >> version: 8.3.16 (api:88/proto:86-97)
> >> GIT-hash: a798fa7e274428a357657fb52f0ecf40192c1985 build by root@rhel6-builder-production.alteeve.ca, 2015-04-05 19:59:27
> >>  0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
> >>     ns:408 nr:2068182 dw:2068586 dr:48408 al:8 bm:115 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> >>  1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
> >>     ns:750365 nr:770052 dw:1520413 dr:1062911 al:15463 bm:145 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> > 
> >> At this point, storage hung (I assume on purpose). Recovery was a full
> >> restart of the cluster.
> >>
> >> Googling doesn't return much on this. Can someone provide insight into
> >> what might have happened? This was a pretty scary event, and it's the
> >> first time I've seen it happen in all the years I've been using DRBD.
> >>
> >> Let me know if there are any other logs or info
> > 
> > I guess drbd0 is your GFS2?
> > 
> > What DRBD tries to tell you there is that, while one WRITE was still
> > "in flight", there was a new WRITE to the same LBA.
> > 
> > Both nodes wrote to the exact same LBA at virtually the same time.
> > 
> > GFS2 is supposed to coordinate write (and read) activity from both nodes
> > such that this won't happen.
> > 
> > If the /proc/drbd above is from during that "storage hang",
> > it indicates that DRBD is not hung at all (nor any requests),
> > but completely "idle" (lo:0, pe:0, ua:0, ap:0 ... nothing in flight).
> > 
> > If it was hung, it was not DRBD.
> > 
> > In that case, my best guess is that the layer above DRBD screwed up,
> > and both the un-coordinated writes to the same blocks,
> > and the hang are symptoms of the same underlying problem.
> > 
> > If you can, force an fsck,
> > and/or figure out what is located at those LBAs
> > (the ####s + ### are start sector + IO size in bytes).
> 
> It's looking like, somehow, one of the servers was booted on both nodes
> at the same time... We're still trying to get access to take a closer
> look to confirm though. If that is what happened, it would match what
> you're saying (and is a massive problem).
> So when this happened, DRBD hung on purpose?

Not at all.  I currently assume that DRBD did NOT hang,
but that the layer on top of DRBD did.

As I said above:
if that /proc/drbd from above was taken while IO appeared to be hung,
it shows that DRBD at that point did not know of a single request.
Nothing in flight, no internal, no remote, no nothing.

According to that /proc/drbd, it just waits for upper layers
(or the peer) to submit some requests, but is otherwise idle.
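
To make the "nothing in flight" reading concrete, here is a minimal sketch
that pulls the lo/pe/ua/ap counters out of 8.3-style /proc/drbd text and
checks whether they are all zero. The function names and the parsing
approach are my own illustration, not a DRBD tool; the sample text is the
output quoted above with the wrapped lines rejoined.

```python
import re

# Counters read above as "requests in flight" (8.3-style /proc/drbd):
#   lo = open requests to the local IO subsystem
#   pe = requests sent to the peer, not yet answered
#   ua = requests received from the peer, not yet acknowledged
#   ap = application requests handed to DRBD, not yet completed
IN_FLIGHT = ("lo", "pe", "ua", "ap")

SAMPLE = """\
version: 8.3.16 (api:88/proto:86-97)
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:408 nr:2068182 dw:2068586 dr:48408 al:8 bm:115 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:750365 nr:770052 dw:1520413 dr:1062911 al:15463 bm:145 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
"""


def in_flight_counters(proc_drbd_text):
    """Return {minor: {counter: value}} for the in-flight counters per device."""
    devices = {}
    current = None
    for line in proc_drbd_text.splitlines():
        # A device section starts with "<minor>: cs:..."
        header = re.match(r"\s*(\d+):\s+cs:", line)
        if header:
            current = int(header.group(1))
            devices[current] = {}
        if current is None:
            continue  # still in the version/GIT-hash preamble
        for key, value in re.findall(r"\b([a-z]+):(\d+)\b", line):
            if key in IN_FLIGHT:
                devices[current][key] = int(value)
    return devices


def is_idle(counters):
    """True if every in-flight counter of one device is zero."""
    return all(value == 0 for value in counters.values())


if __name__ == "__main__":
    for minor, counters in sorted(in_flight_counters(SAMPLE).items()):
        print(minor, counters, "idle" if is_idle(counters) else "busy")
```

A single snapshot only tells you about that instant, of course; sampling
it a few times during an apparent hang would be more convincing.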

DRBD does not hang "on purpose" in these situations; it just noisily
complains about them, and arbitrates in a way that is supposed to
guarantee identical data on both nodes once all such "overlapping"
IO requests have been completed to upper layers.
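
For illustration only, what "overlapping" means for two requests given
as (start sector, IO size in bytes), as in the log messages, can be
sketched like this. It assumes 512-byte sectors, the numbers in the usage
example are made up, and it is not DRBD's actual conflict detection code.

```python
SECTOR_SIZE = 512  # assumption: sector numbers count 512-byte units


def sector_range(start_sector, size_bytes):
    """Half-open sector interval [first, end) covered by one request."""
    sectors = -(-size_bytes // SECTOR_SIZE)  # round up to whole sectors
    return start_sector, start_sector + sectors


def overlaps(request_a, request_b):
    """True if two (start_sector, size_bytes) requests share any sector."""
    a_first, a_end = sector_range(*request_a)
    b_first, b_end = sector_range(*request_b)
    return a_first < b_end and b_first < a_end


if __name__ == "__main__":
    # Hypothetical requests: two 4 KiB writes, four sectors apart.
    print(overlaps((1000, 4096), (1004, 4096)))
    # Hypothetical requests: two adjacent 4 KiB writes.
    print(overlaps((1000, 4096), (1008, 4096)))
```

The same arithmetic gives you the byte offset on the device
(start sector * 512) if you want to find out what lives at those LBAs.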

> If so, that's amazing as the hung storage appears to have saved us.

 :-)

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed