On Thu, Mar 02, 2017 at 03:07:52AM -0500, Digimer wrote:
> Hi all,
>
> We had an event last night on a system that's been in production for a
> couple of years; DRBD 8.3.16. At almost exactly midnight, both nodes
> threw these errors:
>
> =====
> Feb 28 03:42:01 aae-a01n01 rsyslogd: [origin software="rsyslogd"
>  swVersion="5.8.10" x-pid="1729" x-info="http://www.rsyslog.com"]
>  rsyslogd was HUPed
> Mar  2 00:00:07 aae-a01n01 kernel: block drbd0: drbd0_receiver[4763]
>  Concurrent local write detected! new: 622797696s +4096; pending:
>  622797696s +4096
> Mar  2 00:00:07 aae-a01n01 kernel: block drbd0: Concurrent write! [W
>  AFTERWARDS] sec=622797696s
> Mar  2 00:00:07 aae-a01n01 kernel: block drbd0: Got DiscardAck packet
>  622797696s +4096! DRBD is not a random data generator!
> Mar  2 00:00:17 aae-a01n01 kernel: block drbd0: qemu-kvm[20305]
>  Concurrent remote write detected! [DISCARD L] new: 673151680s +32768;
>  pending: 673151712s +16384
> ...
>
> [root@aae-a01n02 ~]# cat /proc/drbd
> version: 8.3.16 (api:88/proto:86-97)
> GIT-hash: a798fa7e274428a357657fb52f0ecf40192c1985 build by
>  root@rhel6-builder-production.alteeve.ca, 2015-04-05 19:59:27
>  0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
>     ns:408 nr:2068182 dw:2068586 dr:48408 al:8 bm:115 lo:0 pe:0 ua:0
>     ap:0 ep:1 wo:f oos:0
>  1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
>     ns:750365 nr:770052 dw:1520413 dr:1062911 al:15463 bm:145 lo:0 pe:0
>     ua:0 ap:0 ep:1 wo:f oos:0
>
> At this point, storage hung (I assume on purpose). Recovery was a full
> restart of the cluster.
>
> Googling doesn't return much on this. Can someone provide insight into
> what might have happened? This was a pretty scary event, and it's the
> first time I've seen it happen in all the years I've been using DRBD.
>
> Let me know if there are any other logs or info.

I guess drbd0 is your GFS2?
What DRBD tries to tell you there is that, while one WRITE was still
"in flight", there was a new WRITE to the same LBA. Both nodes wrote to
the exact same LBA at virtually the same time. GFS2 is supposed to
coordinate write (and read) activity from both nodes such that this
won't happen.

If the /proc/drbd above is from during that "storage hang", it
indicates that DRBD is not hung at all (nor any requests), but
completely "idle" (lo:0, pe:0, ua:0, ap:0 ... nothing in flight).
If it was hung, it was not DRBD.

In that case, my best guess is that the layer above DRBD screwed up,
and both the un-coordinated writes to the same blocks, and the hang,
are symptoms of the same underlying problem.

If you can, force an fsck, and/or figure out what is located at those
LBAs (the ####s + ### are start sector + IO size in bytes).

--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT

__
please don't Cc me, but send to list -- I'm subscribed
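[List note, not part of the original thread: to act on the "figure out
what is located at those LBAs" suggestion, you first need the byte
offsets. The numbers in the log are in Linux block-layer sectors, which
are always 512 bytes regardless of the device's physical sector size,
while the "+NNNN" part is already in bytes. A small illustrative sketch
(the conflict values are copied from the log above; the helper names are
made up for this example):]

```python
# Convert the "<sector>s +<bytes>" pairs from the DRBD kernel messages
# into byte ranges on the backing device, so they can be mapped to
# filesystem structures with fsck or a filesystem debugger.

SECTOR_SIZE = 512  # Linux block-layer sectors are always 512 bytes

def io_range(start_sector, size_bytes):
    """Return (first_byte, last_byte) for one logged I/O."""
    offset = start_sector * SECTOR_SIZE
    return offset, offset + size_bytes - 1

def overlaps(a, b):
    """True if two (first_byte, last_byte) ranges intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

# The conflicting writes reported in the log above:
local_new     = io_range(622797696, 4096)   # "Concurrent local write"
local_pending = io_range(622797696, 4096)
remote_new     = io_range(673151680, 32768) # "Concurrent remote write"
remote_pending = io_range(673151712, 16384)

for name, rng in [("local new", local_new),
                  ("remote new", remote_new),
                  ("remote pending", remote_pending)]:
    print(f"{name}: bytes {rng[0]}..{rng[1]}")

print("local conflict overlaps: ", overlaps(local_new, local_pending))
print("remote conflict overlaps:", overlaps(remote_new, remote_pending))
```

Note that the two "remote" writes start at different sectors but their
byte ranges still intersect, which is exactly why DRBD flagged them as
concurrent.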