On 02/03/17 02:40 PM, Lars Ellenberg wrote:
> On Thu, Mar 02, 2017 at 03:07:52AM -0500, Digimer wrote:
>> Hi all,
>>
>> We had an event last night on a system that's been in production for a
>> couple of years; DRBD 8.3.16. At almost exactly midnight, both nodes
>> threw these errors:
>>
>> =====
>> eb 28 03:42:01 aae-a01n01 rsyslogd: [origin software="rsyslogd"
>> swVersion="5.8.10" x-pid="1729" x-info="http://www.rsyslog.com"]
>> rsyslogd was HUPed
>> Mar 2 00:00:07 aae-a01n01 kernel: block drbd0: drbd0_receiver[4763]
>> Concurrent local write detected! new: 622797696s +4096; pending:
>> 622797696s +4096
>> Mar 2 00:00:07 aae-a01n01 kernel: block drbd0: Concurrent write! [W
>> AFTERWARDS] sec=622797696s
>> Mar 2 00:00:07 aae-a01n01 kernel: block drbd0: Got DiscardAck packet
>> 622797696s +4096! DRBD is not a random data generator!
>> Mar 2 00:00:17 aae-a01n01 kernel: block drbd0: qemu-kvm[20305]
>> Concurrent remote write detected! [DISCARD L] new: 673151680s +32768;
>> pending: 673151712s +16384
>
> ...
>
>> [root@aae-a01n02 ~]# cat /proc/drbd
>> version: 8.3.16 (api:88/proto:86-97)
>> GIT-hash: a798fa7e274428a357657fb52f0ecf40192c1985 build by
>> root@rhel6-builder-production.alteeve.ca, 2015-04-05 19:59:27
>>  0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
>>     ns:408 nr:2068182 dw:2068586 dr:48408 al:8 bm:115 lo:0 pe:0 ua:0
>>     ap:0 ep:1 wo:f oos:0
>>  1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
>>     ns:750365 nr:770052 dw:1520413 dr:1062911 al:15463 bm:145 lo:0 pe:0
>>     ua:0 ap:0 ep:1 wo:f oos:0
>>
>> At this point, storage hung (I assume on purpose). Recovery was a full
>> restart of the cluster.
>>
>> Googling doesn't return much on this. Can someone provide insight into
>> what might have happened? This was a pretty scary event, and it's the
>> first time I've seen it happen in all the years I've been using DRBD.
>>
>> Let me know if there are any other logs or info.
>
> I guess drbd0 is your GFS2?
>
> What DRBD tries to tell you there is that, while one WRITE was still
> "in flight", there was a new WRITE to the same LBA.
>
> Both nodes wrote to the exact same LBA at virtually the same time.
>
> GFS2 is supposed to coordinate write (and read) activity from both nodes
> such that this won't happen.
>
> If the /proc/drbd above is from during that "storage hang",
> it indicates that DRBD is not hung at all (nor any requests),
> but completely "idle" (lo:0, pe:0, ua:0, ap:0 ... nothing in flight).
>
> If it was hung, it was not DRBD.
>
> In that case, my best guess is that the layer above DRBD screwed up,
> and both the un-coordinated writes to the same blocks
> and the hang are symptoms of the same underlying problem.
>
> If you can, force an fsck,
> and/or figure out what is located at those LBAs
> (the ####s + ### are start sector + IO size in bytes).

It's looking like, somehow, one of the servers was booted on both nodes
at the same time... We're still trying to get access to take a closer
look to confirm, though. If that is what happened, it would match what
you're saying (and is a massive problem).

So when this happened, DRBD hung on purpose? If so, that's amazing, as
the hung storage appears to have saved us.

madi

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein's brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops."
- Stephen Jay Gould
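[List-editor note: a minimal sketch of the arithmetic Lars describes. The log values are a start offset in 512-byte sectors ("622797696s") plus an IO size in bytes ("+4096"); converting both writes to byte ranges shows whether they touch the same region of the DRBD device. The sector/size numbers are taken from the logs quoted above; the helper names are illustrative, not part of any DRBD tool.]

```python
# Interpret DRBD's "Concurrent write detected" log values.
# DRBD reports offsets in 512-byte sectors, sizes in bytes.
SECTOR_SIZE = 512

def byte_range(start_sector, size_bytes):
    """Return (start, end) byte offsets on the DRBD device."""
    start = start_sector * SECTOR_SIZE
    return start, start + size_bytes

def overlaps(a, b):
    """True if two half-open (start, end) byte ranges intersect."""
    return a[0] < b[1] and b[0] < a[1]

# From the log: "new: 673151680s +32768; pending: 673151712s +16384"
new = byte_range(673151680, 32768)
pending = byte_range(673151712, 16384)
print(new, pending, overlaps(new, pending))  # the two writes do overlap
```

Once the byte or sector offset is known, something like `dd if=/dev/drbd0 bs=512 skip=622797696 count=8 | hexdump -C` can show what actually lives at that LBA, which helps identify which file or VM image the colliding writes belonged to.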