On 02/03/17 02:40 PM, Lars Ellenberg wrote:
> On Thu, Mar 02, 2017 at 03:07:52AM -0500, Digimer wrote:
>> Hi all,
>>
>> We had an event last night on a system that's been in production for a
>> couple of years; DRBD 8.3.16. At almost exactly midnight, both nodes
>> threw these errors:
>>
>> =====
>> eb 28 03:42:01 aae-a01n01 rsyslogd: [origin software="rsyslogd"
>> swVersion="5.8.10" x-pid="1729" x-info="http://www.rsyslog.com"]
>> rsyslogd was HUPed
>> Mar 2 00:00:07 aae-a01n01 kernel: block drbd0: drbd0_receiver[4763]
>> Concurrent local write detected! new: 622797696s +4096; pending:
>> 622797696s +4096
>> Mar 2 00:00:07 aae-a01n01 kernel: block drbd0: Concurrent write! [W
>> AFTERWARDS] sec=622797696s
>> Mar 2 00:00:07 aae-a01n01 kernel: block drbd0: Got DiscardAck packet
>> 622797696s +4096! DRBD is not a random data generator!
>> Mar 2 00:00:17 aae-a01n01 kernel: block drbd0: qemu-kvm[20305]
>> Concurrent remote write detected! [DISCARD L] new: 673151680s +32768;
>> pending: 673151712s +16384
>
> ...
>
>> [root@aae-a01n02 ~]# cat /proc/drbd
>> version: 8.3.16 (api:88/proto:86-97)
>> GIT-hash: a798fa7e274428a357657fb52f0ecf40192c1985 build by
>> root@rhel6-builder-production.alteeve.ca, 2015-04-05 19:59:27
>>  0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
>>     ns:408 nr:2068182 dw:2068586 dr:48408 al:8 bm:115 lo:0 pe:0 ua:0
>>     ap:0 ep:1 wo:f oos:0
>>  1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
>>     ns:750365 nr:770052 dw:1520413 dr:1062911 al:15463 bm:145 lo:0 pe:0
>>     ua:0 ap:0 ep:1 wo:f oos:0
>>
>> At this point, storage hung (I assume on purpose). Recovery was a full
>> restart of the cluster.
>>
>> Googling doesn't return much on this. Can someone provide insight into
>> what might have happened? This was a pretty scary event, and it's the
>> first time I've seen it happen in all the years I've been using DRBD.
>>
>> Let me know if there are any other logs or info.
>
> I guess drbd0 is your GFS2?
>
> What DRBD tries to tell you there is that, while one WRITE was still
> "in flight", there was a new WRITE to the same LBA.
>
> Both nodes wrote to the exact same LBA at virtually the same time.
>
> GFS2 is supposed to coordinate write (and read) activity from both nodes
> such that this won't happen.
>
> If the /proc/drbd above is from during that "storage hang",
> it indicates that DRBD is not hung at all (nor any requests),
> but completely "idle" (lo:0, pe:0, ua:0, ap:0 ... nothing in flight).
>
> If it was hung, it was not DRBD.
>
> In that case, my best guess is that the layer above DRBD screwed up,
> and both the un-coordinated writes to the same blocks
> and the hang are symptoms of the same underlying problem.
>
> If you can, force an fsck,
> and/or figure out what is located at those LBAs
> (the ####s + ### are start sector + IO size in bytes).

It's looking like, somehow, one of the servers was booted on both nodes
at the same time... We're still trying to get access to take a closer
look to confirm, though. If that is what happened, it would match what
you're saying (and is a massive problem).

So when this happened, DRBD hung on purpose? If so, that's amazing, as
the hung storage appears to have saved us.

madi

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein's brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops."
- Stephen Jay Gould
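[List-editor note: a minimal sketch of the arithmetic Lars describes. The log values are a start offset in 512-byte sectors ("622797696s") plus an IO size in bytes ("+4096"); converting both writes to byte ranges shows whether they touch the same region of the DRBD device. The sector/size numbers are taken from the logs quoted above; the helper names are illustrative, not part of any DRBD tool.]

```python
# Interpret DRBD's "Concurrent write detected" log values.
# DRBD reports offsets in 512-byte sectors, sizes in bytes.
SECTOR_SIZE = 512

def byte_range(start_sector, size_bytes):
    """Return (start, end) byte offsets on the DRBD device."""
    start = start_sector * SECTOR_SIZE
    return start, start + size_bytes

def overlaps(a, b):
    """True if two half-open (start, end) byte ranges intersect."""
    return a[0] < b[1] and b[0] < a[1]

# From the log: "new: 673151680s +32768; pending: 673151712s +16384"
new = byte_range(673151680, 32768)
pending = byte_range(673151712, 16384)
print(new, pending, overlaps(new, pending))  # the two writes do overlap
```

Once the byte or sector offset is known, something like `dd if=/dev/drbd0 bs=512 skip=622797696 count=8 | hexdump -C` can show what actually lives at that LBA, which helps identify which file or VM image the colliding writes belonged to.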