Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all,

We had an event last night on a system that has been in production for a couple of years, running DRBD 8.3.16. At almost exactly midnight, both nodes threw these errors:

=====
Feb 28 03:42:01 aae-a01n01 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1729" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Mar 2 00:00:07 aae-a01n01 kernel: block drbd0: drbd0_receiver[4763] Concurrent local write detected! new: 622797696s +4096; pending: 622797696s +4096
Mar 2 00:00:07 aae-a01n01 kernel: block drbd0: Concurrent write! [W AFTERWARDS] sec=622797696s
Mar 2 00:00:07 aae-a01n01 kernel: block drbd0: Got DiscardAck packet 622797696s +4096! DRBD is not a random data generator!
Mar 2 00:00:17 aae-a01n01 kernel: block drbd0: qemu-kvm[20305] Concurrent remote write detected! [DISCARD L] new: 673151680s +32768; pending: 673151712s +16384
Mar 2 00:00:17 aae-a01n01 kernel: block drbd0: qemu-kvm[20305] Concurrent remote write detected! [DISCARD L] new: 673151712s +16384; pending: 673151712s +16384
Mar 2 00:00:17 aae-a01n01 kernel: block drbd0: qemu-kvm[20305] Concurrent remote write detected! [DISCARD L] new: 673151712s +16384; pending: 673151712s +16384
Mar 2 00:00:17 aae-a01n01 kernel: block drbd0: qemu-kvm[20305] Concurrent remote write detected! [DISCARD L] new: 673151712s +16384; pending: 673151712s +16384
Mar 2 00:00:17 aae-a01n01 kernel: block drbd0: qemu-kvm[20305] Concurrent remote write detected! [DISCARD L] new: 673151712s +16384; pending: 673151712s +16384
Mar 2 00:00:17 aae-a01n01 kernel: block drbd0: qemu-kvm[20305] Concurrent remote write detected! [DISCARD L] new: 673151712s +16384; pending: 673151712s +16384
=====

=====
Feb 28 03:23:01 aae-a01n02 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1729" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Mar 2 00:00:07 aae-a01n02 kernel: block drbd0: drbd0_receiver[4758] Concurrent local write detected! new: 622797696s +4096; pending: 622797696s +4096
Mar 2 00:00:07 aae-a01n02 kernel: block drbd0: Concurrent write! [DISCARD BY FLAG] sec=622797696s
Mar 2 00:00:11 aae-a01n02 kernel: block drbd0: qemu-kvm[15639] Concurrent remote write detected! [DISCARD L] new: 622797696s +4096; pending: 622797696s +4096
Mar 2 00:00:18 aae-a01n02 kernel: block drbd0: qemu-kvm[15639] Concurrent remote write detected! [DISCARD L] new: 673151712s +16384; pending: 673151712s +16384
Mar 2 00:00:18 aae-a01n02 kernel: block drbd0: qemu-kvm[15639] Concurrent remote write detected! [DISCARD L] new: 673151712s +16384; pending: 673151712s +16384
Mar 2 00:00:18 aae-a01n02 kernel: block drbd0: drbd0_receiver[4758] Concurrent local write detected! new: 673151712s +16384; pending: 673151712s +16384
Mar 2 00:00:18 aae-a01n02 kernel: block drbd0: Concurrent write! [DISCARD BY FLAG] sec=673151712s
Mar 2 00:00:18 aae-a01n02 kernel: block drbd0: qemu-kvm[15639] Concurrent remote write detected! [DISCARD L] new: 673151712s +16384; pending: 673151712s +16384
Mar 2 00:00:18 aae-a01n02 kernel: block drbd0: drbd0_receiver[4758] Concurrent local write detected! new: 673151744s +16384; pending: 673151744s +16384
Mar 2 00:00:18 aae-a01n02 kernel: block drbd0: Concurrent write! [W AFTERWARDS] sec=673151744s
Mar 2 00:00:18 aae-a01n02 kernel: block drbd0: qemu-kvm[15639] Concurrent remote write detected! [DISCARD L] new: 673151744s +16384; pending: 673151744s +16384
Mar 2 00:00:18 aae-a01n02 kernel: block drbd0: qemu-kvm[15639] Concurrent remote write detected! [DISCARD L] new: 673151744s +16384; pending: 673151744s +16384
Mar 2 00:00:18 aae-a01n02 kernel: block drbd0: qemu-kvm[15639] Concurrent remote write detected! [DISCARD L] new: 673151744s +16384; pending: 673151744s +16384
Mar 2 00:00:18 aae-a01n02 kernel: block drbd0: qemu-kvm[15639] Concurrent remote write detected! [DISCARD L] new: 673151744s +16384; pending: 673151744s +16384
Mar 2 00:00:18 aae-a01n02 kernel: block drbd0: qemu-kvm[15639] Concurrent remote write detected! [DISCARD L] new: 673151744s +16384; pending: 673151744s +16384
Mar 2 00:00:18 aae-a01n02 kernel: block drbd0: qemu-kvm[15639] Concurrent remote write detected! [DISCARD L] new: 673151744s +16384; pending: 673151744s +16384
=====

=====
[root@aae-a01n02 ~]# drbdadm dump-xml
<config file="/etc/drbd.conf">
  <common protocol="C">
    <section name="net">
      <option name="allow-two-primaries"/>
      <option name="after-sb-0pri" value="discard-zero-changes"/>
      <option name="after-sb-1pri" value="discard-secondary"/>
      <option name="after-sb-2pri" value="disconnect"/>
    </section>
    <section name="disk">
      <option name="fencing" value="resource-and-stonith"/>
    </section>
    <section name="syncer">
      <option name="rate" value="30M"/>
    </section>
    <section name="startup">
      <option name="wfc-timeout" value="300"/>
      <option name="degr-wfc-timeout" value="120"/>
      <option name="outdated-wfc-timeout" value="120"/>
      <option name="become-primary-on" value="both"/>
    </section>
    <section name="handlers">
      <option name="pri-on-incon-degr" value="/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f"/>
      <option name="pri-lost-after-sb" value="/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f"/>
      <option name="local-io-error" value="/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f"/>
      <option name="fence-peer" value="/usr/lib/drbd/rhcs_fence"/>
    </section>
  </common>
  <resource name="r0">
    <host name="aae-a01n01.hwholdings.com">
      <device minor="0">/dev/drbd0</device>
      <disk>/dev/sda5</disk>
      <address family="ipv4" port="7788">10.10.10.1</address>
      <meta-disk>internal</meta-disk>
    </host>
    <host name="aae-a01n02.hwholdings.com">
      <device minor="0">/dev/drbd0</device>
      <disk>/dev/sda5</disk>
      <address family="ipv4" port="7788">10.10.10.2</address>
      <meta-disk>internal</meta-disk>
    </host>
  </resource>
  <resource name="r1">
    <host name="aae-a01n01.hwholdings.com">
      <device minor="1">/dev/drbd1</device>
      <disk>/dev/sda6</disk>
      <address family="ipv4" port="7789">10.10.10.1</address>
      <meta-disk>internal</meta-disk>
    </host>
    <host name="aae-a01n02.hwholdings.com">
      <device minor="1">/dev/drbd1</device>
      <disk>/dev/sda6</disk>
      <address family="ipv4" port="7789">10.10.10.2</address>
      <meta-disk>internal</meta-disk>
    </host>
  </resource>
</config>
=====

=====
[root@aae-a01n02 ~]# cat /proc/drbd
version: 8.3.16 (api:88/proto:86-97)
GIT-hash: a798fa7e274428a357657fb52f0ecf40192c1985 build by root@rhel6-builder-production.alteeve.ca, 2015-04-05 19:59:27
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:408 nr:2068182 dw:2068586 dr:48408 al:8 bm:115 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:750365 nr:770052 dw:1520413 dr:1062911 al:15463 bm:145 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
=====

At this point, storage hung (I assume on purpose). Recovery was a full restart of the cluster. Googling these messages doesn't turn up much. Can someone provide insight into what might have happened?
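If it would help with diagnosis, here is roughly how I plan to check whether the two replicas actually diverged. This is only a sketch: verify-alg isn't set in our net section (it's not in the dump above), so it would have to be added to /etc/drbd.conf on both nodes first.

=====
# Sketch only; assumes a verify-alg (e.g. md5) has been added to the net
# section on both nodes, then picked up with:
drbdadm adjust r0

# Start an online verify of r0 from one node:
drbdadm verify r0

# Watch progress; once it finishes, a non-zero "oos:" count on either node
# would mean the concurrent writes really did leave the replicas out of sync:
watch -n 5 cat /proc/drbd
=====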
This was a pretty scary event, and it's the first time I've seen it happen in all the years I've been using DRBD. Let me know if there are any other logs or info that would help.

Thanks!

digimer

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould