<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
I backported the hack from 9.0.17-1; now, under the same conditions, the
node still floods the log with "al_complete_io()" and "LOGIC BUG"
messages, but at least it does not crash.<br>
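<br>
As a rough way to gauge how heavy the flooding is, plain grep over the
kernel ring buffer and syslog is enough (paths assume the default
CentOS 6 syslog layout):
<pre># count the two message types since boot (the ring buffer may wrap under heavy flooding)
dmesg | grep -c 'al_complete_io() called on inactive extent'
dmesg | grep -c 'LOGIC BUG for enr='

# or watch them arrive live via syslog
tail -f /var/log/messages | grep --line-buffered -E 'al_complete_io|LOGIC BUG'
</pre>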
<br>
<pre class="moz-signature" cols="72">Ing. Evzen Demcenko
Senior Linux Administrator
Cluster Design s.r.o.
</pre>
<div class="moz-cite-prefix">On 5/13/19 6:43 PM, Evzen Demcenko
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:6cab89a6-47ac-d0a5-b551-d1e4902f6808@cldn.cz">If drbd
node in primary-primary setup looses disk by any reason (faulty
disk, controller etc., or even manual detach) and there is R/W
activity on both nodes, the second node eventually crashes with
kernel panic "kernel BUG at
/root/rpmbuild/BUILD/drbd-8.4.11-1/drbd/lru_cache.c:570!"
<br>
Before the crash there are a lot of messages in the kernel log on the
good node (the one with its disk still attached), like
<br>
<br>
block drbd1: al_complete_io() called on inactive extent 57
<br>
block drbd1: LOGIC BUG for enr=74
<br>
<br>
Eventually, the "good" node crashes within minutes or hours
depending on disk activity leaving the cluster without any data.
<br>
I tested several versions from the 8.4 tree (8.4.6, 8.4.7-1,
8.4.9-1, 8.4.11-1), always with the same result.
<br>
There is also no difference between real hardware and virtualized
machines.
<br>
The kernel is 2.6.32-754.12.1.el6.x86_64 on CentOS 6.10 with the latest
updates. I also tested other kernels, with the same outcome.
<br>
A vmcore-dmesg.txt is attached to this email. The vmcore itself is
available as well for every tested version; the core files are
35-50 MB, so I can't attach them to email, but I'll be glad to
share them in some other way if needed.
<br>
<br>
<pre>[root@drtest-11 ~]# cat /etc/drbd.d/global_common.conf
global {
        usage-count yes;
}

common {
        protocol C;

        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh";
        }

        startup {
                # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
        }

        disk {
                resync-rate     100M;
                on-io-error     detach;
                al-extents      1447;
                c-plan-ahead    32;
                c-max-rate      1000M;
                c-min-rate      80M;
                c-fill-target   65536k;
        }

        net {
                sndbuf-size     4096k;
                rcvbuf-size     4096k;
                timeout         100;    # 10 seconds (unit = 0.1 seconds)
                connect-int     15;     # 15 seconds (unit = 1 second)
                ping-int        15;     # 15 seconds (unit = 1 second)
                ping-timeout    50;     # 5000 ms (unit = 0.1 seconds)
                max-buffers     131072;
                max-epoch-size  20000;
                ko-count        0;
                after-sb-0pri   discard-younger-primary;
                after-sb-1pri   consensus;
                after-sb-2pri   disconnect;
                rr-conflict     disconnect;
        }
}
</pre>
<br>
<pre>[root@drtest-11 ~]# cat /etc/drbd.d/r1.res
resource r1 {
        net {
                protocol C;
                allow-two-primaries;
                verify-alg crc32c;
                csums-alg crc32c;
        }
        startup {
                become-primary-on both;
        }
        disk {
                disk-timeout 1200;
        }
        on drtest-11.uvt.internal {
                device /dev/drbd1;
                disk "/dev/vdb";
                address 10.0.11.201:7790;
                flexible-meta-disk internal;
        }
        on drtest-12.uvt.internal {
                device /dev/drbd1;
                disk "/dev/vdb";
                address 10.0.11.202:7790;
                flexible-meta-disk internal;
        }
}
</pre>
<br>
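For completeness, the effective configuration DRBD runs with and the
resource state can be checked with the standard drbd-utils tools
(nothing specific to this setup):
<pre>drbdadm dump r1   # merged view of global_common.conf and r1.res as drbdadm parses it
cat /proc/drbd    # roles (Primary/Primary), connection and disk states on the 8.4 series
</pre>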
<br>
How to reproduce:
<br>
After create-md, connect, primary etc.:
<br>
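(For reference, the usual bring-up sequence those steps correspond to
looks roughly like this; standard drbdadm commands, not a verbatim
transcript of what was actually typed:)
<pre># on both nodes
drbdadm create-md r1
drbdadm up r1
# on one node only, to seed the initial sync
drbdadm primary --force r1
# once connected, promote the second node as well (allow-two-primaries is set)
drbdadm primary r1
</pre>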
<br>
On drtest-11:
<pre>pvcreate /dev/drbd1
vgcreate test /dev/drbd1
lvcreate -n t1 -L20g test
lvcreate -n t2 -L20g test
mkfs.ext4 /dev/test/t1
mount /dev/test/t1 /mnt/t1
mkdir -m 777 /mnt/t1/test
while true ; do bonnie++ -u nobody -d /mnt/t1/test/ -n 8192 -s8192 ; done
</pre>
<br>
On drtest-12:
<pre>vgchange -aly
mkfs.ext4 /dev/test/t2
mount /dev/test/t2 /mnt/t2
mkdir -m 777 /mnt/t2/test
while true ; do bonnie++ -u nobody -d /mnt/t2/test/ -n 8192 -s8192 ; done

drbdadm detach r1
</pre>
<br>
<br>
After the detach on drtest-12, drtest-11 almost instantly starts
flooding the log with "al_complete_io() called on inactive
extent" and "LOGIC BUG for enr=" messages, and crashes within a
couple of minutes.
<br>
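The asymmetry right after the detach can be confirmed on both nodes
before the crash happens (commands only, output omitted):
<pre># on drtest-12, the detached node: the local disk state should show Diskless
cat /proc/drbd
# on drtest-11: still Primary with an UpToDate disk while the flood starts
cat /proc/drbd
dmesg | tail -n 20
</pre>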
<br>
Thanks in advance for investigating this issue.
<br>
Sincerely,
<br>
<br>
<br>
</blockquote>
<br>
</body>
</html>