<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html>
<head>
<meta name="Generator" content="Zarafa WebApp v7.2.0-48204">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>High Load / IO, synching doesn't finish</title>
</head>
<body>
<p style="padding: 0; margin: 0;"><span data-mce-bogus="true" id="_mce_caret"><span data-mce-style="font-size: 10pt; font-family: tahoma,arial,helvetica,sans-serif;" style="font-size: 10pt; font-family: tahoma,arial,helvetica,sans-serif;">Hi,<br><br>We have a strange problem with DRBD 8.4.4:<br><br>Since the weekend, the load and I/O wait on the server nodes have been very high. We have 8 cores and a load like this:<br><br>top - 08:52:43 up 45 days, 1:46, 2 users, load average: 95.60, 103.74, 110.63<br>Tasks: 432 total, 1 running, 431 sleeping, 0 stopped, 0 zombie<br>Cpu(s): 1.6%us, 0.9%sy, 0.0%ni, 16.2%id, 81.3%wa, 0.0%hi, 0.0%si, 0.0%st<br><br><br>With drbd-overview I saw that two resources were syncing. I disconnected them on both nodes, and the load and I/O wait went back to normal.<br>When I reconnect one of these resources, the sync starts:<br><br>root@pmt-ucs02:/etc/drbd.d# drbd-overview <br> 1:vm_pmt-dc1/0 SyncTarget Secondary/Primary Inconsistent/UpToDate A r----- <br> [>....................] sync'ed: 0.8% (550444/550444)K<br> 2:vm_pmt-mail/0 StandAlone Secondary/Unknown Inconsistent/DUnknown r----- <br> 3:vm_pmt-winsrv/0 Connected Secondary/Primary UpToDate/UpToDate A r----- <br> 4:vm_pmt-erp/0 Connected Secondary/Primary UpToDate/UpToDate A r----- <br> 5:vm_pmt-dc2/0 Connected Primary/Secondary UpToDate/UpToDate A r----- <br><br>However, it never finishes. Instead, the load and I/O wait rise again to the point where the server hardly responds at all.<br>Sometimes the sync goes up to 1.5% or 3% and then falls back to 0.8% again.<br><br>Another strange behaviour is that drbd-overview always takes at least 10 seconds to run on one of the nodes, no matter how low the load is.<br>The other node responds immediately.<br>I also get this output from time to time on the slow node:<br><br>root@pmt-ucs02:/etc/drbd.d# drbd-overview <br> 1:vm_pmt-dc1/0 StandAlone Secondary/Unknown Inconsistent/DUnknown r----- <br> 2:??not-found?? 
StandAlone Secondary/Unknown Inconsistent/DUnknown r----- <br> 3:??not-found?? Connected Secondary/Primary UpToDate/UpToDate A r----- <br> 4:??not-found?? Connected Secondary/Primary UpToDate/UpToDate A r----- <br> 5:??not-found?? Connected Primary/Secondary UpToDate/UpToDate A r----- <br><br><br>Three of the five resources are working well.</span></span></p><p style="padding: 0; margin: 0;"><span data-mce-bogus="true" id="_mce_caret"><span data-mce-style="font-size: 10pt; font-family: tahoma,arial,helvetica,sans-serif;" style="font-size: 10pt; font-family: tahoma,arial,helvetica,sans-serif;">Thanks for reading.<br>Any ideas?<br><br>Cheers,</span></span></p><p style="padding: 0; margin: 0;"><span data-mce-bogus="true" id="_mce_caret"><span data-mce-style="font-size: 10pt; font-family: tahoma,arial,helvetica,sans-serif;" style="font-size: 10pt; font-family: tahoma,arial,helvetica,sans-serif;">Roland.</span></span></p><p style="padding: 0; margin: 0;"><span data-mce-bogus="true" id="_mce_caret"><span data-mce-style="font-size: 10pt; font-family: tahoma,arial,helvetica,sans-serif;" style="font-size: 10pt; font-family: tahoma,arial,helvetica,sans-serif;"><br><br>Some additional information:<br><br><br><br>root@pmt-ucs02:/etc/drbd.d# cat /proc/drbd <br>version: 8.4.4 (api:1/proto:86-101)<br>GIT-hash: 905561ebc321ce0f08ed66b783e05944e733206d build by root@, 2014-08-25 18:11:11<br><br> 1: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown r-----<br> ns:0 nr:60304 dw:470969676 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 
ep:1 wo:d oos:500552<br> 2: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown r-----<br> ns:0 nr:1080881120 dw:1080881120 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:4422884<br> 3: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----<br> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0<br> 4: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----<br> ns:0 nr:101852461 dw:101852461 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0<br> 5: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----<br> ns:134208492 nr:0 dw:129141712 dr:42898984 al:2310 bm:0 lo:1 pe:0 ua:0 ap:1 ep:1 wo:d oos:0<br><br><br><br><br>root@pmt-ucs02:~# top<br><br>top - 08:52:43 up 45 days, 1:46, 2 users, load average: 95.60, 103.74, 110.63<br>Tasks: 432 total, 1 running, 431 sleeping, 0 stopped, 0 zombie<br>Cpu(s): 1.6%us, 0.9%sy, 0.0%ni, 16.2%id, 81.3%wa, 0.0%hi, 0.0%si, 0.0%st<br>Mem: 28876948k total, 18365484k used, 10511464k free, 358848k buffers<br>Swap: 10485756k total, 0k used, 10485756k free, 3086740k cached<br><br> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND <br>10258 libvirt- 20 0 8952m 8.1g 6032 S 21 29.5 6315:39 kvm <br> 708 libvirt- 20 0 8707m 2.9g 6032 S 1 10.4 307:58.90 kvm <br> 3737 root 20 0 11096 1600 912 S 1 0.0 0:01.51 top <br> 28 root 20 0 0 0 0 S 0 0.0 94:50.43 ksoftirqd/4 <br> 4390 root 20 0 19388 1692 1012 R 0 0.0 0:00.07 top <br> 1 root 20 0 10452 776 644 S 0 0.0 0:43.99 init <br> 2 root 20 0 0 0 0 S 0 0.0 0:00.56 kthreadd <br> 3 root 20 0 0 0 0 S 0 0.0 87:47.56 ksoftirqd/0 <br> 4 root 20 0 0 0 0 S 0 0.0 0:00.00 kworker/0:0 <br> 5 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/0:0H <br> 7 root RT 0 0 0 0 S 0 0.0 0:01.52 migration/0 <br> 8 root 20 0 0 0 0 S 0 0.0 0:00.00 rcu_bh <br> 9 root 20 0 0 0 0 S 0 0.0 4:08.77 rcu_sched <br> 10 root RT 0 0 0 0 S 0 0.0 0:08.32 watchdog/0 <br> 11 root RT 0 0 0 0 S 0 0.0 0:08.60 watchdog/1 <br> 12 root RT 0 0 0 0 S 0 0.0 0:01.38 migration/1 <br> 13 root 20 0 0 0 0 S 0 0.0 
88:40.54 ksoftirqd/1 <br> 15 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/1:0H <br> 16 root RT 0 0 0 0 S 0 0.0 0:08.21 watchdog/2 <br> 17 root RT 0 0 0 0 S 0 0.0 0:01.41 migration/2 <br> 18 root 20 0 0 0 0 S 0 0.0 89:22.10 ksoftirqd/2 <br> 20 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/2:0H <br> 21 root RT 0 0 0 0 S 0 0.0 0:07.62 watchdog/3 <br> 22 root RT 0 0 0 0 S 0 0.0 0:01.46 migration/3 <br> 23 root 20 0 0 0 0 S 0 0.0 77:02.51 ksoftirqd/3 <br> 25 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/3:0H <br> 26 root RT 0 0 0 0 S 0 0.0 0:06.56 watchdog/4 <br> 27 root RT 0 0 0 0 S 0 0.0 0:05.56 migration/4 <br> 30 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/4:0H <br> 31 root RT 0 0 0 0 S 0 0.0 0:07.11 watchdog/5 <br> 32 root RT 0 0 0 0 S 0 0.0 0:05.49 migration/5 <br> 33 root 20 0 0 0 0 S 0 0.0 87:29.61 ksoftirqd/5 <br><br><br><br><br><br>root@pmt-ucs01:/etc/drbd.d# cat global_common.conf<br>global {<br> usage-count yes;<br>}<br><br>common {<br> handlers {<br> # These are EXAMPLE handlers only.<br> # They may have severe implications,<br> # like hard resetting the node under certain circumstances.<br> # Be careful when chosing your poison.<br><br> # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";<br> # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";<br> # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";<br> # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";<br> # split-brain "/usr/lib/drbd/notify-split-brain.sh root";<br> # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";<br> # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";<br> # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;<br> }<br><br> startup {<br> # wfc-timeout degr-wfc-timeout 
outdated-wfc-timeout wait-after-sb<br> }<br><br> options {<br> # cpu-mask on-no-data-accessible<br> }<br><br> disk {<br> on-io-error detach;<br> fencing resource-only;<br> disk-flushes no;<br> md-flushes no;<br> al-extents 1237;<br> c-delay-target 20;<br> c-fill-target 0;<br> c-max-rate 150M;<br> c-min-rate 5M;<br> }<br><br> net {<br><br> max-epoch-size 16000;<br> max-buffers 16000;<br> ko-count 6;<br> cram-hmac-alg sha1;<br> shared-secret ba96f8297d8f16f0f58061f0fcc6e5d13dcaa6dd;<br> <br> verify-alg crc32c;<br><br> ## fall behind with secondary on net-congestion<br> on-congestion pull-ahead;<br> congestion-extents 800; # e.g. 2/3 of al-extends<br> congestion-fill 400M;<br><br> }<br>}<br><br><br><br>root@pmt-ucs01:/etc/drbd.d# cat vm_pmt-mail.res <br>resource vm_pmt-mail {<br> net {<br> protocol A;<br> cram-hmac-alg sha1;<br> shared-secret "FooFunFactory";<br> max-buffers 131072;<br> max-epoch-size 20000;<br> sndbuf-size 0;<br> rcvbuf-size 0;<br> verify-alg md5;<br> }<br> on pmt-ucs01 {<br> device drbd2;<br> disk /dev/vg_ucs/vm_pmt-mail;<br> meta-disk internal;<br> address 192.168.80.1:7792;<br> }<br> on pmt-ucs02 {<br> device drbd2;<br> disk /dev/vg_ucs/vm_pmt-mail;<br> meta-disk internal;<br> address 192.168.80.2:7792;<br> }<br>}<br><br></span></span></p>
</body>
</html>