<HTML>
<HEAD>
<TITLE>Problem with disk errors</TITLE>
</HEAD>
<BODY>
<FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>I have hit a strange problem with DRBD 8.3 twice now. The server running as secondary had a disk problem and went diskless. The primary noticed this and logged the secondary's transition from UpToDate to Diskless. However, the primary itself still suffered timeouts afterwards. My question: what do I need to do so that the primary keeps running after the secondary has a disk problem? <BR>
<BR>
My drbd.conf is as follows:<BR>
<BR>
global {<BR>
usage-count yes;<BR>
}<BR>
<BR>
common {<BR>
syncer { rate 25M; }<BR>
}<BR>
<BR>
<BR>
resource drbd0 {<BR>
<BR>
protocol C;<BR>
<BR>
handlers {<BR>
pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";<BR>
pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";<BR>
local-io-error "echo o > /proc/sysrq-trigger ; halt -f";<BR>
outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater";<BR>
}<BR>
<BR>
startup {<BR>
degr-wfc-timeout 120; # 2 minutes.<BR>
}<BR>
<BR>
disk {<BR>
on-io-error detach;<BR>
}<BR>
<BR>
net {<BR>
max-buffers 2048;<BR>
max-epoch-size 2048;<BR>
after-sb-0pri disconnect;<BR>
after-sb-1pri disconnect;<BR>
after-sb-2pri disconnect;<BR>
rr-conflict disconnect;<BR>
}<BR>
<BR>
syncer {<BR>
rate 25M;<BR>
al-extents 257;<BR>
}<BR>
<BR>
on bg-host-m1 {<BR>
device /dev/drbd0;<BR>
disk /dev/sdb2;<BR>
address 172.20.0.1:7788;<BR>
meta-disk /dev/sdb1[0];<BR>
}<BR>
<BR>
on bg-host-m2 {<BR>
device /dev/drbd0;<BR>
disk /dev/sdb2;<BR>
address 172.20.0.2:7788;<BR>
meta-disk /dev/sdb1[0];<BR>
}<BR>
}<BR>
<BR>
<BR>
<BR>
Here is a small part of /var/log/messages on the primary. Also, please note the Concurrent local write message:<BR>
<BR>
Jan 29 11:10:32 bg-host-m2 iscsi_trgt: Logical Unit Reset (05) issued on tid:3 lun:2 by sid:1127000341282880 (Function Complete)<BR>
Jan 29 11:10:32 bg-host-m2 drbd0: istiod3[22036] Concurrent local write detected! [DISCARD L] new: 1471221576s +2048; pending: 1471221576s +2048<BR>
Jan 29 11:10:32 bg-host-m2 drbd0: istiod3[22036] Concurrent local write detected! [DISCARD L] new: 3606015618s +19968; pending: 3606015618s +19968<BR>
Jan 29 11:10:57 bg-host-m2 iscsi_trgt: Logical Unit Reset (05) issued on tid:3 lun:2 by sid:1127000341282880 (Function Complete)<BR>
Jan 29 11:10:59 bg-host-m2 ntpd[6722]: kernel time sync status change 4001<BR>
Jan 29 11:11:06 bg-host-m2 drbd0: Got NegAck packet. Peer is in troubles?<BR>
Jan 29 11:11:06 bg-host-m2 drbd0: Got NegAck packet. Peer is in troubles?<BR>
Jan 29 11:11:06 bg-host-m2 drbd0: pdsk( UpToDate -> Diskless )<BR>
Jan 29 11:11:06 bg-host-m2 drbd0: Creating new current UUID<BR>
Jan 29 11:11:06 bg-host-m2 drbd0: Got NegAck packet. Peer is in troubles?<BR>
<BR>
Later on the primary:<BR>
Jan 29 11:11:06 bg-host-m2 drbd0: istiod3[22035] Concurrent local write detected! [DISCARD L] new: 3617118144s +3584; pending: 3617118144s +3584<BR>
Jan 29 11:12:05 bg-host-m2 nrpe[18563]: Could not read request from client, bailing out...<BR>
Jan 29 11:12:18 bg-host-m2 INFO: task istiod5:22060 blocked for more than 120 seconds.<BR>
Jan 29 11:12:18 bg-host-m2 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.<BR>
Jan 29 11:12:18 bg-host-m2 istiod5 D 0000000000000000 0 22060 2<BR>
Jan 29 11:12:18 bg-host-m2 ffff88006a523d30 0000000000000046 0000000000000806 ffffffffa001764e<BR>
Jan 29 11:12:18 bg-host-m2 ffff88006a51abb0 ffff88006a51b140 ffff88006a51ade0 00000001a0018206<BR>
Jan 29 11:12:18 bg-host-m2 0000000000000246 ffff88012e5f9c80 ffff88012e379800 ffff88012c0237f0<BR>
Jan 29 11:12:18 bg-host-m2 Call Trace:<BR>
Jan 29 11:12:18 bg-host-m2 [&lt;ffffffffa001764e&gt;] megasas_make_sgl64+0x46/0x59 [megaraid_sas]<BR>
<BR>
<BR>
Here is the secondary:<BR>
Jan 29 11:11:06 bg-host-m1 sd 4:0:0:0: [sdb] Device not ready: ASC=0x4 ASCQ=0x0<BR>
Jan 29 11:11:06 bg-host-m1 end_request: I/O error, dev sdb, sector 1216507571<BR>
Jan 29 11:11:06 bg-host-m1 drbd0: disk( UpToDate -> Failed ) <BR>
Jan 29 11:11:06 bg-host-m1 drbd0: Local IO failed. Detaching...<BR>
Jan 29 11:11:06 bg-host-m1 drbd0: disk( Failed -> Diskless ) <BR>
Jan 29 11:11:06 bg-host-m1 drbd0: Notified peer that my disk is broken.<BR>
<BR>
Then later on the secondary:<BR>
Jan 29 11:13:29 bg-host-m1 INFO: task drbd0_worker:32651 blocked for more than 120 seconds.<BR>
Jan 29 11:13:29 bg-host-m1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.<BR>
Jan 29 11:13:29 bg-host-m1 drbd0_worker D 000000000000000a 0 32651 2<BR>
Jan 29 11:13:29 bg-host-m1 ffff88008f589e10 0000000000000046 ffff8801088f0000 0000000000000000<BR>
<BR>
[lots more info deleted from this event]<BR>
<BR>
This keeps repeating.<BR>
<BR>
-- <BR>
Terry Hull<BR>
Network Resource Group, Inc. President<BR>
<BR>
</SPAN></FONT>
</BODY>
</HTML>