<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<font face="Helvetica, Arial, sans-serif">Hi all,<br>
<br>
I am running DRBD 8.4.3 in a dual-primary setup. The DRBD interlink
is a 10 GbE Intel X520 link, cross-connected back-to-back between
the two Supermicro boxes. I have configured two SCST iSCSI targets
from two DRBD volumes, which are configured as multipath devices on
three Oracle VM servers. Since these LUNs will be used as storage
repositories, they are initialized as OCFS2 volumes.<br>
<br>
So the setup is like this:<br>
<br>
DRBD hosts:<br>
<br>
- Supermicro chassis, 32 GB RAM, 2 x Intel E5-2603 @ 1.8 GHz,
2 x LSI 9207-8i, 2 x Intel X520-T2<br>
- CentOS 6.3<br>
- DRBD 8.4.3<br>
- SCST svn 3.x<br>
- 10 GbE DRBD interconnect<br>
- 2 x 1 GbE LACP bond for iSCSI<br>
<br>
<br>
The setup still lacks the Pacemaker part; I have not gotten around
to configuring it yet, so bear with me on that. The first and
primary goal was to test the iSCSI side for speed and reliability,
and this is exactly where I am having issues.<br>
<br>
I ran three concurrent tests from my OVM servers using fio against
one of the DRBD volumes/SCST LUNs, and these tests passed without
any issue. However, it seems I can get DRBD into trouble once I
exceed a certain throughput: DRBD can no longer keep up with the
concurrent/conflicting writes, starts to disconnect/reconnect, and I
am wondering what might cause this.<br>
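<br>
To give an idea of the load: each OVM server ran a sequential 12 GB
write, roughly along the lines of the fio call below (the exact
parameters and target path here are illustrative, not a verbatim
copy of my job file):<br>
<br>
<tt># one sequential 12 GB writer per OVM server (illustrative only)</tt><br>
<tt>fio --name=seqwrite --filename=/mnt/srvpool/testfile \</tt><br>
<tt>&nbsp;&nbsp;&nbsp;&nbsp;--rw=write --bs=1M --size=12G \</tt><br>
<tt>&nbsp;&nbsp;&nbsp;&nbsp;--ioengine=libaio --direct=1 --iodepth=16</tt><br>
<br>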
If that happens, /var/log/messages shows this on one host:<br>
<br>
<tt>Mar 16 17:31:17 ovmdrbd02 kernel: block drbd1: Timed out waiting
for missing ack packets; disconnecting</tt><tt><br>
</tt><tt>Mar 16 17:31:17 ovmdrbd02 kernel: d-con drbdSrvPool: error
receiving Data, e: -110 l: 126976!</tt><tt><br>
</tt><tt>Mar 16 17:31:17 ovmdrbd02 kernel: d-con drbdSrvPool: peer(
Primary -> Unknown ) conn( Connected -> ProtocolError )
pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool:
tconn_finish_peer_reqs() failed</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool:
asender terminated</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool:
Terminating drbd_a_drbdSrvP</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool:
Connection closed</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: helper
command: /sbin/drbdadm fence-peer drbdSrvPool</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: conn(
ProtocolError -> Unconnected )</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool:
receiver terminated</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool:
Restarting receiver thread</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool:
receiver (re)started</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: conn(
Unconnected -> WFConnection )</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 crm-fence-peer.sh[6821]: invoked
for drbdSrvPool</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 crm-fence-peer.sh[6821]:
/usr/lib/drbd/crm-fence-peer.sh: line 226: cibadmin: command not
found</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: helper
command: /sbin/drbdadm fence-peer drbdSrvPool exit code 1 (0x100)</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool:
fence-peer helper broken, returned 1</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool:
Handshake successful: Agreed network protocol version 101</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: conn(
WFConnection -> WFReportParams )</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool:
Starting asender thread (from drbd_r_drbdSrvP [32615])</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1:
drbd_sync_handshake:</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: self
ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC0:D6FA036A059AFAC1
bits:0 flags:0</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: peer
ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC1:D6FA036A059AFAC1
bits:0 flags:0</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1:
uuid_compare()=0 by rule 40</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: peer(
Unknown -> Primary ) conn( WFReportParams -> Connected )
pdsk( DUnknown -> UpToDate )</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: susp(
1 -> 0 )</tt><br>
<br>
and this on the other one:<br>
<br>
<tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: sock was
shut down by peer</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: peer(
Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk(
UpToDate -> DUnknown ) susp( 0 -> 1 )</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: short
read (expected size 16)</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool:
asender terminated</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool:
Terminating drbd_a_drbdSrvP</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool:
Connection closed</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: helper
command: /sbin/drbdadm fence-peer drbdSrvPool</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: conn(
BrokenPipe -> Unconnected )</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool:
receiver terminated</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool:
Restarting receiver thread</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool:
receiver (re)started</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: conn(
Unconnected -> WFConnection )</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 crm-fence-peer.sh[30246]: invoked
for drbdSrvPool</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 crm-fence-peer.sh[30246]:
/usr/lib/drbd/crm-fence-peer.sh: line 226: cibadmin: command not
found</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: helper
command: /sbin/drbdadm fence-peer drbdSrvPool exit code 1 (0x100)</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool:
fence-peer helper broken, returned 1</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool:
Handshake successful: Agreed network protocol version 101</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: conn(
WFConnection -> WFReportParams )</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool:
Starting asender thread (from drbd_r_drbdSrvP [24383])</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1:
drbd_sync_handshake:</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: self
ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC1:D6FA036A059AFAC1
bits:0 flags:0</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: peer
ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC0:D6FA036A059AFAC1
bits:0 flags:0</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1:
uuid_compare()=0 by rule 40</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: peer(
Unknown -> Primary ) conn( WFReportParams -> Connected )
pdsk( DUnknown -> UpToDate )</tt><tt><br>
</tt><tt>Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: susp(
1 -> 0 )</tt><tt><br>
</tt><br>
In this case I was writing a 12 GB file from each of the three OVM
servers onto the DRBD volume, with multipathd set up in multibus
mode.<br>
When I disabled one iSCSI target, the test passed without any issue,
so it must somehow come down to the concurrent writes. As I read the
logs, drbd02 waits for ack packets from drbd01, runs into a timeout,
and therefore calls the fence-peer handler (which in turn fails with
"cibadmin: command not found", as expected with Pacemaker not yet
installed). This shouldn't happen in the first place, right? It then
restarts the receiver and picks up the connection again. Alas, I
can't find any trace of a network issue on the 10 GbE link, so I am
really at a loss here.<br>
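<br>
One thing I have not touched yet are the net timeouts. If riding out
such stalls is even the right approach, I assume it would look
something like the following (the values are guesses on my part, not
tested):<br>
<br>
<tt>net {</tt><br>
<tt>&nbsp;&nbsp;&nbsp;&nbsp;# assumption: untested, illustrative values only</tt><br>
<tt>&nbsp;&nbsp;&nbsp;&nbsp;timeout&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;90;&nbsp;&nbsp;# unit 0.1 s, i.e. 9 s (default 60 = 6 s)</tt><br>
<tt>&nbsp;&nbsp;&nbsp;&nbsp;ping-timeout 10;&nbsp;&nbsp;# unit 0.1 s, i.e. 1 s</tt><br>
<tt>&nbsp;&nbsp;&nbsp;&nbsp;ko-count&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;7;&nbsp;&nbsp;# give up on the peer after 7 timed-out requests</tt><br>
<tt>}</tt><br>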
<br>
Finally, here are the DRBD and SCST configs I used:<br>
<br>
<tt>/etc/drbd.d/global_common.conf</tt><tt><br>
</tt><tt>global {</tt><tt><br>
</tt><tt> usage-count yes;</tt><tt><br>
</tt><tt>}</tt><tt><br>
</tt><tt>common {</tt><tt><br>
</tt><tt> net {</tt><tt><br>
</tt><tt> protocol C;</tt><tt><br>
</tt><tt> allow-two-primaries yes;</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt>}</tt><tt><br>
</tt><br>
<tt>/etc/drbd.d/drbdSrvPool.res</tt><tt><br>
</tt><tt>resource drbdSrvPool {</tt><tt><br>
</tt><tt> startup {</tt><tt><br>
</tt><tt> become-primary-on both;</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt><br>
</tt><tt> net {</tt><tt><br>
</tt><tt> sndbuf-size 0;</tt><tt><br>
</tt><tt> protocol C;</tt><tt><br>
</tt><tt> allow-two-primaries yes;</tt><tt><br>
</tt><tt> after-sb-0pri discard-zero-changes;</tt><tt><br>
</tt><tt> after-sb-1pri discard-secondary;</tt><tt><br>
</tt><tt> after-sb-2pri disconnect;</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt><br>
</tt><tt> on ovmdrbd01 {</tt><tt><br>
</tt><tt> device /dev/drbd1;</tt><tt><br>
</tt><tt> disk /dev/ovmPool01/drbdSrvPool;</tt><tt><br>
</tt><tt> address 192.168.2.1:7789;</tt><tt><br>
</tt><tt> meta-disk internal;</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt> on ovmdrbd02 {</tt><tt><br>
</tt><tt> device /dev/drbd1;</tt><tt><br>
</tt><tt> disk /dev/ovmPool01/drbdSrvPool;</tt><tt><br>
</tt><tt> address 192.168.2.2:7789;</tt><tt><br>
</tt><tt> meta-disk internal;</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt><br>
</tt><tt> disk {</tt><tt><br>
</tt><tt> c-plan-ahead 0;</tt><tt><br>
</tt><tt> resync-rate 256M;</tt><tt><br>
</tt><tt> fencing resource-and-stonith;</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt><br>
</tt><tt> handlers {</tt><tt><br>
</tt><tt> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";</tt><tt><br>
</tt><tt> after-resync-target
"/usr/lib/drbd/crm-unfence-peer.sh";</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt>}</tt><tt><br>
</tt><br>
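For completeness, I brought the resource up the usual way; from
memory, the sequence was roughly this (so take the exact order with
a grain of salt):<br>
<br>
<tt># on both nodes</tt><br>
<tt>drbdadm create-md drbdSrvPool</tt><br>
<tt>drbdadm up drbdSrvPool</tt><br>
<tt># on the first node only, to kick off the initial sync</tt><br>
<tt>drbdadm primary --force drbdSrvPool</tt><br>
<tt># then on the second node as well, for dual-primary</tt><br>
<tt>drbdadm primary drbdSrvPool</tt><br>
<br>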
<tt>/etc/scst.conf</tt><tt><br>
</tt><tt>HANDLER vdisk_blockio {</tt><tt><br>
</tt><tt> DEVICE drbdSrvPool {</tt><tt><br>
</tt><tt> filename /dev/drbd1</tt><tt><br>
</tt><tt> threads_num 2</tt><tt><br>
</tt><tt> nv_cache 0</tt><tt><br>
</tt><tt> write_through 1</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt> DEVICE drbdVMPool01 {</tt><tt><br>
</tt><tt> filename /dev/drbd2</tt><tt><br>
</tt><tt> threads_num 2</tt><tt><br>
</tt><tt> nv_cache 0</tt><tt><br>
</tt><tt> write_through 1</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt>}</tt><tt><br>
</tt><tt><br>
</tt><tt><br>
</tt><tt>TARGET_DRIVER iscsi {</tt><tt><br>
</tt><tt> enabled 1</tt><tt><br>
</tt><tt><br>
</tt><tt> TARGET iqn.2013-03.ovmdrbd02:drbdSrvPool {</tt><tt><br>
</tt><tt> LUN 0 drbdSrvPool</tt><tt><br>
</tt><tt> LUN 1 drbdVMPool01</tt><tt><br>
</tt><tt> enabled 1</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt>}</tt><tt><br>
</tt><br>
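Since multibus came up above: the multipath side on the OVM servers
boils down to something like this (quoting from memory, so the exact
stanza is approximate):<br>
<br>
<tt># /etc/multipath.conf (approximate)</tt><br>
<tt>defaults {</tt><br>
<tt>&nbsp;&nbsp;&nbsp;&nbsp;path_grouping_policy multibus</tt><br>
<tt>&nbsp;&nbsp;&nbsp;&nbsp;path_selector&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"round-robin 0"</tt><br>
<tt>}</tt><br>
<br>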
Any suggestion is highly appreciated.<br>
<br>
Cheers,<br>
Stephan<br>
<br>
</font>
</body>
</html>