Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all,
I am running DRBD 8.4.3 in a dual-primary setup. The DRBD interconnect is a
10 GbE Intel X520 link, connected back-to-back between the two Supermicro
boxes. I have configured two SCST iSCSI targets from two DRBD volumes, which
are set up as multipathed targets on three Oracle VM servers. Since these
LUNs will be used as storage repositories, they are initialized as OCFS2
volumes.
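For reference, initializing such a LUN as an OCFS2 volume boils down to
something like this on one of the OVM servers (illustrative only; the device
path, label and slot count are placeholders, not my exact values):

# one node slot per OVM server, 4K block and cluster size
# /dev/mapper/srvpool is a placeholder for the multipathed LUN
mkfs.ocfs2 -N 3 -L srvpool -b 4096 -C 4096 /dev/mapper/srvpool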
So the setup is like this:
DRBD hosts:
- Supermicro chassis, 32 GB RAM, 2 x Intel E5-2603 (1.8 GHz), 2 x
LSI 9207-8i, 2 x Intel X520-T2
- CentOS 6.3
- DRBD 8.4.3
- SCST svn 3.x
- 10 GbE DRBD interconnect
- 2 x 1 GbE LACP bond for iSCSI
The whole setup still lacks the Pacemaker part; I have not yet gotten around
to configuring it, so bear with me on that. The first and primary goal was to
verify the iSCSI setup in terms of speed and reliability, and this is exactly
where I am having issues.
I ran three concurrent tests from my OVM servers using fio against one of
the DRBD volumes/SCST LUNs, and these tests passed without any issue.
However, it seems I can get DRBD into trouble once I exceed a certain
throughput: DRBD can no longer keep up with the concurrent/conflicting
writes, starts to disconnect and reconnect, and I am wondering what might
cause this.
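For illustration, the fio load was of this kind (a sketch only; the exact job
files and sizes differed, and the directory is a placeholder for the OCFS2
repository mount point):

# hypothetical fio job file, run concurrently on all three OVM servers,
# e.g. "fio seqwrite.fio"
[seqwrite]
directory=/mnt/srvpool01
rw=write
bs=1M
size=12g
direct=1
ioengine=libaio
iodepth=16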
If that happens, /var/log/messages shows this on one host:
Mar 16 17:31:17 ovmdrbd02 kernel: block drbd1: Timed out waiting for
missing ack packets; disconnecting
Mar 16 17:31:17 ovmdrbd02 kernel: d-con drbdSrvPool: error receiving
Data, e: -110 l: 126976!
Mar 16 17:31:17 ovmdrbd02 kernel: d-con drbdSrvPool: peer( Primary ->
Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown
) susp( 0 -> 1 )
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool:
tconn_finish_peer_reqs() failed
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: asender terminated
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Terminating
drbd_a_drbdSrvP
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Connection closed
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: helper command:
/sbin/drbdadm fence-peer drbdSrvPool
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: conn( ProtocolError
-> Unconnected )
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: receiver terminated
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Restarting receiver
thread
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: receiver (re)started
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: conn( Unconnected
-> WFConnection )
Mar 16 17:31:18 ovmdrbd02 crm-fence-peer.sh[6821]: invoked for drbdSrvPool
Mar 16 17:31:18 ovmdrbd02 crm-fence-peer.sh[6821]:
/usr/lib/drbd/crm-fence-peer.sh: line 226: cibadmin: command not found
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: helper command:
/sbin/drbdadm fence-peer drbdSrvPool exit code 1 (0x100)
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: fence-peer helper
broken, returned 1
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Handshake
successful: Agreed network protocol version 101
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: conn( WFConnection
-> WFReportParams )
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Starting asender
thread (from drbd_r_drbdSrvP [32615])
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: drbd_sync_handshake:
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: self
ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC0:D6FA036A059AFAC1
bits:0 flags:0
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: peer
ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC1:D6FA036A059AFAC1
bits:0 flags:0
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: uuid_compare()=0 by rule 40
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: peer( Unknown -> Primary
) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: susp( 1 -> 0 )
and this on the other one:
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: sock was shut down
by peer
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: peer( Primary ->
Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
susp( 0 -> 1 )
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: short read
(expected size 16)
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: asender terminated
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Terminating
drbd_a_drbdSrvP
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Connection closed
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: helper command:
/sbin/drbdadm fence-peer drbdSrvPool
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: conn( BrokenPipe ->
Unconnected )
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: receiver terminated
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Restarting receiver
thread
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: receiver (re)started
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: conn( Unconnected
-> WFConnection )
Mar 16 17:31:18 ovmdrbd01 crm-fence-peer.sh[30246]: invoked for drbdSrvPool
Mar 16 17:31:18 ovmdrbd01 crm-fence-peer.sh[30246]:
/usr/lib/drbd/crm-fence-peer.sh: line 226: cibadmin: command not found
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: helper command:
/sbin/drbdadm fence-peer drbdSrvPool exit code 1 (0x100)
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: fence-peer helper
broken, returned 1
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Handshake
successful: Agreed network protocol version 101
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: conn( WFConnection
-> WFReportParams )
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Starting asender
thread (from drbd_r_drbdSrvP [24383])
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: drbd_sync_handshake:
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: self
ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC1:D6FA036A059AFAC1
bits:0 flags:0
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: peer
ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC0:D6FA036A059AFAC1
bits:0 flags:0
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: uuid_compare()=0 by rule 40
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: peer( Unknown -> Primary
) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: susp( 1 -> 0 )
In this case I was writing a 12 GB file from each of the three OVM servers
onto the DRBD volume, while multipathd was set up in multibus mode. When I
disabled one iSCSI target, this test passed without any issue, so it must
somehow be down to the conflicting writes. What these logs seem to tell me is
that drbd02 waits for some ack packets from drbd01 and runs into a timeout,
which would normally fence the peer. This shouldn't happen in the first
place, right? It then restarts the receiver and picks up the connection
again. Alas, I can't find any trace of a network issue on the 10 GbE
connection, so I am really at a loss here.
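For reference, multibus mode here simply means that all paths to a LUN sit in
a single path group and I/O is spread across both iSCSI links, i.e. a
multipath.conf along these lines (illustrative only, not a copy of my actual
file):

defaults {
    path_grouping_policy    multibus
    path_selector           "round-robin 0"
    rr_min_io               100
}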
Finally, here are the DRBD and SCST configs I used:
/etc/drbd.d/global_common.conf
global {
    usage-count yes;
}
common {
    net {
        protocol C;
        allow-two-primaries yes;
    }
}
/etc/drbd.d/drbdSrvPool.res
resource drbdSrvPool {
    startup {
        become-primary-on both;
    }
    net {
        sndbuf-size 0;
        protocol C;
        allow-two-primaries yes;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    on ovmdrbd01 {
        device /dev/drbd1;
        disk /dev/ovmPool01/drbdSrvPool;
        address 192.168.2.1:7789;
        meta-disk internal;
    }
    on ovmdrbd02 {
        device /dev/drbd1;
        disk /dev/ovmPool01/drbdSrvPool;
        address 192.168.2.2:7789;
        meta-disk internal;
    }
    disk {
        c-plan-ahead 0;
        resync-rate 256M;
        fencing resource-and-stonith;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}
/etc/scst.conf
HANDLER vdisk_blockio {
    DEVICE drbdSrvPool {
        filename /dev/drbd1
        threads_num 2
        nv_cache 0
        write_through 1
    }
    DEVICE drbdVMPool01 {
        filename /dev/drbd2
        threads_num 2
        nv_cache 0
        write_through 1
    }
}

TARGET_DRIVER iscsi {
    enabled 1
    TARGET iqn.2013-03.ovmdrbd02:drbdSrvPool {
        LUN 0 drbdSrvPool
        LUN 1 drbdVMPool01
        enabled 1
    }
}
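Coming back to the "Timed out waiting for missing ack packets" message: one
thing I have been wondering about, but have not tried yet, is whether raising
the network timeouts in the net section would let DRBD ride out short stalls
instead of disconnecting, along these lines (the options exist in DRBD 8.4,
the values are just guesses):

net {
    timeout      90;   # unit is 0.1 s, i.e. 9 s (default 60 = 6 s)
    ping-timeout 30;   # 3 s
    ko-count     10;
}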
Any suggestion is highly appreciated.
Cheers,
Stephan