Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all,

I am running DRBD 8.4.3 in a dual-primary setup. The DRBD interlink is a 10 GbE Intel X520 link, cabled directly back-to-back between the two Supermicro boxes. I have configured two SCST iSCSI targets from two DRBD volumes, which are configured as multipathed targets on three Oracle VM servers. Since these LUNs shall be used as storage repositories, they are initialized as OCFS2 volumes.

So the setup is like this:

DRBD hosts:
- Supermicro chassis, 32 GB RAM, 2 x Intel E5-2603 1.8 GHz, 2 x LSI 9207-8i, 2 x Intel X520-T2
- CentOS 6.3
- DRBD 8.4.3
- SCST svn 3.x
- 10 GbE DRBD interconnect
- 2 x 1 GbE LACP bond for iSCSI

The whole setup still lacks the Pacemaker stuff; I did not yet get around to configuring it, so bear with me on that. The first and primary goal was to verify the iSCSI side with regard to speed and reliability, and this is exactly where I am having issues.

I ran three concurrent tests from my OVM servers using fio against one of the DRBD volumes/SCST LUNs, and these tests passed without any issue. However, it seems that I am able to get DRBD into trouble once I exceed a certain throughput: DRBD can't keep up with the concurrent/conflicting writes and then starts to disconnect/re-connect, and I am wondering what might cause this.
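To give an idea of the workload: each fio run was basically one large sequential write per OVM server, roughly like the following (a sketch rather than the exact job file; the target path, block size, iodepth and ioengine shown here are assumptions):

  # One 12 GB sequential write per OVM server against the OCFS2 repository.
  # The mount point below is a placeholder, not the real repository path.
  fio --name=drbd-writetest --filename=/mnt/ocfs2/fio-test.bin \
      --size=12g --rw=write --bs=1M --direct=1 \
      --ioengine=libaio --iodepth=16

Running that on all three OVM servers at the same time is what triggers the behaviour below.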
When that happens, /var/log/messages shows this on one host:

Mar 16 17:31:17 ovmdrbd02 kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Mar 16 17:31:17 ovmdrbd02 kernel: d-con drbdSrvPool: error receiving Data, e: -110 l: 126976!
Mar 16 17:31:17 ovmdrbd02 kernel: d-con drbdSrvPool: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: tconn_finish_peer_reqs() failed
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: asender terminated
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Terminating drbd_a_drbdSrvP
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Connection closed
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: helper command: /sbin/drbdadm fence-peer drbdSrvPool
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: conn( ProtocolError -> Unconnected )
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: receiver terminated
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Restarting receiver thread
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: receiver (re)started
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: conn( Unconnected -> WFConnection )
Mar 16 17:31:18 ovmdrbd02 crm-fence-peer.sh[6821]: invoked for drbdSrvPool
Mar 16 17:31:18 ovmdrbd02 crm-fence-peer.sh[6821]: /usr/lib/drbd/crm-fence-peer.sh: line 226: cibadmin: command not found
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: helper command: /sbin/drbdadm fence-peer drbdSrvPool exit code 1 (0x100)
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: fence-peer helper broken, returned 1
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Handshake successful: Agreed network protocol version 101
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: conn( WFConnection -> WFReportParams )
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Starting asender thread (from drbd_r_drbdSrvP [32615])
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: drbd_sync_handshake:
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: self ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC0:D6FA036A059AFAC1 bits:0 flags:0
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: peer ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC1:D6FA036A059AFAC1 bits:0 flags:0
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: uuid_compare()=0 by rule 40
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: susp( 1 -> 0 )

and this on the other one:

Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: sock was shut down by peer
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: short read (expected size 16)
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: asender terminated
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Terminating drbd_a_drbdSrvP
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Connection closed
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: helper command: /sbin/drbdadm fence-peer drbdSrvPool
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: conn( BrokenPipe -> Unconnected )
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: receiver terminated
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Restarting receiver thread
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: receiver (re)started
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: conn( Unconnected -> WFConnection )
Mar 16 17:31:18 ovmdrbd01 crm-fence-peer.sh[30246]: invoked for drbdSrvPool
Mar 16 17:31:18 ovmdrbd01 crm-fence-peer.sh[30246]: /usr/lib/drbd/crm-fence-peer.sh: line 226: cibadmin: command not found
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: helper command: /sbin/drbdadm fence-peer drbdSrvPool exit code 1 (0x100)
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: fence-peer helper broken, returned 1
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Handshake successful: Agreed network protocol version 101
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: conn( WFConnection -> WFReportParams )
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Starting asender thread (from drbd_r_drbdSrvP [24383])
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: drbd_sync_handshake:
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: self ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC1:D6FA036A059AFAC1 bits:0 flags:0
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: peer ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC0:D6FA036A059AFAC1 bits:0 flags:0
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: uuid_compare()=0 by rule 40
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: susp( 1 -> 0 )

In this case I was writing a 12 GB file from each of the three OVM servers onto the DRBD volume, while multipathd was set up in multibus mode. When I disabled one iSCSI target, the test passed without any issue, so it must for some reason be due to the conflicting writes.

What these logs tell me is that drbd02 waits for some ack packets from drbd01 and runs into a timeout, which would normally fence the peer. So this shouldn't happen in the first place, right? It then restarts the receiver and picks up the connection again. Alas, I can't find any trace of a network issue on the 10 GbE connection, so I am really at a loss here.
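Regarding multibus: on the OVM servers the relevant part of /etc/multipath.conf boils down to something like this (sketched from memory; the WWID is a placeholder and the defaults section may differ slightly):

  # Both paths (one per SCST target) end up in a single path group,
  # so I/O is spread across both DRBD primaries at the same time.
  defaults {
          user_friendly_names yes
  }
  multipaths {
          multipath {
                  wwid                  "<LUN WWID here>"   # placeholder
                  alias                 drbdSrvPool
                  path_grouping_policy  multibus
          }
  }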
Finally, here are the DRBD and SCST configs I used:

/etc/drbd.d/global_common.conf

global {
    usage-count yes;
}
common {
    net {
        protocol C;
        allow-two-primaries yes;
    }
}

/etc/drbd.d/drbdSrvPool.res

resource drbdSrvPool {
    startup {
        become-primary-on both;
    }
    net {
        sndbuf-size 0;
        protocol C;
        allow-two-primaries yes;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    on ovmdrbd01 {
        device /dev/drbd1;
        disk /dev/ovmPool01/drbdSrvPool;
        address 192.168.2.1:7789;
        meta-disk internal;
    }
    on ovmdrbd02 {
        device /dev/drbd1;
        disk /dev/ovmPool01/drbdSrvPool;
        address 192.168.2.2:7789;
        meta-disk internal;
    }
    disk {
        c-plan-ahead 0;
        resync-rate 256M;
        fencing resource-and-stonith;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}

/etc/scst.conf

HANDLER vdisk_blockio {
    DEVICE drbdSrvPool {
        filename /dev/drbd1
        threads_num 2
        nv_cache 0
        write_through 1
    }
    DEVICE drbdVMPool01 {
        filename /dev/drbd2
        threads_num 2
        nv_cache 0
        write_through 1
    }
}

TARGET_DRIVER iscsi {
    enabled 1
    TARGET iqn.2013-03.ovmdrbd02:drbdSrvPool {
        LUN 0 drbdSrvPool
        LUN 1 drbdVMPool01
        enabled 1
    }
}

Any suggestion is highly appreciated.

Cheers,
Stephan