[DRBD-user] Dual-primary setup: disconnecting

Stephan Budach stephan.budach at jvm.de
Sun Mar 17 08:19:26 CET 2013

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi all,

I am running DRBD 8.4.3 in a dual-primary setup. The DRBD interlink is a
10 GbE Intel X520 link, cabled back-to-back between the two Supermicro
boxes. I have configured two SCST iSCSI targets from two DRBD volumes,
which are set up as multipathed targets on three Oracle VM servers.
Since these LUNs shall serve as storage repositories, they are
initialized as OCFS2 volumes.
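
On the OVM servers the two paths to a LUN are bundled by dm-multipath in
multibus mode, so I/O is spread across both DRBD hosts. The relevant part
of /etc/multipath.conf is along these lines (a trimmed sketch; the
device-specific sections are omitted):

     defaults {
         path_grouping_policy multibus
         path_selector        "round-robin 0"
         user_friendly_names  yes
     }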

So the setup is like this:

DRBD hosts:

- Supermicro chassis, 32 GB RAM, 2 x Intel Xeon E5-2603 (1.8 GHz), 2 x 
LSI 9207-8i, 2 x Intel X520-T2
- CentOS 6.3
- DRBD 8.4.3
- SCST svn 3.x
- 10 GbE DRBD interconnect
- 2 x 1 GbE LACP bond for iSCSI (quick link checks below)
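
The quick link checks on the DRBD hosts are nothing fancy; I just make
sure the back-to-back 10 GbE link and the iSCSI bond report healthy,
roughly like this (interface and bond names are placeholders for the
real ones):

     # speed and link state of the 10 GbE interconnect
     ethtool eth2 | grep -E 'Speed|Link detected'
     # LACP bond status of the iSCSI-facing bond
     cat /proc/net/bonding/bond0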


The whole setup still lacks the Pacemaker part; I have not yet gotten 
around to configuring it, so bear with me on that. The first and primary 
goal was to verify the iSCSI side in terms of speed and reliability, and 
that is exactly where I am having issues.

I ran three concurrent tests from my OVM servers using fio against one 
of the DRBD volumes/SCST LUNs, and these tests passed without any issue. 
However, it seems I can get DRBD into trouble once I exceed a certain 
throughput: DRBD apparently cannot keep up with the 
concurrent/conflicting writes and starts to disconnect/reconnect, and I 
am wondering what might cause this.
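
For reference, the fio jobs in those passing tests were roughly of this
shape per OVM server (block size, queue depth and the target path are
illustrative, not the exact values I used):

     fio --name=seqwrite --filename=/OVS/Repositories/test/fio.dat \
         --rw=write --bs=1M --size=12g --direct=1 \
         --ioengine=libaio --iodepth=16
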
When the disconnect happens, /var/log/messages shows this on one host:

Mar 16 17:31:17 ovmdrbd02 kernel: block drbd1: Timed out waiting for missing ack packets; disconnecting
Mar 16 17:31:17 ovmdrbd02 kernel: d-con drbdSrvPool: error receiving Data, e: -110 l: 126976!
Mar 16 17:31:17 ovmdrbd02 kernel: d-con drbdSrvPool: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: tconn_finish_peer_reqs() failed
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: asender terminated
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Terminating drbd_a_drbdSrvP
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Connection closed
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: helper command: /sbin/drbdadm fence-peer drbdSrvPool
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: conn( ProtocolError -> Unconnected )
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: receiver terminated
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Restarting receiver thread
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: receiver (re)started
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: conn( Unconnected -> WFConnection )
Mar 16 17:31:18 ovmdrbd02 crm-fence-peer.sh[6821]: invoked for drbdSrvPool
Mar 16 17:31:18 ovmdrbd02 crm-fence-peer.sh[6821]: /usr/lib/drbd/crm-fence-peer.sh: line 226: cibadmin: command not found
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: helper command: /sbin/drbdadm fence-peer drbdSrvPool exit code 1 (0x100)
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: fence-peer helper broken, returned 1
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Handshake successful: Agreed network protocol version 101
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: conn( WFConnection -> WFReportParams )
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: Starting asender thread (from drbd_r_drbdSrvP [32615])
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: drbd_sync_handshake:
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: self ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC0:D6FA036A059AFAC1 bits:0 flags:0
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: peer ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC1:D6FA036A059AFAC1 bits:0 flags:0
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: uuid_compare()=0 by rule 40
Mar 16 17:31:18 ovmdrbd02 kernel: block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
Mar 16 17:31:18 ovmdrbd02 kernel: d-con drbdSrvPool: susp( 1 -> 0 )

and this on the other one:

Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: sock was shut down by peer
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: short read (expected size 16)
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: asender terminated
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Terminating drbd_a_drbdSrvP
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Connection closed
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: helper command: /sbin/drbdadm fence-peer drbdSrvPool
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: conn( BrokenPipe -> Unconnected )
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: receiver terminated
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Restarting receiver thread
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: receiver (re)started
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: conn( Unconnected -> WFConnection )
Mar 16 17:31:18 ovmdrbd01 crm-fence-peer.sh[30246]: invoked for drbdSrvPool
Mar 16 17:31:18 ovmdrbd01 crm-fence-peer.sh[30246]: /usr/lib/drbd/crm-fence-peer.sh: line 226: cibadmin: command not found
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: helper command: /sbin/drbdadm fence-peer drbdSrvPool exit code 1 (0x100)
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: fence-peer helper broken, returned 1
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Handshake successful: Agreed network protocol version 101
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: conn( WFConnection -> WFReportParams )
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: Starting asender thread (from drbd_r_drbdSrvP [24383])
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: drbd_sync_handshake:
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: self ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC1:D6FA036A059AFAC1 bits:0 flags:0
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: peer ED1775CE98CDC8C7:0000000000000000:D6FB036A059AFAC0:D6FA036A059AFAC1 bits:0 flags:0
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: uuid_compare()=0 by rule 40
Mar 16 17:31:18 ovmdrbd01 kernel: block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
Mar 16 17:31:18 ovmdrbd01 kernel: d-con drbdSrvPool: susp( 1 -> 0 )

In this case I was writing a 12 GB file from each of the three OVM 
servers onto the DRBD volume, while multipathd was set up in multibus 
mode. When I disabled one of the two iSCSI targets, the same test passed 
without any issue, so it must somehow be down to the conflicting writes. 
What the logs seem to tell me is that drbd02 waits for ack packets from 
drbd01 and runs into a timeout, which would normally fence the peer; 
that shouldn't happen in the first place, right? It then restarts the 
receiver and picks up the connection again. Alas, I can't find any trace 
of a network issue on the 10 GbE connection, so I am really at a loss here.
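
For completeness, these are the kinds of checks behind that statement;
none of them show anything suspicious on the interconnect (eth2 again
stands for the 10 GbE interface):

     # NIC error/drop counters
     ethtool -S eth2 | grep -Ei 'err|drop|crc'
     # per-interface packet statistics
     ip -s link show eth2
     # TCP retransmits on the host
     netstat -s | grep -i retrans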

Finally, here are the DRBD and SCST configs I used:

/etc/drbd.d/global_common.conf
global {
   usage-count yes;
}
common {
   net {
     protocol C;
     allow-two-primaries yes;
   }
}

/etc/drbd.d/drbdSrvPool.res
resource drbdSrvPool {
     startup {
         become-primary-on both;
     }

     net {
         sndbuf-size 0;
         protocol C;
         allow-two-primaries yes;
         after-sb-0pri discard-zero-changes;
         after-sb-1pri discard-secondary;
         after-sb-2pri disconnect;
     }

     on ovmdrbd01 {
         device    /dev/drbd1;
         disk      /dev/ovmPool01/drbdSrvPool;
         address   192.168.2.1:7789;
         meta-disk internal;
     }

     on ovmdrbd02 {
         device    /dev/drbd1;
         disk      /dev/ovmPool01/drbdSrvPool;
         address   192.168.2.2:7789;
         meta-disk internal;
     }

     disk {
         c-plan-ahead 0;
         resync-rate 256M;
         fencing resource-and-stonith;
     }

     handlers {
         fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
         after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
     }
}
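
Given the "Timed out waiting for missing ack packets" message, I also
wonder about the net-level timers. They are at their defaults here, but
if tuning them is the suggestion, I would start with something like this
in the net section (the values are only a first guess, not tested):

     net {
         timeout      90;    # in 1/10 s; how long to wait for an ack before the connection is dropped
         ping-int     10;    # seconds between keep-alive pings
         ping-timeout 10;    # in 1/10 s; timeout for the keep-alive ping
     }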

/etc/scst.conf
HANDLER vdisk_blockio {
     DEVICE drbdSrvPool {
         filename /dev/drbd1
         threads_num 2
         nv_cache 0
         write_through 1
     }
     DEVICE drbdVMPool01 {
         filename /dev/drbd2
         threads_num 2
         nv_cache 0
         write_through 1
     }
}


TARGET_DRIVER iscsi {
     enabled 1

     TARGET iqn.2013-03.ovmdrbd02:drbdSrvPool {
         LUN 0 drbdSrvPool
         LUN 1 drbdVMPool01
         enabled 1
     }
}
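
On the OVM servers the targets are discovered and logged in with the
stock open-iscsi tools, roughly like this (the portal addresses are
placeholders for the iSCSI IPs of the two DRBD hosts):

     iscsiadm -m discovery -t sendtargets -p 192.168.1.1:3260
     iscsiadm -m discovery -t sendtargets -p 192.168.1.2:3260
     iscsiadm -m node --login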

Any suggestion is highly appreciated.

Cheers,
Stephan
