[DRBD-user] drbd/heartbeat leads to split-brain

Wed Oct 17 16:52:34 CEST 2007

Hey list,

I'm currently trying to set up a drbd/iscsi/heartbeat environment. This seems to work out quite well:

- two boxes with drbd storage
- heartbeat to monitor links and to start iscsi-target etc.

Upon failure of the primary node, heartbeat starts iscsi-target on the secondary and makes it the primary node. This works great even while clients are writing to the filesystem via iscsi - the iscsi initiator simply reconnects to the cluster-IP and continues writing.

But as soon as node1 comes back, both nodes complain about a "split brain" situation and refuse to resync, although nothing has been written to the device on node1 since it was disconnected! Shouldn't heartbeat handle this situation? On top of that, I'm not able to resync the devices without playing around with various drbdadm commands (including taking both sides down completely, which would not be an acceptable solution in a production environment).

Both sides are identically configured Debian etch systems, using heartbeat v2 packages and drbd8 packages from backports.org:
version: 8.0.4 (api:86/proto:86)
SVN Revision: 2947 build by root at nas02, 2007-10-16 13:43:43

Please note that the two systems do NOT have identical hardware components - it's just a test environment with different storage capacities (~230GB vs. ~60GB).
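
As a cross-check of those sizes, here is a small Python sketch that converts the two sector counts DRBD reports in the dmesg extract further down into gigabytes. It assumes the "s" suffix means 512-byte sectors; the function names are made up for illustration and are not part of any DRBD tool.

```python
# Assumption: DRBD's "s" suffix in the size warning denotes 512-byte sectors.
SECTOR_BYTES = 512

def sectors_to_gb(sectors):
    """Convert a sector count to decimal gigabytes (10**9 bytes)."""
    return sectors * SECTOR_BYTES / 10**9

def sectors_to_gib(sectors):
    """Convert a sector count to binary gibibytes (2**30 bytes)."""
    return sectors * SECTOR_BYTES / 2**30

# The two values from the "Considerable difference in lower level
# device sizes" line in the log.
SIZES = (121720712, 455185536)

if __name__ == "__main__":
    for s in SIZES:
        print(f"{s} sectors = {sectors_to_gb(s):.1f} GB = {sectors_to_gib(s):.1f} GiB")
```

Under that assumption the two values come out at roughly 62 GB and 233 GB, which matches the ~60GB and ~230GB disks mentioned above.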

Below are a log extract and my heartbeat configuration:

dmesg output from nas02 (after nas01 has been started again):
r8169: eth1: link up
drbd0: conn( WFConnection -> WFReportParams )
drbd0: Handshake successful: DRBD Network Protocol version 86
drbd0: Considerable difference in lower level device sizes: 121720712s vs. 455185536s
drbd0: Split-Brain detected, dropping connection!
drbd0: self C765DF24A2676E31:DDFCF96A854616A7:B34DB3CFA71F57C2:15641BAC6660D448
drbd0: peer E785C3CFDDDC5C05:DDFCF96A854616A7:B34DB3CFA71F57C2:15641BAC6660D448
drbd0: conn( WFReportParams -> Disconnecting )
drbd0: error receiving ReportState, l: 4!
drbd0: meta connection shut down by peer.
drbd0: asender terminated
drbd0: tl_clear()
drbd0: Connection closed
drbd0: conn( Disconnecting -> StandAlone )
drbd0: receiver terminated
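
To make the "self" and "peer" lines easier to compare, here is a minimal Python sketch that splits them into their four fields and lists the ones that differ. It is not part of DRBD. The field names (current, bitmap and two history entries) are an assumption about the order in which DRBD 8 prints its generation UUIDs, and the helper names are hypothetical.

```python
# Assumption: the four colon-separated values are printed in this order.
FIELDS = ("current", "bitmap", "history1", "history2")

def parse_uuids(line):
    """Return a dict of the four UUIDs from a 'drbd0: self ...' or 'peer ...' line."""
    parts = line.split()[-1].split(":")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} UUIDs, got {len(parts)}")
    return dict(zip(FIELDS, parts))

def differing_fields(self_line, peer_line):
    """List the field names whose values differ between the two nodes."""
    mine, theirs = parse_uuids(self_line), parse_uuids(peer_line)
    return [name for name in FIELDS if mine[name] != theirs[name]]

# The two lines exactly as they appear in the dmesg output above.
SELF_LINE = "drbd0: self C765DF24A2676E31:DDFCF96A854616A7:B34DB3CFA71F57C2:15641BAC6660D448"
PEER_LINE = "drbd0: peer E785C3CFDDDC5C05:DDFCF96A854616A7:B34DB3CFA71F57C2:15641BAC6660D448"

if __name__ == "__main__":
    print(differing_fields(SELF_LINE, PEER_LINE))
```

For the log above only the first field differs; the other three are identical on both nodes. If the assumed field order is right, that would fit each node having started a new data generation on its own while the connection was down.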

ha.cf (taken from nas02; apart from the ucast settings, the files on both nodes are identical):
logfacility     daemon
keepalive 1
deadtime 10
initdead 60
ucast eth1      172.16.15.1
ucast eth0      192.168.0.30
auto_failback off
node nas01
node nas02

haresources:
nas01 192.168.0.34 drbddisk::storage iscsi-target MailTo::technik at megabit.net

and here is a quick "drawing" of the whole layout:

                 iSCSI initiators
                        |
       /--------------Switch------------\
      |                                  |
192.168.0.30                      192.168.0.31
   nas01     ClusterIP 192.168.0.34    nas02
172.16.15.1                       172.16.15.2
      |                                  |
      \-----------CrossOver Link--------/

With kind regards

Rudolph Bott
-- 
Megabit Informationstechnik GmbH
Karstr.25  41068 Moenchengladbach  Tel:02161/30898-0  Fax:-18
AG MG HRB 10141, GF: Dipl.-Ing. Thomas Tillig, Michael Benten