Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I have the same problem on a similar setup, the only difference being that I
run NFS on top instead of iSCSI. When I take one of the nodes down with a hard
reboot I get a split brain that I cannot resolve. If I try to manually load the
module on the failed node and do `drbdadm -- --discard-my-data connect all` I
get the following in /proc/drbd:

version: 8.0pre5 (api:84/proto:83)
SVN Revision: 2481M build by root@, 2007-10-16 08:52:04
 0: cs:Connected st:Secondary/Primary ds:Diskless/UpToDate r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0

... which suggests that the system is Diskless. How do I recover from that
state without doing a full sync and without stopping the working node? (Two
rough sketches are appended below the quoted message.)

On Wednesday 17 October 2007 17:52:34 Rudolph Bott wrote:
> Hey list,
>
> I'm currently trying to set up a drbd/iscsi/heartbeat environment. This
> seems to work out quite well:
>
> - two boxes with drbd storage
> - heartbeat to monitor links and to start iscsi-target etc.
>
> Upon failure of the primary node, heartbeat starts iscsi-target on the
> secondary and makes it the primary node. This works great even while
> clients are writing to the filesystem via iSCSI - the iSCSI initiator
> simply reconnects to the cluster IP and continues writing.
>
> But as soon as node1 comes back, both nodes complain about a "split brain"
> situation and refuse to resync - although nothing has been written to the
> device on node1 since its disconnect! Shouldn't heartbeat handle this
> situation? On top of that, I'm not able to resync the devices without
> playing around with various drbdadm commands (including taking both sides
> down completely - which would not be an acceptable solution in a production
> environment).
>
> Both sides are identically configured Debian Etch systems, using heartbeat
> v2 packages and drbd8 packages from backports.org:
>
> version: 8.0.4 (api:86/proto:86)
> SVN Revision: 2947 build by root at nas02, 2007-10-16 13:43:43
>
> Please note that the two systems do NOT have identical hardware components -
> it's just a test environment with different storage capacities (~230GB vs.
> ~60GB).
>
> Below is a log extract and my heartbeat configuration.
>
> dmesg output from nas02 (after nas01 has been started again):
>
> r8169: eth1: link up
> drbd0: conn( WFConnection -> WFReportParams )
> drbd0: Handshake successful: DRBD Network Protocol version 86
> drbd0: Considerable difference in lower level device sizes: 121720712s vs. 455185536s
> drbd0: Split-Brain detected, dropping connection!
> drbd0: self C765DF24A2676E31:DDFCF96A854616A7:B34DB3CFA71F57C2:15641BAC6660D448
> drbd0: peer E785C3CFDDDC5C05:DDFCF96A854616A7:B34DB3CFA71F57C2:15641BAC6660D448
> drbd0: conn( WFReportParams -> Disconnecting )
> drbd0: error receiving ReportState, l: 4!
> drbd0: meta connection shut down by peer.
> drbd0: asender terminated
> drbd0: tl_clear()
> drbd0: Connection closed
> drbd0: conn( Disconnecting -> StandAlone )
> drbd0: receiver terminated
>
> ha.cf (taken from nas02; apart from the ucast settings they are identical):
>
> logfacility daemon
> keepalive 1
> deadtime 10
> initdead 60
> ucast eth1 172.16.15.1
> ucast eth0 192.168.0.30
> auto_failback off
> node nas01
> node nas02
>
> haresources:
>
> nas01 192.168.0.34 drbddisk::storage iscsi-target
> MailTo::technik at megabit.net
>
> and here is a quick "drawing" of the whole layout:
>
>                 iSCSI initiators
>                        |
>     /---------------Switch---------------\
>     |                                    |
> 192.168.0.30                        192.168.0.31
>    nas01    ClusterIP 192.168.0.34     nas02
> 172.16.15.1                         172.16.15.2
>     |                                    |
>     \------------CrossOver Link----------/
>
> With kind regards
>
> Rudolph Bott
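
Sketch 1, on the Diskless state above: ds:Diskless/UpToDate normally means the
local backing device was never attached after the reboot, so only the network
side of the resource is running. Assuming the on-disk metadata on the failed
node survived and the resource is named "storage" (the name taken from the
haresources line; substitute your own), a sequence along these lines should
re-attach the disk and start a bitmap-based resync rather than a full one -
this is a rough sketch of my understanding, not a tested recipe:

  # on the failed (currently Diskless) node
  drbdadm attach storage            # re-attach the local backing device
  cat /proc/drbd                    # check whether a resync starts on its own
  # only if a split brain is reported again after attaching:
  drbdadm disconnect storage
  drbdadm -- --discard-my-data connect storage   # discard local changes
  # on the surviving primary, only if it has dropped to StandAlone:
  drbdadm connect storage

If the attach step fails, the backing device or its metadata is probably the
real problem, and dmesg should say why.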
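
Sketch 2, on the quoted question of whether this can be handled automatically:
heartbeat itself will not resolve a DRBD split brain, but DRBD 8 has automatic
split-brain recovery policies that go in the net section of drbd.conf. A
minimal sketch, assuming the resource is called "storage" and that it is
acceptable for the node that was secondary to discard its data (the option
names are from the DRBD 8 documentation; the surrounding resource block is
illustrative only):

  resource storage {
    net {
      after-sb-0pri discard-zero-changes;  # no primaries: the side with no new writes discards
      after-sb-1pri discard-secondary;     # one primary: the secondary discards and resyncs
      after-sb-2pri disconnect;            # two primaries: do not guess, stay disconnected
    }
    # ... existing on-host / disk sections unchanged ...
  }

With something like this in place, the scenario from the original mail (node1
returns with no new writes) should resync automatically instead of both sides
dropping to StandAlone.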