[DRBD-user] Just restarting secondary causes split brain, can someone explain why please?

Tue Mar 6 17:30:47 CET 2007

Weird is definitely one way to describe it.  I just ran another test;
oxygen was primary, hydrogen was secondary:

oxygen:/etc/ha.d # cat /proc/drbd
version: 8.0.1 (api:86/proto:86)
SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:520456 nr:256 dw:521490 dr:521377 al:0 bm:188 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:64942 misses:170 starving:0 dirty:0 changed:170
        act_log: used:0/127 hits:52 misses:0 starving:0 dirty:0 changed:0
oxygen:/etc/ha.d # drbdsetup r0 get-gi
05BB4DA9C5CC0319:0000000000000000:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:1:1:0:0

hydrogen:~ # cat /proc/drbd
version: 8.0.1 (api:86/proto:86)
SVN Revision: 2784 build by root at hydrogen, 2007-03-05 08:47:01
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:264 dw:264 dr:0 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
        act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
hydrogen:~ # drbdsetup r0 get-gi
05BB4DA9C5CC0318:0000000000000000:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:0:1:0:0
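
To compare the two get-gi lines above without eyeballing hex, a minimal sketch follows. The names `GenId`, `parse_gi` and `same_generation` are hypothetical helpers, not part of DRBD. The field layout (four UUIDs followed by flags) and the reading of the lowest UUID bit as a "was primary" marker are assumptions taken from how the tuples above line up, not something confirmed in this thread.

```python
from typing import NamedTuple, Tuple


class GenId(NamedTuple):
    current: int               # current data generation UUID
    bitmap: int                # bitmap UUID
    history: Tuple[int, int]   # two history UUIDs
    flags: Tuple[int, ...]     # trailing flag fields, left uninterpreted


def parse_gi(line: str) -> GenId:
    """Split one colon-separated get-gi line into UUIDs and flags."""
    parts = line.strip().split(":")
    uuids = [int(p, 16) for p in parts[:4]]
    flags = tuple(int(p) for p in parts[4:])
    return GenId(uuids[0], uuids[1], (uuids[2], uuids[3]), flags)


PRIMARY_BIT = 1  # assumption: lowest bit only records the primary role


def same_generation(a: int, b: int) -> bool:
    """Compare two UUIDs while ignoring the assumed primary bit."""
    return (a & ~PRIMARY_BIT) == (b & ~PRIMARY_BIT)


oxygen = parse_gi(
    "05BB4DA9C5CC0319:0000000000000000:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:1:1:0:0"
)
hydrogen = parse_gi(
    "05BB4DA9C5CC0318:0000000000000000:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:0:1:0:0"
)

print(same_generation(oxygen.current, hydrogen.current))  # True
print(oxygen.history == hydrogen.history)                 # True
```

Under those assumptions the two nodes hold the same data generation while connected; the only differences are the last bit of the current UUID and the third flag field.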

I rebooted oxygen.  When I start drbd (service drbd start) I end up with:

oxygen:~ # service drbd start
Starting DRBD resources:    [ d0 s0 n0 ].
..........
***************************************************************
 DRBD's startup script waits for the peer node(s) to appear.
 - In case this node was already a degraded cluster before the
   reboot the timeout is 0 seconds. [degr-wfc-timeout]
 - If the peer was available before the reboot the timeout will
   expire after 0 seconds. [wfc-timeout]
   (These values are for resource 'r0'; 0 sec -> wait forever)
 To abort waiting enter 'yes' [ 461]:yes

(it sits there until I tell it to abort)

And hydrogen gives me:
hydrogen:~ # cat /proc/drbd
version: 8.0.1 (api:86/proto:86)
SVN Revision: 2784 build by root at hydrogen, 2007-03-05 08:47:01
 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown   r---
    ns:0 nr:264 dw:520 dr:316 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
        act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0
hydrogen:~ # drbdsetup r0 get-gi
27CB9D1E08B38869:05BB4DA9C5CC0318:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:1:0:0:0
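
As a rough check of what changed on hydrogen, the sketch below compares this tuple with the one oxygen reported before its reboot. The helper `uuids` is hypothetical, and masking the lowest bit assumes it is only a role marker. Oxygen's get-gi output after the reboot is not shown above, so nothing here says what oxygen holds on disk now.

```python
MASK = ~1  # assumption: the lowest UUID bit is only a "was primary" marker


def uuids(gi_line):
    """Return the four UUIDs (current, bitmap, history 1, history 2) as ints."""
    return [int(field, 16) for field in gi_line.split(":")[:4]]


# Values copied from the get-gi output earlier in this message.
oxygen_before = uuids(
    "05BB4DA9C5CC0319:0000000000000000:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:1:1:0:0"
)
hydrogen_after = uuids(
    "27CB9D1E08B38869:05BB4DA9C5CC0318:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:1:0:0:0"
)

# Did hydrogen move the old shared current UUID into its bitmap slot?
rotated = (hydrogen_after[1] & MASK) == (oxygen_before[0] & MASK)

# Are the two history UUIDs untouched?
history_same = hydrogen_after[2:] == oxygen_before[2:]

print(rotated, history_same)  # True True
```

So hydrogen has a new current UUID, and the previously shared one now sits in its bitmap slot, while both history entries are unchanged.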

So it looks like something is not working, but I no longer have any clue
as to what it could be.  I'm tempted to go back to the last 0.7.x version,
seeing as I don't need primary/primary (and I could get it to work).

Thanks for your time and help.

David

Francesco Ciocchetti wrote:
> There's something weird happening here ... at least to me.
>
> If you just rebooted one of the servers, a SPLIT BRAIN should not
> happen, just a state change.  What I see in your logs is exactly
> what happened to me, and I solved it by changing the after-sb-0pri
> value to discard-younger-primary. Are you sure that DRBD has been
> started with this option enabled on both nodes?
>
> I see that after-sb-0pri only controls the split brain case when both
> of the nodes are secondary. Check what the situation is in your case,
> because maybe one node is actually primary when the split brain
> occurs ... in that case the behaviour is controlled by after-sb-1pri
> and next by after-sb-2pri, which in your case are "consensus" and then
> "disconnect" == StandAlone.
>
> just my 2 cents
>
> bye
> Francesco
>
> David wrote:
>   
>> Before the reboot, the two systems see each other and are in sync.
>> When I try to start drbd on hydrogen (which was master) after rebooting
>> it, I get:
>> hydrogen:~ # service drbd start
>> Starting DRBD resources:    [ d0 s0 n0 ].
>> ..........
>> ***************************************************************
>> DRBD's startup script waits for the peer node(s) to appear.
>> - In case this node was already a degraded cluster before the
>>  reboot the timeout is 0 seconds. [degr-wfc-timeout]
>> - If the peer was available before the reboot the timeout will
>>  expire after 0 seconds. [wfc-timeout]
>>  (These values are for resource 'r0'; 0 sec -> wait forever)
>> To abort waiting enter 'yes' [ 520]:
>>
>>
>> So right away there is a problem.  The logs show drbd complaining
>> about a split brain:
>>
>> Mar  5 17:26:49 hydrogen kernel: drbd0: conn( WFConnection ->
>> WFReportParams )
>> Mar  5 17:26:49 hydrogen kernel: drbd0: Handshake successful: DRBD
>> Network Protocol version 86
>> Mar  5 17:26:49 hydrogen kernel: drbd0: Split-Brain detected, dropping
>> connection!
>> Mar  5 17:26:49 hydrogen kernel: drbd0: self
>> C9710AB94F619A7F:F920CFF31F2A1606:C2B9EF60E881089C:2F33912A597BE6F2
>> Mar  5 17:26:49 hydrogen kernel: drbd0: peer
>> CD986B54BF6D0C8B:F920CFF31F2A1607:C2B9EF60E881089D:2F33912A597BE6F2
>> Mar  5 17:26:49 hydrogen kernel: drbd0: conn( WFReportParams ->
>> Disconnecting )
>> Mar  5 17:26:49 hydrogen kernel: drbd0: error receiving ReportState,
>> l: 4!
>> Mar  5 17:26:49 hydrogen kernel: drbd0: asender terminated
>> Mar  5 17:26:49 hydrogen kernel: drbd0: tl_clear()
>> Mar  5 17:26:49 hydrogen kernel: drbd0: Connection closed
>> Mar  5 17:26:49 hydrogen kernel: drbd0: conn( Disconnecting ->
>> StandAlone )
>> Mar  5 17:26:49 hydrogen kernel: drbd0: receiver terminated
>>
>> At the same time, oxygen (now primary) is logging:
>> Mar  5 17:26:49 oxygen kernel: drbd0: conn( WFConnection ->
>> WFReportParams )
>> Mar  5 17:26:49 oxygen kernel: drbd0: Handshake successful: DRBD
>> Network Protocol version 86
>> Mar  5 17:26:49 oxygen kernel: drbd0: Split-Brain detected, dropping
>> connection!
>> Mar  5 17:26:49 oxygen kernel: drbd0: self
>> CD986B54BF6D0C8B:F920CFF31F2A1607:C2B9EF60E881089D:2F33912A597BE6F2
>> Mar  5 17:26:49 oxygen kernel: drbd0: peer
>> C9710AB94F619A7F:F920CFF31F2A1606:C2B9EF60E881089C:2F33912A597BE6F2
>> Mar  5 17:26:49 oxygen kernel: drbd0: conn( WFReportParams ->
>> Disconnecting )
>> Mar  5 17:26:49 oxygen kernel: drbd0: error receiving ReportState, l: 4!
>> Mar  5 17:26:49 oxygen kernel: drbd0: meta connection shut down by peer.
>> Mar  5 17:26:49 oxygen kernel: drbd0: asender terminated
>> Mar  5 17:26:49 oxygen kernel: drbd0: tl_clear()
>> Mar  5 17:26:49 oxygen kernel: drbd0: Connection closed
>> Mar  5 17:26:49 oxygen kernel: drbd0: conn( Disconnecting -> StandAlone )
>> Mar  5 17:26:49 oxygen kernel: drbd0: receiver terminated
>>
>>
>> At this point I am completely confused.  I thought hydrogen (the
>> rebooted system) should see that it is out of date, become
>> secondary and resync itself; instead I'm getting split brain.  The
>> file system on the drbd partition is XFS and is mounted read only, so
>> no one is writing to the partition before, during or after the reboot
>> of hydrogen.
>>
>> Is there a way to print the metadata line (like the one you see in the
>> logs) manually?  I'd like to see if it matches before and after the
>> reboot.  Maybe something is altering the data during shutdown or bootup?
>>
>> Francesco Ciocchetti wrote:
>>     
>>> Is DRBD starting correctly on hydrogen? Do you have a session established
>>> between the nodes (it does not seem so)?
>>> What about the logs? Is there something there that can explain a
>>> situation like this?
>>> What if you try to force the connection and primary state, or to
>>> invalidate the peer?
>>>
>>> bye
>>> Francesco
>>>
>>> David wrote:
>>>  
>>>       
>>>> Francesco Ciocchetti wrote:
>>>>    
>>>>         
>>>>> I'm a newbie with DRBD, but I experienced a problem like yours. In
>>>>> my case the problem was the setting of the following configuration
>>>>> instructions. I had to change the first one to this value to be able
>>>>> to recover from the SB:
>>>>>
>>>>>
>>>>> after-sb-0pri discard-younger-primary;
>>>>> after-sb-1pri consensus;
>>>>> after-sb-2pri disconnect;
>>>>>
>>>>>
>>>>> Bye
>>>>>
>>>>> David wrote:
>>>>>  
>>>>>      
>>>>>           
>>>>>> Before reboot:
>>>>>>
>>>>>> hydrogen:/etc/ha.d # cat /proc/drbd
>>>>>> version: 8.0.1 (api:86/proto:86)
>>>>>> SVN Revision: 2784 build by root at hydrogen, 2007-03-05 08:47:01
>>>>>> 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
>>>>>> ns:264 nr:0 dw:256 dr:580 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>>> act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0
>>>>>>
>>>>>> oxygen:~ # cat /proc/drbd
>>>>>> version: 8.0.1 (api:86/proto:86)
>>>>>> SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
>>>>>> 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
>>>>>> ns:0 nr:264 dw:264 dr:0 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>>> act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
>>>>>>
>>>>>>
>>>>>> During hydrogen reboot:
>>>>>> oxygen:~ # cat /proc/drbd
>>>>>> version: 8.0.1 (api:86/proto:86)
>>>>>> SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
>>>>>> 0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---
>>>>>> ns:0 nr:264 dw:520 dr:316 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>>> act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0
>>>>>>
>>>>>>
>>>>>> Started drbd (no heartbeat) on hydrogen
>>>>>> oxygen:~ # cat /proc/drbd
>>>>>> version: 8.0.1 (api:86/proto:86)
>>>>>> SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
>>>>>> 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r---
>>>>>> ns:0 nr:264 dw:520 dr:316 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>>> act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0
>>>>>>
>>>>>> On hydrogen, I'm seeing:
>>>>>> hydrogen:~ # service drbd start
>>>>>> Starting DRBD resources: [ d0 s0 n0 ].
>>>>>> ..........
>>>>>> ***************************************************************
>>>>>> DRBD's startup script waits for the peer node(s) to appear.
>>>>>> - In case this node was already a degraded cluster before the
>>>>>> reboot the timeout is 0 seconds. [degr-wfc-timeout]
>>>>>> - If the peer was available before the reboot the timeout will
>>>>>> expire after 0 seconds. [wfc-timeout]
>>>>>> (These values are for resource 'r0'; 0 sec -> wait forever)
>>>>>> To abort waiting enter 'yes' [ 520]:
>>>>>>
>>>>>>
>>>>>>
>>>>>> So just starting drbd on hydrogen causes a split brain and causes
>>>>>> oxygen, now the primary, to go into a standalone state. Why is that?
>>>>>> The file system is mounted as a read only file system so no changes
>>>>>> should be taking place. This is not a primary/primary setup so there
>>>>>> is only one "active" node at a time. I was under the impression that
>>>>>> the rebooting node, hydrogen, should see that it is out of date,
>>>>>> become secondary, resync itself with the primary and stay in the
>>>>>> secondary state until that is changed. Am I wrong?
>>>>>>
>>>>>> Both systems are identical:
>>>>>> SLES 10
>>>>>> kernel 2.6.16.27-0.9-bigsmp
>>>>>> drbd 8.0.1 compiled from source
>>>>>>
>>>>>>
>>>>>> Thanks ahead,
>>>>>>
>>>>>> David
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> drbd-user mailing list
>>>>>> drbd-user at lists.linbit.com
>>>>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>>>>>
>>>>
>>>> Thanks for the response.  I'm currently using the settings you talk
>>>> about (sorry, should have included this before):
>>>>
>>>> resource r0 {
>>>>
>>>>  protocol C;
>>>>
>>>>  net {
>>>>    after-sb-0pri discard-younger-primary;
>>>>    after-sb-1pri consensus;
>>>>    after-sb-2pri disconnect;
>>>>  }
>>>>
>>>>  syncer {
>>>>    rate 120M;
>>>>  }
>>>>
>>>>  on hydrogen {
>>>>    device     /dev/drbd0;
>>>>    disk       /dev/sda4;
>>>>    address    172.16.0.2:7788;
>>>>    meta-disk  /dev/sda3[0];
>>>>  }
>>>>
>>>>  on oxygen {
>>>>    device     /dev/drbd0;
>>>>    disk       /dev/sda4;
>>>>    address    172.16.0.1:7788;
>>>>    meta-disk  /dev/sda3[0];
>>>>  }
>>>> }