[DRBD-user] Just restarting secondary causes split brain, can someone explain why please?

Francesco Ciocchetti francesco.ciocchetti at kyneste.com
Tue Mar 6 16:50:30 CET 2007


There's something weird happening here ... at least to me.

If you just rebooted one of the servers, that should not cause a SPLIT
BRAIN, just a state change.  What I see in your logs is exactly what
happened to me, and I solved it by changing the after-sb-0pri value to
discard-younger-primary. Are you sure that DRBD was started with this
option enabled on both nodes?
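
To make the two "self" and "peer" lines from the quoted log easier to
compare, here is a small Python sketch that splits them into their four
fields and XORs each pair. The field names are an assumption (the log
does not label them), and compare_uuids is a made-up helper, not part
of any DRBD tool.

```python
# Generation identifiers copied from the "self" and "peer" lines of
# hydrogen's kernel log quoted further down in this message.
SELF = "C9710AB94F619A7F:F920CFF31F2A1606:C2B9EF60E881089C:2F33912A597BE6F2"
PEER = "CD986B54BF6D0C8B:F920CFF31F2A1607:C2B9EF60E881089D:2F33912A597BE6F2"

# Assumed field order; the log itself does not name the fields.
FIELDS = ("current", "bitmap", "history-1", "history-2")


def compare_uuids(self_line, peer_line):
    """Return (field, self, peer, xor) tuples for the two UUID sets."""
    rows = []
    for name, a, b in zip(FIELDS, self_line.split(":"), peer_line.split(":")):
        # XOR is 0 when the two 64-bit values are identical.
        rows.append((name, a, b, int(a, 16) ^ int(b, 16)))
    return rows


if __name__ == "__main__":
    for name, a, b, diff in compare_uuids(SELF, PEER):
        verdict = "same" if diff == 0 else "differs (xor 0x%X)" % diff
        print("%-10s %s %s %s" % (name, a, b, verdict))
```

With the values above, the first field differs completely, the second
and third differ only in the lowest bit, and the fourth is identical.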

As far as I can see, after-sb-0pri only controls split-brain recovery
when both nodes are secondary. Check what the situation is in your
case, because maybe one node is actually primary when the split brain
occurs. In that case the behaviour is controlled by after-sb-1pri, and
next by after-sb-2pri, which in your configuration are "consensus" and
then "disconnect" == StandAlone.

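A minimal Python sketch of that check: it takes the local role from the
st: field of each node's /proc/drbd line and names the after-sb-*
option that would apply. The helper names are made up for illustration,
and hydrogen's role after the reboot is an assumption, since the thread
never shows hydrogen's /proc/drbd at that point.

```python
import re

# Policies from the resource r0 configuration quoted in this thread.
CONFIGURED = {
    "after-sb-0pri": "discard-younger-primary",
    "after-sb-1pri": "consensus",
    "after-sb-2pri": "disconnect",
}


def local_role(status_line):
    """Return the local role from the 'st:<local>/<peer>' field."""
    match = re.search(r"\bst:(\w+)/(\w+)", status_line)
    if match is None:
        raise ValueError("no st: field in %r" % status_line)
    return match.group(1)


def applicable_option(role_a, role_b):
    """Name the after-sb-* option matching the number of primaries."""
    primaries = [role_a, role_b].count("Primary")
    return "after-sb-%dpri" % primaries


if __name__ == "__main__":
    # oxygen's line is copied from the thread; hydrogen's role after the
    # reboot is NOT shown anywhere, so "Secondary" is an assumption.
    oxygen = local_role(
        "0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r---"
    )
    hydrogen = "Secondary"
    option = applicable_option(oxygen, hydrogen)
    print(option, "->", CONFIGURED[option])
```

Under that assumption the script prints "after-sb-1pri -> consensus",
meaning the after-sb-0pri setting would not be the one consulted.
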
just my 2 cents

bye
Francesco

David wrote:
>
> Before the reboot, the two systems see each other and are in sync. 
> When I try to start drbd on hydrogen (who was master) after rebooting
> it I get
> hydrogen:~ # service drbd start
> Starting DRBD resources:    [ d0 s0 n0 ].
> ..........
> ***************************************************************
> DRBD's startup script waits for the peer node(s) to appear.
> - In case this node was already a degraded cluster before the
>  reboot the timeout is 0 seconds. [degr-wfc-timeout]
> - If the peer was available before the reboot the timeout will
>  expire after 0 seconds. [wfc-timeout]
>  (These values are for resource 'r0'; 0 sec -> wait forever)
> To abort waiting enter 'yes' [ 520]:
>
>
> So right away there is a problem.  The logs show drbd complaining
> about a split brain:
>
> Mar  5 17:26:49 hydrogen kernel: drbd0: conn( WFConnection ->
> WFReportParams )
> Mar  5 17:26:49 hydrogen kernel: drbd0: Handshake successful: DRBD
> Network Protocol version 86
> Mar  5 17:26:49 hydrogen kernel: drbd0: Split-Brain detected, dropping
> connection!
> Mar  5 17:26:49 hydrogen kernel: drbd0: self
> C9710AB94F619A7F:F920CFF31F2A1606:C2B9EF60E881089C:2F33912A597BE6F2
> Mar  5 17:26:49 hydrogen kernel: drbd0: peer
> CD986B54BF6D0C8B:F920CFF31F2A1607:C2B9EF60E881089D:2F33912A597BE6F2
> Mar  5 17:26:49 hydrogen kernel: drbd0: conn( WFReportParams ->
> Disconnecting )
> Mar  5 17:26:49 hydrogen kernel: drbd0: error receiving ReportState,
> l: 4!
> Mar  5 17:26:49 hydrogen kernel: drbd0: asender terminated
> Mar  5 17:26:49 hydrogen kernel: drbd0: tl_clear()
> Mar  5 17:26:49 hydrogen kernel: drbd0: Connection closed
> Mar  5 17:26:49 hydrogen kernel: drbd0: conn( Disconnecting ->
> StandAlone )
> Mar  5 17:26:49 hydrogen kernel: drbd0: receiver terminated
>
> At the same time, oxygen (now primary) is logging:
> Mar  5 17:26:49 oxygen kernel: drbd0: conn( WFConnection ->
> WFReportParams )
> Mar  5 17:26:49 oxygen kernel: drbd0: Handshake successful: DRBD
> Network Protocol version 86
> Mar  5 17:26:49 oxygen kernel: drbd0: Split-Brain detected, dropping
> connection!
> Mar  5 17:26:49 oxygen kernel: drbd0: self
> CD986B54BF6D0C8B:F920CFF31F2A1607:C2B9EF60E881089D:2F33912A597BE6F2
> Mar  5 17:26:49 oxygen kernel: drbd0: peer
> C9710AB94F619A7F:F920CFF31F2A1606:C2B9EF60E881089C:2F33912A597BE6F2
> Mar  5 17:26:49 oxygen kernel: drbd0: conn( WFReportParams ->
> Disconnecting )
> Mar  5 17:26:49 oxygen kernel: drbd0: error receiving ReportState, l: 4!
> Mar  5 17:26:49 oxygen kernel: drbd0: meta connection shut down by peer.
> Mar  5 17:26:49 oxygen kernel: drbd0: asender terminated
> Mar  5 17:26:49 oxygen kernel: drbd0: tl_clear()
> Mar  5 17:26:49 oxygen kernel: drbd0: Connection closed
> Mar  5 17:26:49 oxygen kernel: drbd0: conn( Disconnecting -> StandAlone )
> Mar  5 17:26:49 oxygen kernel: drbd0: receiver terminated
>
>
> At this point I am completely confused.  I thought hydrogen (the
> rebooted system) should see that it is out of date, become
> secondary and resync itself; instead I'm getting split brain.  The
> file system on the drbd partition is XFS and is mounted read-only, so
> no one is writing to the partition before, during or after the reboot
> of hydrogen.
>
> Is there a way to print the metadata line (like the one you see in the
> logs) manually?  I'd like to see if it matches before and after
> reboot.  Maybe something is altering the data during shutdown or bootup?
>
> Francesco Ciocchetti wrote:
>> Is DRBD starting correctly on hydrogen? Do you have a session
>> established between the nodes (it does not seem so)?
>> What about the logs? Is there something there that can explain a
>> situation like this?
>> What if you try to force the connection and primary state, or to
>> invalidate the peer?
>>
>> bye
>> Francesco
>>
>> David wrote:
>>  
>>> Francesco Ciocchetti wrote:
>>>    
>>>> I'm a newbie with DRBD, but I experienced a problem like yours.
>>>> In my case the problem was the setting of the following
>>>> configuration instructions.
>>>>
>>>> I had to change the first one to this value to be able to recover
>>>> from the SB.
>>>>
>>>>
>>>> after-sb-0pri discard-younger-primary;
>>>> after-sb-1pri consensus;
>>>> after-sb-2pri disconnect;
>>>>
>>>>
>>>> Bye
>>>>
>>>> David wrote:
>>>>  
>>>>      
>>>>> Before reboot:
>>>>>
>>>>> hydrogen:/etc/ha.d # cat /proc/drbd
>>>>> version: 8.0.1 (api:86/proto:86)
>>>>> SVN Revision: 2784 build by root at hydrogen, 2007-03-05 08:47:01
>>>>> 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
>>>>> ns:264 nr:0 dw:256 dr:580 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>> act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0
>>>>>
>>>>> oxygen:~ # cat /proc/drbd
>>>>> version: 8.0.1 (api:86/proto:86)
>>>>> SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
>>>>> 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
>>>>> ns:0 nr:264 dw:264 dr:0 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>> act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
>>>>>
>>>>>
>>>>> During hydrogen reboot:
>>>>> oxygen:~ # cat /proc/drbd
>>>>> version: 8.0.1 (api:86/proto:86)
>>>>> SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
>>>>> 0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---
>>>>> ns:0 nr:264 dw:520 dr:316 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>> act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0
>>>>>
>>>>>
>>>>> Started drbd (no heartbeat) on hydrogen
>>>>> oxygen:~ # cat /proc/drbd
>>>>> version: 8.0.1 (api:86/proto:86)
>>>>> SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
>>>>> 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r---
>>>>> ns:0 nr:264 dw:520 dr:316 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>> act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0
>>>>>
>>>>> On hydrogen, I'm seeing:
>>>>> hydrogen:~ # service drbd start
>>>>> Starting DRBD resources: [ d0 s0 n0 ].
>>>>> ..........
>>>>> ***************************************************************
>>>>> DRBD's startup script waits for the peer node(s) to appear.
>>>>> - In case this node was already a degraded cluster before the
>>>>> reboot the timeout is 0 seconds. [degr-wfc-timeout]
>>>>> - If the peer was available before the reboot the timeout will
>>>>> expire after 0 seconds. [wfc-timeout]
>>>>> (These values are for resource 'r0'; 0 sec -> wait forever)
>>>>> To abort waiting enter 'yes' [ 520]:
>>>>>
>>>>>
>>>>>
>>>>> So just starting drbd on hydrogen causes a split brain and oxygen,
>>>>> now the primary, to go into a standalone state. Why is that? The
>>>>> file system is mounted as a read only file system so no changes
>>>>> should be taking place. This is not a primary/primary setup so
>>>>> there is only one "active" node at a time. I was under the
>>>>> impression that the rebooting node, hydrogen, should see that it
>>>>> is out of date and become secondary, resync itself with the
>>>>> primary and stay in the secondary state until that is changed?
>>>>> Am I wrong?
>>>>>
>>>>> Both systems are identical:
>>>>> SLES 10
>>>>> kernel 2.6.16.27-0.9-bigsmp
>>>>> drbd 8.0.1 compiled from source
>>>>>
>>>>>
>>>>> Thanks ahead,
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> drbd-user mailing list
>>>>> drbd-user at lists.linbit.com
>>>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>>>>
>>>>>             
>>>>         
>>> Thanks for the response.  I'm currently using the settings you talk
>>> about (sorry, should have included this before):
>>>
>>> resource r0 {
>>>
>>>  protocol C;
>>>
>>>  net {
>>>    after-sb-0pri discard-younger-primary;
>>>    after-sb-1pri consensus;
>>>    after-sb-2pri disconnect;
>>>  }
>>>
>>>  syncer {
>>>    rate 120M;
>>>  }
>>>
>>>  on hydrogen {
>>>    device     /dev/drbd0;
>>>    disk       /dev/sda4;
>>>    address    172.16.0.2:7788;
>>>    meta-disk  /dev/sda3[0];
>>>  }
>>>
>>>  on oxygen {
>>>    device     /dev/drbd0;
>>>    disk       /dev/sda4;
>>>    address    172.16.0.1:7788;
>>>    meta-disk  /dev/sda3[0];
>>>  }
>>> }
>

-- 
==============================================
Francesco Ciocchetti (francesco.ciocchetti at kyneste.com)

Network & Security
Kyneste - Piattaforme ICT
"A company belonging to the Capitalia Banking Group and
subject to the direction and coordination of Capitalia S.p.A."
Via Mario Bianchini 68, 00142 Roma
T.[+39] 06.98402.1 F. [+39] 06.98402.300
http://www.kyneste.com/
==============================================

This message may contain confidential information. If you are not the intended recipient, please inform us immediately by the same means and delete the message, with any attachments, without retaining a copy. Any unauthorized use of the content of this message constitutes a breach of the obligation not to read correspondence between other parties.



