Weird is definitely one way to describe it. I just ran another test,
in which oxygen was primary and hydrogen was secondary:
oxygen:/etc/ha.d # cat /proc/drbd
version: 8.0.1 (api:86/proto:86)
SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
ns:520456 nr:256 dw:521490 dr:521377 al:0 bm:188 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:64942 misses:170 starving:0 dirty:0
changed:170
act_log: used:0/127 hits:52 misses:0 starving:0 dirty:0 changed:0
oxygen:/etc/ha.d # drbdsetup r0 get-gi
05BB4DA9C5CC0319:0000000000000000:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:1:1:0:0
hydrogen:~ # cat /proc/drbd
version: 8.0.1 (api:86/proto:86)
SVN Revision: 2784 build by root at hydrogen, 2007-03-05 08:47:01
0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
ns:0 nr:264 dw:264 dr:0 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
hydrogen:~ # drbdsetup r0 get-gi
05BB4DA9C5CC0318:0000000000000000:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:0:1:0:0
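The two get-gi lines above can be compared mechanically. What follows is only a minimal Python sketch, assuming that the first four colon-separated fields are the current, bitmap and two history UUIDs, and that the lowest bit of a UUID merely records that the node was Primary (which would explain 0319 on oxygen versus 0318 on hydrogen). The names GI, parse_gi and same_generation are made up for illustration, and the trailing flag fields are kept as opaque integers because their meaning is not stated in this thread.

```python
from typing import NamedTuple, Tuple


class GI(NamedTuple):
    """One line of `drbdsetup r0 get-gi` output, split into fields."""
    current: int
    bitmap: int
    history: Tuple[int, int]
    flags: Tuple[int, ...]  # left uninterpreted in this sketch


def parse_gi(line: str) -> GI:
    parts = line.strip().split(":")
    uuids = [int(p, 16) for p in parts[:4]]
    flags = tuple(int(p) for p in parts[4:])
    return GI(uuids[0], uuids[1], (uuids[2], uuids[3]), flags)


def same_generation(a: int, b: int) -> bool:
    # Assumption: bit 0 only marks the Primary role, so it is ignored here.
    return (a & ~1) == (b & ~1)


oxygen = parse_gi(
    "05BB4DA9C5CC0319:0000000000000000:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:1:1:0:0"
)
hydrogen = parse_gi(
    "05BB4DA9C5CC0318:0000000000000000:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:0:1:0:0"
)

current_matches = same_generation(oxygen.current, hydrogen.current)
bitmaps_empty = oxygen.bitmap == 0 and hydrogen.bitmap == 0
history_matches = oxygen.history == hydrogen.history

print(current_matches, bitmaps_empty, history_matches)
```

Under those assumptions, all three checks hold for the output above, which is consistent with both nodes reporting UpToDate/UpToDate while connected.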
I rebooted oxygen. When I start drbd (service drbd start) I end up with:
oxygen:~ # service drbd start
Starting DRBD resources: [ d0 s0 n0 ].
..........
***************************************************************
DRBD's startup script waits for the peer node(s) to appear.
- In case this node was already a degraded cluster before the
reboot the timeout is 0 seconds. [degr-wfc-timeout]
- If the peer was available before the reboot the timeout will
expire after 0 seconds. [wfc-timeout]
(These values are for resource 'r0'; 0 sec -> wait forever)
To abort waiting enter 'yes' [ 461]:yes
(it sits there until I tell it to abort)
And hydrogen gives me:
hydrogen:~ # cat /proc/drbd
version: 8.0.1 (api:86/proto:86)
SVN Revision: 2784 build by root at hydrogen, 2007-03-05 08:47:01
0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r---
ns:0 nr:264 dw:520 dr:316 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0
hydrogen:~ # drbdsetup r0 get-gi
27CB9D1E08B38869:05BB4DA9C5CC0318:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:1:0:0:0
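Comparing this tuple with the one hydrogen reported while still connected shows what changed. The sketch below is illustrative Python, again assuming the field order current:bitmap:history:history and that bit 0 of a UUID only marks the Primary role. The helper names are hypothetical.

```python
def uuids(gi_line):
    """Return the four UUID fields of a get-gi line as integers."""
    return [int(field, 16) for field in gi_line.split(":")[:4]]


def strip_role_bit(uuid):
    # Assumption: the lowest bit only records that the node was Primary.
    return uuid & ~1


# hydrogen while connected as Secondary, and after oxygen was rebooted
before = uuids(
    "05BB4DA9C5CC0318:0000000000000000:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:0:1:0:0"
)
after = uuids(
    "27CB9D1E08B38869:05BB4DA9C5CC0318:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:1:0:0:0"
)

# The old current UUID now sits in the bitmap slot...
rotated = strip_role_bit(after[1]) == strip_role_bit(before[0])
# ...a new current UUID has taken its place...
new_generation = strip_role_bit(after[0]) != strip_role_bit(before[0])
# ...and the two history UUIDs are untouched.
history_unchanged = after[2:] == before[2:]

print(rotated, new_generation, history_unchanged)
```

If that reading of the fields is right, hydrogen started a new data generation once it was running as Primary without its peer. The matching get-gi line from oxygen after its reboot would be needed to tell whether oxygen's identifiers changed as well.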
So it looks like something is not working, but I no longer have any clue as
to what it could be. I'm tempted to go back to the last 0.7.x version,
seeing as I don't need primary/primary (and I could get it to work).
Thanks for your time and help.
David
Francesco Ciocchetti wrote:
> There's something weird happening here ... at least to me.
>
> If you just rebooted one of the servers, it should not cause a split
> brain, just a state change. What I see in your logs is exactly
> what happened to me, which I solved by changing the after-sb-0pri value
> to discard-younger-primary. Are you sure that drbd has been started
> with this option enabled on both nodes?
>
> As far as I can see, after-sb-0pri only controls the split brain when
> both nodes are secondary. Check what the situation is in your case,
> because maybe one node is actually primary when the split brain
> occurs. In that case the behaviour is controlled by after-sb-1pri
> and then by after-sb-2pri, which in your case are "consensus" and
> "disconnect" == StandAlone.
>
> just my 2 cents
>
> bye
> Francesco
>
> David wrote:
>
>> Before the reboot, the two systems see each other and are in sync.
>> When I try to start drbd on hydrogen (which was master) after rebooting
>> it, I get:
>> hydrogen:~ # service drbd start
>> Starting DRBD resources: [ d0 s0 n0 ].
>> ..........
>> ***************************************************************
>> DRBD's startup script waits for the peer node(s) to appear.
>> - In case this node was already a degraded cluster before the
>> reboot the timeout is 0 seconds. [degr-wfc-timeout]
>> - If the peer was available before the reboot the timeout will
>> expire after 0 seconds. [wfc-timeout]
>> (These values are for resource 'r0'; 0 sec -> wait forever)
>> To abort waiting enter 'yes' [ 520]:
>>
>>
>> So right away there is a problem. The logs show drbd complaining
>> about a split brain:
>>
>> Mar 5 17:26:49 hydrogen kernel: drbd0: conn( WFConnection -> WFReportParams )
>> Mar 5 17:26:49 hydrogen kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
>> Mar 5 17:26:49 hydrogen kernel: drbd0: Split-Brain detected, dropping connection!
>> Mar 5 17:26:49 hydrogen kernel: drbd0: self C9710AB94F619A7F:F920CFF31F2A1606:C2B9EF60E881089C:2F33912A597BE6F2
>> Mar 5 17:26:49 hydrogen kernel: drbd0: peer CD986B54BF6D0C8B:F920CFF31F2A1607:C2B9EF60E881089D:2F33912A597BE6F2
>> Mar 5 17:26:49 hydrogen kernel: drbd0: conn( WFReportParams -> Disconnecting )
>> Mar 5 17:26:49 hydrogen kernel: drbd0: error receiving ReportState, l: 4!
>> Mar 5 17:26:49 hydrogen kernel: drbd0: asender terminated
>> Mar 5 17:26:49 hydrogen kernel: drbd0: tl_clear()
>> Mar 5 17:26:49 hydrogen kernel: drbd0: Connection closed
>> Mar 5 17:26:49 hydrogen kernel: drbd0: conn( Disconnecting -> StandAlone )
>> Mar 5 17:26:49 hydrogen kernel: drbd0: receiver terminated
>>
>> At the same time, oxygen (now primary) is logging:
>> Mar 5 17:26:49 oxygen kernel: drbd0: conn( WFConnection -> WFReportParams )
>> Mar 5 17:26:49 oxygen kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
>> Mar 5 17:26:49 oxygen kernel: drbd0: Split-Brain detected, dropping connection!
>> Mar 5 17:26:49 oxygen kernel: drbd0: self CD986B54BF6D0C8B:F920CFF31F2A1607:C2B9EF60E881089D:2F33912A597BE6F2
>> Mar 5 17:26:49 oxygen kernel: drbd0: peer C9710AB94F619A7F:F920CFF31F2A1606:C2B9EF60E881089C:2F33912A597BE6F2
>> Mar 5 17:26:49 oxygen kernel: drbd0: conn( WFReportParams -> Disconnecting )
>> Mar 5 17:26:49 oxygen kernel: drbd0: error receiving ReportState, l: 4!
>> Mar 5 17:26:49 oxygen kernel: drbd0: meta connection shut down by peer.
>> Mar 5 17:26:49 oxygen kernel: drbd0: asender terminated
>> Mar 5 17:26:49 oxygen kernel: drbd0: tl_clear()
>> Mar 5 17:26:49 oxygen kernel: drbd0: Connection closed
>> Mar 5 17:26:49 oxygen kernel: drbd0: conn( Disconnecting -> StandAlone )
>> Mar 5 17:26:49 oxygen kernel: drbd0: receiver terminated
>>
>>
>> At this point I am completely confused. I thought hydrogen (the
>> rebooted system) should see that it is out of date, become
>> secondary and resync itself. Instead I'm getting a split brain. The
>> file system on the drbd partition is XFS and is mounted read-only, so
>> no one is writing to the partition before, during or after the reboot
>> of hydrogen.
>>
>> Is there a way to print the metadata line (like the one you see in the
>> logs) manually? I'd like to see if it matches before and after the
>> reboot. Maybe something is altering the data during shutdown or bootup?
>>
>> Francesco Ciocchetti wrote:
>>
>>> Is DRBD starting correctly on hydrogen? Do you have a session established
>>> between the nodes (it does not seem so)?
>>> What about the logs? Is there something there that can explain a
>>> situation like this?
>>> What if you try to force the connection and primary state, or to
>>> invalidate the peer?
>>>
>>> bye
>>> Francesco
>>>
>>> David wrote:
>>>
>>>
>>>> Francesco Ciocchetti wrote:
>>>>
>>>>
>>>>> I'm a newbie with DRBD, but I experienced a problem like yours.
>>>>> In my case the problem was the setting of the following
>>>>> configuration options.
>>>>>
>>>>> I had to change the first one to this value to be able to recover
>>>>> from the split brain:
>>>>>
>>>>>
>>>>> after-sb-0pri discard-younger-primary;
>>>>> after-sb-1pri consensus;
>>>>> after-sb-2pri disconnect;
>>>>>
>>>>>
>>>>> Bye
>>>>>
>>>>> David wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Before reboot:
>>>>>>
>>>>>> hydrogen:/etc/ha.d # cat /proc/drbd
>>>>>> version: 8.0.1 (api:86/proto:86)
>>>>>> SVN Revision: 2784 build by root at hydrogen, 2007-03-05 08:47:01
>>>>>> 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
>>>>>> ns:264 nr:0 dw:256 dr:580 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>>> act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0
>>>>>>
>>>>>> oxygen:~ # cat /proc/drbd
>>>>>> version: 8.0.1 (api:86/proto:86)
>>>>>> SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
>>>>>> 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
>>>>>> ns:0 nr:264 dw:264 dr:0 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>>> act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
>>>>>>
>>>>>>
>>>>>> During hydrogen reboot:
>>>>>> oxygen:~ # cat /proc/drbd
>>>>>> version: 8.0.1 (api:86/proto:86)
>>>>>> SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
>>>>>> 0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---
>>>>>> ns:0 nr:264 dw:520 dr:316 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>>> act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0
>>>>>>
>>>>>>
>>>>>> Started drbd (no heartbeat) on hydrogen
>>>>>> oxygen:~ # cat /proc/drbd
>>>>>> version: 8.0.1 (api:86/proto:86)
>>>>>> SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
>>>>>> 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r---
>>>>>> ns:0 nr:264 dw:520 dr:316 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>>> act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0
>>>>>>
>>>>>> On hydrogen, I'm seeing:
>>>>>> hydrogen:~ # service drbd start
>>>>>> Starting DRBD resources: [ d0 s0 n0 ].
>>>>>> ..........
>>>>>> ***************************************************************
>>>>>> DRBD's startup script waits for the peer node(s) to appear.
>>>>>> - In case this node was already a degraded cluster before the
>>>>>> reboot the timeout is 0 seconds. [degr-wfc-timeout]
>>>>>> - If the peer was available before the reboot the timeout will
>>>>>> expire after 0 seconds. [wfc-timeout]
>>>>>> (These values are for resource 'r0'; 0 sec -> wait forever)
>>>>>> To abort waiting enter 'yes' [ 520]:
>>>>>>
>>>>>>
>>>>>>
>>>>>> So just starting drbd on hydrogen causes a split brain and makes
>>>>>> oxygen, now the primary, go into a standalone state. Why is that?
>>>>>> The file system is mounted read-only, so no changes should be
>>>>>> taking place. This is not a primary/primary setup, so there is only
>>>>>> one "active" node at a time. I was under the impression that the
>>>>>> rebooting node, hydrogen, should see that it is out of date, become
>>>>>> secondary, resync itself with the primary and stay in the secondary
>>>>>> state until that is changed. Am I wrong?
>>>>>>
>>>>>> Both systems are identical:
>>>>>> SLES 10
>>>>>> kernel 2.6.16.27-0.9-bigsmp
>>>>>> drbd 8.0.1 compiled from source
>>>>>>
>>>>>>
>>>>>> Thanks ahead,
>>>>>>
>>>>>> David
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> drbd-user mailing list
>>>>>> drbd-user at lists.linbit.com
>>>>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>> Thanks for the response. I'm currently using the settings you talk
>>>> about (sorry, should have included this before):
>>>>
>>>> resource r0 {
>>>>
>>>>   protocol C;
>>>>
>>>>   net {
>>>>     after-sb-0pri discard-younger-primary;
>>>>     after-sb-1pri consensus;
>>>>     after-sb-2pri disconnect;
>>>>   }
>>>>
>>>>   syncer {
>>>>     rate 120M;
>>>>   }
>>>>
>>>>   on hydrogen {
>>>>     device /dev/drbd0;
>>>>     disk /dev/sda4;
>>>>     address 172.16.0.2:7788;
>>>>     meta-disk /dev/sda3[0];
>>>>   }
>>>>
>>>>   on oxygen {
>>>>     device /dev/drbd0;
>>>>     disk /dev/sda4;
>>>>     address 172.16.0.1:7788;
>>>>     meta-disk /dev/sda3[0];
>>>>   }
>>>> }
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>