Weird is definitely one way to describe it. I just ran another test; oxygen was primary, hydrogen was secondary:

oxygen:/etc/ha.d # cat /proc/drbd
version: 8.0.1 (api:86/proto:86)
SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
ns:520456 nr:256 dw:521490 dr:521377 al:0 bm:188 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:64942 misses:170 starving:0 dirty:0 changed:170
act_log: used:0/127 hits:52 misses:0 starving:0 dirty:0 changed:0

oxygen:/etc/ha.d # drbdsetup r0 get-gi
05BB4DA9C5CC0319:0000000000000000:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:1:1:0:0

hydrogen:~ # cat /proc/drbd
version: 8.0.1 (api:86/proto:86)
SVN Revision: 2784 build by root at hydrogen, 2007-03-05 08:47:01
0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
ns:0 nr:264 dw:264 dr:0 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0

hydrogen:~ # drbdsetup r0 get-gi
05BB4DA9C5CC0318:0000000000000000:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:0:1:0:0

I rebooted oxygen; when I start drbd (service drbd start) I end up with:

oxygen:~ # service drbd start
Starting DRBD resources: [ d0 s0 n0 ].
..........
***************************************************************
DRBD's startup script waits for the peer node(s) to appear.
- In case this node was already a degraded cluster before the
  reboot the timeout is 0 seconds. [degr-wfc-timeout]
- If the peer was available before the reboot the timeout will
  expire after 0 seconds. [wfc-timeout]
  (These values are for resource 'r0'; 0 sec -> wait forever)
To abort waiting enter 'yes' [ 461]:yes

(it sits there until I tell it to abort)

And hydrogen gives me:

hydrogen:~ # cat /proc/drbd
version: 8.0.1 (api:86/proto:86)
SVN Revision: 2784 build by root at hydrogen, 2007-03-05 08:47:01
0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r---
ns:0 nr:264 dw:520 dr:316 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0

hydrogen:~ # drbdsetup r0 get-gi
27CB9D1E08B38869:05BB4DA9C5CC0318:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:1:0:0:0

So it looks like something is not working, but I have no clue anymore as to what it could be. I'm tempted to go back to the last 0.7.x version, seeing as I don't need primary/primary (and I could get it to work).

Thanks for your time and help.

David

Francesco Ciocchetti wrote:
> There's something weird happening here ... at least to me.
>
> If you just rebooted one of the servers, a SPLIT BRAIN should not
> happen, just a state change. What I see from your logs is exactly
> what happened to me, which I solved by changing the sb0 value to
> discard-younger-primary. Are you sure that DRBD was started on both
> nodes with this option enabled?
>
> I see that after-sb-0pri only controls the split brain when both of
> the nodes are secondary. Check what the situation is in your case,
> because maybe you have one node that is actually primary when the
> split brain occurs ... in this case the behaviour is controlled by
> after-sb-1pri and next by after-sb-2pri, which in your case are
> "consensus" then "disconnect" == StandAlone.
>
> just my 2 cents
>
> bye
> Francesco
>
> David wrote:
>
>> Before the reboot, the two systems see each other and are in sync.
>> When I try to start drbd on hydrogen (who was master) after rebooting
>> it I get
>> hydrogen:~ # service drbd start
>> Starting DRBD resources: [ d0 s0 n0 ].
>> ..........
>> ***************************************************************
>> DRBD's startup script waits for the peer node(s) to appear.
>> - In case this node was already a degraded cluster before the
>>   reboot the timeout is 0 seconds. [degr-wfc-timeout]
>> - If the peer was available before the reboot the timeout will
>>   expire after 0 seconds. [wfc-timeout]
>>   (These values are for resource 'r0'; 0 sec -> wait forever)
>> To abort waiting enter 'yes' [ 520]:
>>
>> So right away there is a problem. The logs show drbd complaining
>> about a split brain:
>>
>> Mar 5 17:26:49 hydrogen kernel: drbd0: conn( WFConnection -> WFReportParams )
>> Mar 5 17:26:49 hydrogen kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
>> Mar 5 17:26:49 hydrogen kernel: drbd0: Split-Brain detected, dropping connection!
>> Mar 5 17:26:49 hydrogen kernel: drbd0: self C9710AB94F619A7F:F920CFF31F2A1606:C2B9EF60E881089C:2F33912A597BE6F2
>> Mar 5 17:26:49 hydrogen kernel: drbd0: peer CD986B54BF6D0C8B:F920CFF31F2A1607:C2B9EF60E881089D:2F33912A597BE6F2
>> Mar 5 17:26:49 hydrogen kernel: drbd0: conn( WFReportParams -> Disconnecting )
>> Mar 5 17:26:49 hydrogen kernel: drbd0: error receiving ReportState, l: 4!
>> Mar 5 17:26:49 hydrogen kernel: drbd0: asender terminated
>> Mar 5 17:26:49 hydrogen kernel: drbd0: tl_clear()
>> Mar 5 17:26:49 hydrogen kernel: drbd0: Connection closed
>> Mar 5 17:26:49 hydrogen kernel: drbd0: conn( Disconnecting -> StandAlone )
>> Mar 5 17:26:49 hydrogen kernel: drbd0: receiver terminated
>>
>> At the same time, oxygen (now primary) is logging:
>> Mar 5 17:26:49 oxygen kernel: drbd0: conn( WFConnection -> WFReportParams )
>> Mar 5 17:26:49 oxygen kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
>> Mar 5 17:26:49 oxygen kernel: drbd0: Split-Brain detected, dropping connection!
>> Mar 5 17:26:49 oxygen kernel: drbd0: self CD986B54BF6D0C8B:F920CFF31F2A1607:C2B9EF60E881089D:2F33912A597BE6F2
>> Mar 5 17:26:49 oxygen kernel: drbd0: peer C9710AB94F619A7F:F920CFF31F2A1606:C2B9EF60E881089C:2F33912A597BE6F2
>> Mar 5 17:26:49 oxygen kernel: drbd0: conn( WFReportParams -> Disconnecting )
>> Mar 5 17:26:49 oxygen kernel: drbd0: error receiving ReportState, l: 4!
>> Mar 5 17:26:49 oxygen kernel: drbd0: meta connection shut down by peer.
>> Mar 5 17:26:49 oxygen kernel: drbd0: asender terminated
>> Mar 5 17:26:49 oxygen kernel: drbd0: tl_clear()
>> Mar 5 17:26:49 oxygen kernel: drbd0: Connection closed
>> Mar 5 17:26:49 oxygen kernel: drbd0: conn( Disconnecting -> StandAlone )
>> Mar 5 17:26:49 oxygen kernel: drbd0: receiver terminated
>>
>> At this point I am completely confused. I thought hydrogen (the
>> rebooted system) should see that it is out of date, become
>> secondary and resync itself; instead I'm getting split brain. The
>> file system on the drbd partition is XFS and is mounted read only, so
>> no one is writing to the partition before, during or after the reboot
>> of hydrogen.
>>
>> Is there a way to print the metadata line (like the one you see in the
>> logs) manually? I'd like to see if it matches before and after
>> reboot. Maybe something is altering the data during shutdown or bootup?
>>
>> Francesco Ciocchetti wrote:
>>
>>> Is DRBD starting correctly on hydrogen? Do you have a session
>>> established between the nodes (it does not seem so)?
>>> What about the logs? Is there something there that can justify a
>>> situation like this?
>>> What if you try to force the connection and primary state, or to
>>> invalidate the peer?
>>>
>>> bye
>>> Francesco
>>>
>>> David wrote:
>>>
>>>> Francesco Ciocchetti wrote:
>>>>
>>>>> I'm a newbie with DRBD but I experienced a problem like yours. In
>>>>> my case the problem was the setting of the following configuration
>>>>> instructions:
>>>>>
>>>>> I had to change the first one to this value to be able to recover
>>>>> from the SB.
>>>>>
>>>>> after-sb-0pri discard-younger-primary;
>>>>> after-sb-1pri consensus;
>>>>> after-sb-2pri disconnect;
>>>>>
>>>>> Bye
>>>>>
>>>>> David wrote:
>>>>>
>>>>>> Before reboot:
>>>>>>
>>>>>> hydrogen:/etc/ha.d # cat /proc/drbd
>>>>>> version: 8.0.1 (api:86/proto:86)
>>>>>> SVN Revision: 2784 build by root at hydrogen, 2007-03-05 08:47:01
>>>>>> 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
>>>>>> ns:264 nr:0 dw:256 dr:580 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>>> act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0
>>>>>>
>>>>>> oxygen:~ # cat /proc/drbd
>>>>>> version: 8.0.1 (api:86/proto:86)
>>>>>> SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
>>>>>> 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
>>>>>> ns:0 nr:264 dw:264 dr:0 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>>> act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
>>>>>>
>>>>>> During hydrogen reboot:
>>>>>> oxygen:~ # cat /proc/drbd
>>>>>> version: 8.0.1 (api:86/proto:86)
>>>>>> SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
>>>>>> 0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---
>>>>>> ns:0 nr:264 dw:520 dr:316 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>>> act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0
>>>>>>
>>>>>> Started drbd (no heartbeat) on hydrogen
>>>>>> oxygen:~ # cat /proc/drbd
>>>>>> version: 8.0.1 (api:86/proto:86)
>>>>>> SVN Revision: 2784 build by root at oxygen, 2007-03-05 08:43:02
>>>>>> 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r---
>>>>>> ns:0 nr:264 dw:520 dr:316 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
>>>>>> resync: used:0/31 hits:20 misses:2 starving:0 dirty:0 changed:2
>>>>>> act_log: used:0/127 hits:25 misses:0 starving:0 dirty:0 changed:0
>>>>>>
>>>>>> On hydrogen, I'm seeing:
>>>>>> hydrogen:~ # service drbd start
>>>>>> Starting DRBD resources: [ d0 s0 n0 ].
>>>>>> ..........
>>>>>> ***************************************************************
>>>>>> DRBD's startup script waits for the peer node(s) to appear.
>>>>>> - In case this node was already a degraded cluster before the
>>>>>>   reboot the timeout is 0 seconds. [degr-wfc-timeout]
>>>>>> - If the peer was available before the reboot the timeout will
>>>>>>   expire after 0 seconds. [wfc-timeout]
>>>>>>   (These values are for resource 'r0'; 0 sec -> wait forever)
>>>>>> To abort waiting enter 'yes' [ 520]:
>>>>>>
>>>>>> So just starting drbd on hydrogen causes a split brain and causes
>>>>>> oxygen, now the primary, to go into a standalone state. Why is
>>>>>> that? The file system is mounted as a read only file system so no
>>>>>> changes should be taking place. This is not a primary/primary
>>>>>> setup so there is only one "active" node at a time. I was under
>>>>>> the impression that the rebooting node, hydrogen, should see that
>>>>>> it is out of date and become secondary, resync itself with the
>>>>>> primary and stay in the secondary state until that is changed.
>>>>>> Am I wrong?
>>>>>>
>>>>>> Both systems are identical:
>>>>>> SLES 10
>>>>>> kernel 2.6.16.27-0.9-bigsmp
>>>>>> drbd 8.0.1 compiled from source
>>>>>>
>>>>>> Thanks ahead,
>>>>>>
>>>>>> David
>>>>>>
>>>>>> _______________________________________________
>>>>>> drbd-user mailing list
>>>>>> drbd-user at lists.linbit.com
>>>>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>>>>>
>>>>
>>>> Thanks for the response. I'm currently using the settings you talk
>>>> about (sorry, should have included this before):
>>>>
>>>> resource r0 {
>>>>
>>>>   protocol C;
>>>>
>>>>   net {
>>>>     after-sb-0pri discard-younger-primary;
>>>>     after-sb-1pri consensus;
>>>>     after-sb-2pri disconnect;
>>>>   }
>>>>
>>>>   syncer {
>>>>     rate 120M;
>>>>   }
>>>>
>>>>   on hydrogen {
>>>>     device /dev/drbd0;
>>>>     disk /dev/sda4;
>>>>     address 172.16.0.2:7788;
>>>>     meta-disk /dev/sda3[0];
>>>>   }
>>>>
>>>>   on oxygen {
>>>>     device /dev/drbd0;
>>>>     disk /dev/sda4;
>>>>     address 172.16.0.1:7788;
>>>>     meta-disk /dev/sda3[0];
>>>>   }
>>>> }
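
For anyone comparing the generation identifiers quoted in this thread by eye, a small Python sketch may make the differences easier to spot. It is not part of DRBD. It assumes that the first four colon-separated fields of a tuple are the current, bitmap and two history UUIDs, and that the lowest bit of each UUID is a role marker that should be ignored when comparing; both assumptions come from general DRBD 8 documentation, not from anything stated in this thread. The helper names parse_gi, same_generation and compare are made up for illustration.

```python
# Sketch only: field layout and the meaning of the lowest UUID bit are
# assumptions (see the note above), not facts taken from this thread.

def parse_gi(line):
    """Split a GI tuple into its first four UUIDs (as ints) and any trailing flags."""
    fields = line.strip().split(":")
    uuids = [int(f, 16) for f in fields[:4]]
    flags = fields[4:]  # left uninterpreted here
    return uuids, flags

def same_generation(a, b):
    """Compare two UUIDs while ignoring the lowest bit (assumed role marker)."""
    return (a & ~1) == (b & ~1)

def compare(self_gi, peer_gi):
    """Return, per UUID slot, whether both nodes agree."""
    names = ["current", "bitmap", "history1", "history2"]
    self_uuids, _ = parse_gi(self_gi)
    peer_uuids, _ = parse_gi(peer_gi)
    return {n: same_generation(s, p)
            for n, s, p in zip(names, self_uuids, peer_uuids)}

# Tuples copied from the messages above.
gi_oxygen_connected = "05BB4DA9C5CC0319:0000000000000000:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:1:1:0:0"
gi_hydrogen_connected = "05BB4DA9C5CC0318:0000000000000000:BDFD6CFC7D6A6454:0A2D9F22E7800B3B:1:1:0:1:0:0"
log_hydrogen_self = "C9710AB94F619A7F:F920CFF31F2A1606:C2B9EF60E881089C:2F33912A597BE6F2"
log_hydrogen_peer = "CD986B54BF6D0C8B:F920CFF31F2A1607:C2B9EF60E881089D:2F33912A597BE6F2"

if __name__ == "__main__":
    print("connected:  ", compare(gi_oxygen_connected, gi_hydrogen_connected))
    print("split brain:", compare(log_hydrogen_self, log_hydrogen_peer))
```

Under those assumptions, the two tuples taken while the nodes were connected agree in every slot, whereas the pair from the split-brain log lines differs only in the current UUID and agrees in the bitmap and history slots. That would be consistent with each node having started a new data generation from the same earlier one, but the sketch only compares values and does not reproduce DRBD's own decision logic.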