[DRBD-user] DRBD Standalone Primary, can't connect

Adam Sweet adam at adamsweet.org
Thu Jan 21 12:43:10 CET 2010



Martin Gombač wrote:
> I've read your mail very quickly, bear that in mind when reading below.
> If you want to lose data on vm2, put drbd0 on vm2 into secondary (if 
> it isn't already), then on vm2 run drbdadm invalidate resourcenamefordrbd0 
> or drbdadm outdate resourcenamefordrbd0. The latter will discard just the 
> metadata, while invalidate will discard all data on drbd0 on vm2.
> 
> But double check if you really want to discard data on vm2 before doing 
> that. Also don't put drbd0 on vm1 into secondary if you want to preserve 
> its data.
> Invalidate remote won't work while they are disconnected. :-)
> Connect drbd after validating; it should sync successfully later.
> Also double check your drbd.conf for option that keeps the data from the 
> device that became master later. It might not be what you want.

Many thanks for your advice, I drafted a response a number of times as I 
had already tried most of the suggestions you made without success. We 
couldn't get either host to do anything when told to connect to the 
resource and couldn't invalidate vm2's local copy, as it was disconnected.

Eventually we came across an error message on the console when 
(re)starting Heartbeat which said that some resources were already in 
use, along with something about a semaphore. I forget the exact wording, 
as it wasn't logged anywhere. We concluded that some kind of resource 
split had happened.

As a resolution, we shut down vm2 (the secondary), then shut down vm1 
(the primary) so that all resources were released, brought vm1 back up, 
then brought vm2 back up. Everything was OK again, with vm1 as primary and 
vm2 resynced from vm1.
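
For anyone wanting to confirm the same end state on their own nodes, here 
is a minimal sketch in Python that reads the device status line from 
/proc/drbd. It assumes the DRBD 0.7 layout shown in my original message 
quoted below (cs:, st:, ld: fields); the names parse_proc_drbd and 
is_healthy, and the definition of "healthy", are my own and not part of 
DRBD.

```python
import re

# Device status line in the DRBD 0.7 /proc/drbd layout, for example:
#  0: cs:StandAlone st:Primary/Unknown ld:Consistent
STATUS_LINE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+st:(?P<local>[^/\s]+)/(?P<peer>\S+)\s+ld:(?P<ld>\S+)"
)


def parse_proc_drbd(text):
    """Return {minor: {"cs": ..., "local": ..., "peer": ..., "ld": ...}}."""
    devices = {}
    for line in text.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            info = match.groupdict()
            devices[int(info.pop("minor"))] = info
    return devices


def is_healthy(dev):
    # Assumption: "healthy" means connected with a known peer role.
    return dev["cs"] == "Connected" and dev["peer"] != "Unknown"


# Sample taken from the vm1 output in the quoted message. On a live node
# you would pass the contents of /proc/drbd instead.
SAMPLE = """version: 0.7.21 (api:79/proto:74)
 0: cs:StandAlone st:Primary/Unknown ld:Consistent
    ns:0 nr:0 dw:564848848 dr:244366434 al:640787 bm:72145 lo:0 pe:0
"""

for minor, dev in parse_proc_drbd(SAMPLE).items():
    print(minor, dev["cs"], "OK" if is_healthy(dev) else "CHECK")
```

Run against the state we were stuck in, this flags device 0 for 
attention because cs is StandAlone and the peer role is Unknown.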

Thanks for your help, just thought I should let you know to close the 
issue in people's minds.

Regards,

Adam Sweet

-- 

http://blog.adamsweet.org/


> Adam Sweet wrote:
>> Hi
>>
>> I'm sorry to make my first post here a request for help, but I've 
>> inherited a system using DRBD with Heartbeat, something strange 
>> happened and I don't have the experience with either to work out how 
>> to fix it.
>>
>> The systems are two Xen instances running Debian Lenny with DRBD 
>> 0.7.21 and Heartbeat 2.1.3 installed from the stock Lenny repositories, 
>> working as a mailing list server.
>>
>> About a week ago, something weird happened, it may have been caused by 
>> a routing issue at the hosting provider which was detected a few days 
>> later. One of the Xen instances, the DRBD secondary, hereafter called 
>> vm2, shut down unexpectedly overnight and the following was logged on 
>> the primary, hereafter known as vm1:
>>
>> Jan  4 18:39:33 lists1 kernel: drbd0: PingAck did not arrive in time.
>> Jan  4 18:39:33 lists1 kernel: drbd0: drbd0_asender [16239]: cstate 
>> Connected --> NetworkFailure
>> Jan  4 18:39:33 lists1 kernel: drbd0: asender terminated
>> Jan  4 18:39:33 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
>> NetworkFailure --> BrokenPipe
>> Jan  4 18:39:33 lists1 kernel: drbd0: short read expecting header on 
>> sock: r=-512
>> Jan  4 18:39:33 lists1 kernel: drbd0: worker terminated
>> Jan  4 18:39:33 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
>> BrokenPipe --> Unconnected
>> Jan  4 18:39:33 lists1 kernel: drbd0: Connection lost.
>> Jan  4 18:39:33 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
>> Unconnected --> WFConnection
>> Jan  4 18:41:37 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
>> WFConnection --> WFReportParams
>> Jan  4 18:41:37 lists1 kernel: drbd0: Handshake successful: DRBD 
>> Network Protocol version 74
>> Jan  4 18:41:37 lists1 kernel: drbd0: incompatible states (both Primary!)
>> Jan  4 18:41:37 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
>> WFReportParams --> StandAlone
>> Jan  4 18:41:37 lists1 kernel: drbd0: error receiving ReportParams, l: 
>> 72!
>> Jan  4 18:41:37 lists1 kernel: drbd0: worker terminated
>> Jan  4 18:41:37 lists1 kernel: drbd0: asender terminated
>> Jan  4 18:41:37 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
>> StandAlone --> StandAlone
>> Jan  4 18:41:37 lists1 kernel: drbd0: Connection lost.
>> Jan  4 18:41:37 lists1 kernel: drbd0: receiver terminated
>>
>> When we brought vm2 (the secondary) up in the morning it assumed the 
>> primary Heartbeat and DRBD role, even though vm1 also held the primary 
>> role, so I powered vm2 off again. It seems as though Heartbeat on vm1 
>> had died at some point, so we started it again and it warned about 
>> resources already being in use (the heartbeat controlled services, 
>> Postgres, Postfix, DRBD, Sympa, crond and atd were already running, 
>> despite heartbeat dying), though afterwards we could bring vm2 up 
>> without it assuming active heartbeat status and the DRBD primary role.
>>
>> Currently we are here:
>>
>> cat /proc/drbd:
>>
>> vm1:
>>
>> version: 0.7.21 (api:79/proto:74)
>> SVN Revision: 2326 build by root@vm136, 2008-08-14 08:57:36
>>  0: cs:StandAlone st:Primary/Unknown ld:Consistent
>>     ns:0 nr:0 dw:564848848 dr:244366434 al:640787 bm:72145 lo:0 pe:0 
>> ua:0 ap:0
>>
>> vm2:
>>
>> version: 0.7.21 (api:79/proto:74)
>> SVN Revision: 2326 build by root@vm137, 2008-08-14 08:57:36
>>  0: cs:WFConnection st:Secondary/Unknown ld:Consistent
>>     ns:0 nr:0 dw:0 dr:0 al:0 bm:127 lo:0 pe:0 ua:0 ap:0
>>
>> vm1 is currently running the heartbeat controlled services, has the 
>> shared IP address and has the DRBD volume mounted and in use.
>>
>> When we tell vm1 to connect, we see this:
>>
>> Jan 12 14:56:45 lists1 kernel: drbd0: drbdsetup [26191]: cstate 
>> StandAlone --> Unconnected
>> Jan 12 14:56:45 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate 
>> Unconnected --> WFConnection
>> Jan 12 14:56:47 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate 
>> WFConnection --> WFReportParams
>> Jan 12 14:56:47 lists1 kernel: drbd0: Handshake successful: DRBD 
>> Network Protocol version 74
>> Jan 12 14:56:47 lists1 kernel: drbd0: Connection established.
>> Jan 12 14:56:47 lists1 kernel: drbd0: I am(P): 
>> 1:00000002:00000003:0000004f:00000008:10
>> Jan 12 14:56:47 lists1 kernel: drbd0: Peer(S): 
>> 1:00000002:00000004:0000004a:00000009:10
>> Jan 12 14:56:47 lists1 kernel: drbd0: Current Primary shall become 
>> sync TARGET! Aborting to prevent data corruption.
>> Jan 12 14:56:47 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate 
>> WFReportParams --> StandAlone
>> Jan 12 14:56:47 lists1 kernel: drbd0: error receiving ReportParams, l: 
>> 72!
>> Jan 12 14:56:47 lists1 kernel: drbd0: worker terminated
>> Jan 12 14:56:47 lists1 kernel: drbd0: asender terminated
>> Jan 12 14:56:47 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate 
>> StandAlone --> StandAlone
>> Jan 12 14:56:47 lists1 kernel: drbd0: Connection lost.
>> Jan 12 14:56:47 lists1 kernel: drbd0: receiver terminated
>>
>> When we tell vm1 to connect and invalidate the remote system, it tells 
>> us that it can only be done when it's connected, but as above, we 
>> can't get it to connect.
>>
>> I've looked at the documentation and googled some existing mailing 
>> list posts, such as this one:
>>
>> http://archives.free.net.ph/message/20060619.131041.fd07cb48.en.html
>>
>> but as this is a busy live system and the customer keeps a close eye 
>> on it, I'm reluctant to try anything which might lead to some lengthy 
>> downtime for a restore and a list of explanations and apologies to the 
>> customer. I'd prefer to ask for your opinion rather than take a guess 
>> at a fix.
>>
>> Can anybody help? If you need more info, please ask and I will be 
>> happy to provide.
>>
>> Regards,
>>
>> Adam Sweet
>>
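
To follow the connection-state history in kernel logs like those quoted 
above, a small Python sketch can pull out each "cstate A --> B" 
transition in order. It assumes the log text may be mail-wrapped and 
prefixed with ">" quote markers, as it is in this thread; the function 
names are hypothetical and only the "cstate ... --> ..." wording comes 
from the logs themselves.

```python
import re

# Leading mail quote markers such as ">> " at the start of each line.
QUOTE_PREFIX = re.compile(r"(?m)^(?:> ?)+")

# "cstate Connected --> NetworkFailure"; \s+ also spans a wrapped line.
TRANSITION = re.compile(r"cstate\s+(\w+)\s+-->\s+(\w+)")


def unquote(text):
    """Strip leading '>' quote markers from every line."""
    return QUOTE_PREFIX.sub("", text)


def cstate_transitions(text):
    """Return the (from_state, to_state) pairs in the order they appear."""
    return TRANSITION.findall(unquote(text))


# Two entries copied from the quoted log, including the mail wrapping.
SAMPLE = """>> Jan  4 18:41:37 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
>> WFConnection --> WFReportParams
>> Jan  4 18:41:37 lists1 kernel: drbd0: incompatible states (both Primary!)
>> Jan  4 18:41:37 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
>> WFReportParams --> StandAlone
"""

for old, new in cstate_transitions(SAMPLE):
    print(old, "-->", new)
```

Reading the transitions as a sequence makes it easier to see the point 
at which a node dropped to StandAlone.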


