[DRBD-user] DRBD Standalone Primary, can't connect

Martin Gombač martin at isg.si
Tue Jan 12 17:27:42 CET 2010



I've read your mail very quickly, so bear that in mind when reading
below. If you want to lose the data on vm2, put drbd0 on vm2 into
secondary (if it isn't already), then on vm2 run drbdadm invalidate
resourcenamefordrbd0 or drbdadm outdate resourcenamefordrbd0. The
latter only marks the metadata as outdated, while invalidate discards
all data on drbd0 on vm2 (it will be rewritten by a full resync).

But double check that you really want to discard the data on vm2
before doing that. Also, don't put drbd0 on vm1 into secondary if you
want to preserve its data.
Invalidating the remote side won't work while they are disconnected. :-)
Connect DRBD after invalidating; it should then sync successfully.
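
Roughly, the sequence would look something like this (an untested
sketch; "r0" is just a placeholder for whatever the resource is
actually called in your drbd.conf):

  # on vm2 (the node whose data you are discarding)
  drbdadm secondary r0    # skip if it is already Secondary
  drbdadm invalidate r0   # mark vm2's copy inconsistent, forcing a full resync

  # on vm1 (stays Primary and keeps its data)
  drbdadm connect r0      # vm2 is already in WFConnection, so this should
                          # bring the link up and start the resync
  cat /proc/drbd          # watch for SyncSource/SyncTarget, then Connected
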
Also double check your drbd.conf for any option that keeps the data
from the device that became primary later. It might not be what you
want.
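
For what it's worth, in DRBD 8.x that kind of policy is set in the net
section of drbd.conf (if I remember right, the 0.7 series you appear
to be running doesn't have these and you resolve it manually, as
above). A rough illustration, again with "r0" as a placeholder
resource name:

  resource r0 {
    net {
      # this is the sort of option to look for: it keeps the node that
      # became Primary later and discards the other side's changes
      after-sb-0pri discard-older-primary;
    }
  }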

Regards,
M.


Adam Sweet wrote:
> Hi
>
> I'm sorry to make my first post here a request for help, but I've 
> inherited a system using DRBD with Heartbeat; something strange has 
> happened and I don't have the experience with either to work out how 
> to fix it.
>
> The systems are two Xen instances running Debian Lenny with DRBD 
> 0.7.21 and Heartbeat 2.1.3 installed from the stock Lenny 
> repositories, working as a mailing list server.
>
> About a week ago, something weird happened, it may have been caused by 
> a routing issue at the hosting provider which was detected a few days 
> later. One of the Xen instances, the DRBD secondary, hereafter called 
> vm2, shut down unexpectedly overnight and the following was logged on 
> the primary, hereafter known as vm1:
>
> Jan  4 18:39:33 lists1 kernel: drbd0: PingAck did not arrive in time.
> Jan  4 18:39:33 lists1 kernel: drbd0: drbd0_asender [16239]: cstate 
> Connected --> NetworkFailure
> Jan  4 18:39:33 lists1 kernel: drbd0: asender terminated
> Jan  4 18:39:33 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
> NetworkFailure --> BrokenPipe
> Jan  4 18:39:33 lists1 kernel: drbd0: short read expecting header on 
> sock: r=-512
> Jan  4 18:39:33 lists1 kernel: drbd0: worker terminated
> Jan  4 18:39:33 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
> BrokenPipe --> Unconnected
> Jan  4 18:39:33 lists1 kernel: drbd0: Connection lost.
> Jan  4 18:39:33 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
> Unconnected --> WFConnection
> Jan  4 18:41:37 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
> WFConnection --> WFReportParams
> Jan  4 18:41:37 lists1 kernel: drbd0: Handshake successful: DRBD 
> Network Protocol version 74
> Jan  4 18:41:37 lists1 kernel: drbd0: incompatible states (both Primary!)
> Jan  4 18:41:37 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
> WFReportParams --> StandAlone
> Jan  4 18:41:37 lists1 kernel: drbd0: error receiving ReportParams, l: 
> 72!
> Jan  4 18:41:37 lists1 kernel: drbd0: worker terminated
> Jan  4 18:41:37 lists1 kernel: drbd0: asender terminated
> Jan  4 18:41:37 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
> StandAlone --> StandAlone
> Jan  4 18:41:37 lists1 kernel: drbd0: Connection lost.
> Jan  4 18:41:37 lists1 kernel: drbd0: receiver terminated
>
> When we brought vm2 (the secondary) up in the morning, it assumed the 
> primary Heartbeat and DRBD roles, even though vm1 also held the 
> primary role, so I powered vm2 off again. It seems as though Heartbeat 
> on vm1 had died at some point, so we started it again and it warned 
> about resources already being in use (the Heartbeat-controlled 
> services Postgres, Postfix, DRBD, Sympa, crond and atd were already 
> running, despite Heartbeat having died). Afterwards we could bring 
> vm2 up without it assuming active Heartbeat status or the DRBD 
> primary role.
>
> Currently we are here:
>
> cat /proc/drbd:
>
> vm1:
>
> version: 0.7.21 (api:79/proto:74)
> SVN Revision: 2326 build by root at vm136, 2008-08-14 08:57:36
>  0: cs:StandAlone st:Primary/Unknown ld:Consistent
>     ns:0 nr:0 dw:564848848 dr:244366434 al:640787 bm:72145 lo:0 pe:0 
> ua:0 ap:0
>
> vm2:
>
> version: 0.7.21 (api:79/proto:74)
> SVN Revision: 2326 build by root at vm137, 2008-08-14 08:57:36
>  0: cs:WFConnection st:Secondary/Unknown ld:Consistent
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:127 lo:0 pe:0 ua:0 ap:0
>
> vm1 is currently running the heartbeat controlled services, has the 
> shared IP address and has the DRBD volume mounted and in use.
>
> When we tell vm1 to connect, we see this:
>
> Jan 12 14:56:45 lists1 kernel: drbd0: drbdsetup [26191]: cstate 
> StandAlone --> Unconnected
> Jan 12 14:56:45 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate 
> Unconnected --> WFConnection
> Jan 12 14:56:47 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate 
> WFConnection --> WFReportParams
> Jan 12 14:56:47 lists1 kernel: drbd0: Handshake successful: DRBD 
> Network Protocol version 74
> Jan 12 14:56:47 lists1 kernel: drbd0: Connection established.
> Jan 12 14:56:47 lists1 kernel: drbd0: I am(P): 
> 1:00000002:00000003:0000004f:00000008:10
> Jan 12 14:56:47 lists1 kernel: drbd0: Peer(S): 
> 1:00000002:00000004:0000004a:00000009:10
> Jan 12 14:56:47 lists1 kernel: drbd0: Current Primary shall become 
> sync TARGET! Aborting to prevent data corruption.
> Jan 12 14:56:47 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate 
> WFReportParams --> StandAlone
> Jan 12 14:56:47 lists1 kernel: drbd0: error receiving ReportParams, l: 
> 72!
> Jan 12 14:56:47 lists1 kernel: drbd0: worker terminated
> Jan 12 14:56:47 lists1 kernel: drbd0: asender terminated
> Jan 12 14:56:47 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate 
> StandAlone --> StandAlone
> Jan 12 14:56:47 lists1 kernel: drbd0: Connection lost.
> Jan 12 14:56:47 lists1 kernel: drbd0: receiver terminated
>
> When we tell vm1 to connect and invalidate the remote system, it tells 
> us that it can only be done when it's connected, but as above, we 
> can't get it to connect.
>
> I've looked at the documentation and googled some existing mailing 
> list posts, such as this one:
>
> http://archives.free.net.ph/message/20060619.131041.fd07cb48.en.html
>
> but as this is a busy live system and the customer keeps a close eye 
> on it, I'm reluctant to try anything which might lead to some lengthy 
> downtime for a restore and a list of explanations and apologies to the 
> customer. I'd prefer to ask for your opinion rather than take a guess 
> at a fix.
>
> Can anybody help? If you need more info, please ask and I will be 
> happy to provide.
>
> Regards,
>
> Adam Sweet
>


