[DRBD-user] DRBD Standalone Primary, can't connect

Tue Jan 12 16:19:15 CET 2010

Hi

I'm sorry to make my first post here a request for help, but I've 
inherited a system using DRBD with Heartbeat, something strange has 
happened to it, and I don't have enough experience with either to work 
out how to fix it.

The systems are two Xen instances running Debian Lenny with DRBD 0.7.21 
and Heartbeat 2.1.3 installed from the stock Lenny repositories, working 
as a mailing list server.

About a week ago something weird happened, possibly caused by a routing 
issue at the hosting provider which was only detected a few days later. 
One of the Xen instances, the DRBD secondary, hereafter called vm2, shut 
down unexpectedly overnight and the following was logged on the primary, 
hereafter called vm1:

Jan  4 18:39:33 lists1 kernel: drbd0: PingAck did not arrive in time.
Jan  4 18:39:33 lists1 kernel: drbd0: drbd0_asender [16239]: cstate 
Connected --> NetworkFailure
Jan  4 18:39:33 lists1 kernel: drbd0: asender terminated
Jan  4 18:39:33 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
NetworkFailure --> BrokenPipe
Jan  4 18:39:33 lists1 kernel: drbd0: short read expecting header on 
sock: r=-512
Jan  4 18:39:33 lists1 kernel: drbd0: worker terminated
Jan  4 18:39:33 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
BrokenPipe --> Unconnected
Jan  4 18:39:33 lists1 kernel: drbd0: Connection lost.
Jan  4 18:39:33 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
Unconnected --> WFConnection
Jan  4 18:41:37 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
WFConnection --> WFReportParams
Jan  4 18:41:37 lists1 kernel: drbd0: Handshake successful: DRBD Network 
Protocol version 74
Jan  4 18:41:37 lists1 kernel: drbd0: incompatible states (both Primary!)
Jan  4 18:41:37 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
WFReportParams --> StandAlone
Jan  4 18:41:37 lists1 kernel: drbd0: error receiving ReportParams, l: 72!
Jan  4 18:41:37 lists1 kernel: drbd0: worker terminated
Jan  4 18:41:37 lists1 kernel: drbd0: asender terminated
Jan  4 18:41:37 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate 
StandAlone --> StandAlone
Jan  4 18:41:37 lists1 kernel: drbd0: Connection lost.
Jan  4 18:41:37 lists1 kernel: drbd0: receiver terminated

When we brought vm2 (the secondary) up in the morning it assumed the 
primary Heartbeat and DRBD role, even though vm1 also held the primary 
role, so I powered vm2 off again. It seems that Heartbeat on vm1 had 
died at some point, so we started it again. It warned about resources 
already being in use (the Heartbeat-controlled services Postgres, 
Postfix, DRBD, Sympa, crond and atd were already running, despite 
Heartbeat dying), but afterwards we could bring vm2 up without it 
assuming active Heartbeat status and the DRBD primary role.

Currently we are here:

cat /proc/drbd:

vm1:

version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root at vm136, 2008-08-14 08:57:36
  0: cs:StandAlone st:Primary/Unknown ld:Consistent
     ns:0 nr:0 dw:564848848 dr:244366434 al:640787 bm:72145 lo:0 pe:0 
ua:0 ap:0

vm2:

version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root at vm137, 2008-08-14 08:57:36
  0: cs:WFConnection st:Secondary/Unknown ld:Consistent
     ns:0 nr:0 dw:0 dr:0 al:0 bm:127 lo:0 pe:0 ua:0 ap:0

vm1 is currently running the heartbeat controlled services, has the 
shared IP address and has the DRBD volume mounted and in use.
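
In case it helps anyone compare the two nodes, here is a rough Python 
sketch of how I have been reading the device line from /proc/drbd. The 
regular expression and the function name parse_device_line are my own, 
written only against the 0.7.21 output pasted above, so they are an 
assumption and may not match other DRBD versions.

```python
import re

# Assumed layout, taken only from the 0.7.21 output pasted above:
#   0: cs:StandAlone st:Primary/Unknown ld:Consistent
DEVICE_LINE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+st:(?P<local>[^/\s]+)/(?P<peer>\S+)"
    r"\s+ld:(?P<ld>\S+)"
)


def parse_device_line(line):
    """Return minor, connection state, roles and disk state, or None."""
    match = DEVICE_LINE.match(line)
    if match is None:
        return None  # version line, counter line, or an unknown format
    info = match.groupdict()
    info["minor"] = int(info["minor"])
    return info


vm1 = parse_device_line("  0: cs:StandAlone st:Primary/Unknown ld:Consistent")
vm2 = parse_device_line("  0: cs:WFConnection st:Secondary/Unknown ld:Consistent")
print(vm1["cs"], vm1["local"], "|", vm2["cs"], vm2["local"])
```

Read this way, vm1 is StandAlone and Primary while vm2 is waiting for a 
connection as Secondary, and neither side knows the other's role.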

When we tell vm1 to connect, we see this:

Jan 12 14:56:45 lists1 kernel: drbd0: drbdsetup [26191]: cstate 
StandAlone --> Unconnected
Jan 12 14:56:45 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate 
Unconnected --> WFConnection
Jan 12 14:56:47 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate 
WFConnection --> WFReportParams
Jan 12 14:56:47 lists1 kernel: drbd0: Handshake successful: DRBD Network 
Protocol version 74
Jan 12 14:56:47 lists1 kernel: drbd0: Connection established.
Jan 12 14:56:47 lists1 kernel: drbd0: I am(P): 
1:00000002:00000003:0000004f:00000008:10
Jan 12 14:56:47 lists1 kernel: drbd0: Peer(S): 
1:00000002:00000004:0000004a:00000009:10
Jan 12 14:56:47 lists1 kernel: drbd0: Current Primary shall become sync 
TARGET! Aborting to prevent data corruption.
Jan 12 14:56:47 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate 
WFReportParams --> StandAlone
Jan 12 14:56:47 lists1 kernel: drbd0: error receiving ReportParams, l: 72!
Jan 12 14:56:47 lists1 kernel: drbd0: worker terminated
Jan 12 14:56:47 lists1 kernel: drbd0: asender terminated
Jan 12 14:56:47 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate 
StandAlone --> StandAlone
Jan 12 14:56:47 lists1 kernel: drbd0: Connection lost.
Jan 12 14:56:47 lists1 kernel: drbd0: receiver terminated
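
My reading of the "I am(P)" and "Peer(S)" lines, which may well be 
wrong, is that each is a consistency flag, four hexadecimal counters and 
a flags field, and that the counters are compared from left to right. 
The Python sketch below only illustrates that guess. The field layout 
and the comparison rule are assumptions on my part and are not taken 
from the DRBD source, and the function names are my own.

```python
def parse_counters(text):
    """Split a counter string from the log into (flag, counters, flags)."""
    fields = text.strip().split(":")
    if len(fields) != 6:
        raise ValueError("expected 6 colon-separated fields, got %d" % len(fields))
    consistent = int(fields[0])  # assumed: 1 means consistent data
    counters = tuple(int(field, 16) for field in fields[1:5])
    flags = fields[5]  # kept as text, meaning not interpreted here
    return consistent, counters, flags


def newer_side(mine, peer):
    """Guess which side looks newer by comparing counters left to right."""
    my_counters = parse_counters(mine)[1]
    peer_counters = parse_counters(peer)[1]
    if my_counters > peer_counters:
        return "mine"
    if my_counters < peer_counters:
        return "peer"
    return "equal"


vm1 = "1:00000002:00000003:0000004f:00000008:10"  # "I am(P)" from the log
vm2 = "1:00000002:00000004:0000004a:00000009:10"  # "Peer(S)" from the log
print(newer_side(vm1, vm2))  # prints "peer"
```

If that guess is right, the second counter (3 on vm1 against 4 on vm2) 
would be why vm1 thinks vm2 has the newer data and refuses to become the 
sync target.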

When we tell vm1 to connect and invalidate the remote system, it tells 
us that it can only be done when it's connected, but as above, we can't 
get it to connect.

I've looked at the documentation and googled some existing mailing list 
posts, such as this one:

http://archives.free.net.ph/message/20060619.131041.fd07cb48.en.html

but as this is a busy live system and the customer keeps a close eye on 
it, I'm reluctant to try anything which might lead to some lengthy 
downtime for a restore and a list of explanations and apologies to the 
customer. I'd prefer to ask for your opinion rather than take a guess at 
a fix.

Can anybody help? If you need more info, please ask and I will be happy 
to provide.

Regards,

Adam Sweet

-- 

http://blog.adamsweet.org/