Hi,

I'm sorry to make my first post here a request for help, but I've inherited a system using DRBD with Heartbeat, something strange happened, and I don't have the experience with either to work out how to fix it.

The systems are two Xen instances running Debian Lenny with DRBD 0.7.21 and Heartbeat 2.1.3 installed from the stock Lenny repositories, working as a mailing list server.

About a week ago something weird happened; it may have been caused by a routing issue at the hosting provider, which was detected a few days later. One of the Xen instances, the DRBD secondary (hereafter vm2), shut down unexpectedly overnight, and the following was logged on the primary (hereafter vm1):

Jan 4 18:39:33 lists1 kernel: drbd0: PingAck did not arrive in time.
Jan 4 18:39:33 lists1 kernel: drbd0: drbd0_asender [16239]: cstate Connected --> NetworkFailure
Jan 4 18:39:33 lists1 kernel: drbd0: asender terminated
Jan 4 18:39:33 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate NetworkFailure --> BrokenPipe
Jan 4 18:39:33 lists1 kernel: drbd0: short read expecting header on sock: r=-512
Jan 4 18:39:33 lists1 kernel: drbd0: worker terminated
Jan 4 18:39:33 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate BrokenPipe --> Unconnected
Jan 4 18:39:33 lists1 kernel: drbd0: Connection lost.
Jan 4 18:39:33 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate Unconnected --> WFConnection
Jan 4 18:41:37 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate WFConnection --> WFReportParams
Jan 4 18:41:37 lists1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Jan 4 18:41:37 lists1 kernel: drbd0: incompatible states (both Primary!)
Jan 4 18:41:37 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate WFReportParams --> StandAlone
Jan 4 18:41:37 lists1 kernel: drbd0: error receiving ReportParams, l: 72!
Jan 4 18:41:37 lists1 kernel: drbd0: worker terminated
Jan 4 18:41:37 lists1 kernel: drbd0: asender terminated
Jan 4 18:41:37 lists1 kernel: drbd0: drbd0_receiver [2577]: cstate StandAlone --> StandAlone
Jan 4 18:41:37 lists1 kernel: drbd0: Connection lost.
Jan 4 18:41:37 lists1 kernel: drbd0: receiver terminated

When we brought vm2 (the secondary) up in the morning, it assumed the primary Heartbeat and DRBD role even though vm1 also held the primary role, so I powered vm2 off again.

It seems as though Heartbeat on vm1 had died at some point, so we started it again and it warned about resources already being in use (the Heartbeat-controlled services, Postgres, Postfix, DRBD, Sympa, crond and atd, were already running despite Heartbeat dying). Afterwards we could bring vm2 up without it assuming active Heartbeat status and the DRBD primary role.

Currently we are here (cat /proc/drbd):

vm1:
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@vm136, 2008-08-14 08:57:36
 0: cs:StandAlone st:Primary/Unknown ld:Consistent
    ns:0 nr:0 dw:564848848 dr:244366434 al:640787 bm:72145 lo:0 pe:0 ua:0 ap:0

vm2:
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@vm137, 2008-08-14 08:57:36
 0: cs:WFConnection st:Secondary/Unknown ld:Consistent
    ns:0 nr:0 dw:0 dr:0 al:0 bm:127 lo:0 pe:0 ua:0 ap:0

vm1 is currently running the Heartbeat-controlled services, has the shared IP address, and has the DRBD volume mounted and in use.

When we tell vm1 to connect, we see this:

Jan 12 14:56:45 lists1 kernel: drbd0: drbdsetup [26191]: cstate StandAlone --> Unconnected
Jan 12 14:56:45 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate Unconnected --> WFConnection
Jan 12 14:56:47 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate WFConnection --> WFReportParams
Jan 12 14:56:47 lists1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Jan 12 14:56:47 lists1 kernel: drbd0: Connection established.
Jan 12 14:56:47 lists1 kernel: drbd0: I am(P): 1:00000002:00000003:0000004f:00000008:10
Jan 12 14:56:47 lists1 kernel: drbd0: Peer(S): 1:00000002:00000004:0000004a:00000009:10
Jan 12 14:56:47 lists1 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.
Jan 12 14:56:47 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate WFReportParams --> StandAlone
Jan 12 14:56:47 lists1 kernel: drbd0: error receiving ReportParams, l: 72!
Jan 12 14:56:47 lists1 kernel: drbd0: worker terminated
Jan 12 14:56:47 lists1 kernel: drbd0: asender terminated
Jan 12 14:56:47 lists1 kernel: drbd0: drbd0_receiver [26192]: cstate StandAlone --> StandAlone
Jan 12 14:56:47 lists1 kernel: drbd0: Connection lost.
Jan 12 14:56:47 lists1 kernel: drbd0: receiver terminated

When we tell vm1 to connect and invalidate the remote system, it tells us that this can only be done when it's connected, but as shown above, we can't get it to connect.

I've looked at the documentation and googled some existing mailing list posts, such as this one:

http://archives.free.net.ph/message/20060619.131041.fd07cb48.en.html

but as this is a busy live system and the customer keeps a close eye on it, I'm reluctant to try anything which might lead to lengthy downtime for a restore and a list of explanations and apologies to the customer. I'd prefer to ask for your opinion rather than take a guess at a fix.

Can anybody help? If you need more info, please ask and I will be happy to provide it.

Regards,
Adam Sweet
--
http://blog.adamsweet.org/
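
For anyone reading along, here is a minimal Python sketch for comparing the two counter dumps ("I am(P)" and "Peer(S)") from the Jan 12 log. It assumes, and this is an assumption rather than something stated in the post, that DRBD 0.7 compares these colon-separated values left to right and treats the first difference as decisive. The field names in FIELDS and the helpers parse_gc and first_difference are hypothetical, made up purely for illustration. The sketch only reads log text; it does not touch DRBD.

```python
import re

# Matches the counter dump logged during the handshake, e.g.
# "drbd0: I am(P): 1:00000002:00000003:0000004f:00000008:10"
GC_LINE = re.compile(
    r"(?P<who>I am|Peer)\((?P<role>[PS])\):\s*"
    r"(?P<consistent>[01]):(?P<counters>(?:[0-9a-fA-F]{8}:){4})(?P<flags>\S+)"
)

# ASSUMPTION: these field names are illustrative labels, not from the post.
FIELDS = ("consistent", "human", "timeout", "connected", "arbitrary")


def parse_gc(line):
    """Extract the five leading numeric values from one counter log line."""
    m = GC_LINE.search(line)
    if m is None:
        raise ValueError("not a counter line: %r" % line)
    counters = [int(x, 16) for x in m.group("counters").rstrip(":").split(":")]
    return [int(m.group("consistent"))] + counters


def first_difference(local, peer):
    """Return (field, side_with_higher_value) for the first differing field.

    ASSUMPTION: left-to-right comparison, first difference wins.
    Returns (None, "equal") if all five values match.
    """
    for name, a, b in zip(FIELDS, local, peer):
        if a != b:
            return name, ("local" if a > b else "peer")
    return None, "equal"


local = parse_gc("Jan 12 14:56:47 lists1 kernel: drbd0: "
                 "I am(P): 1:00000002:00000003:0000004f:00000008:10")
peer = parse_gc("Jan 12 14:56:47 lists1 kernel: drbd0: "
                "Peer(S): 1:00000002:00000004:0000004a:00000009:10")

field, winner = first_difference(local, peer)
print(field, winner)  # timeout peer
```

Under that assumption, the first differing value is the third one (3 on vm1, 4 on vm2), so vm2 looks "newer" to DRBD even though vm1 holds the live data. That would be consistent with the "Current Primary shall become sync TARGET!" message above.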