[DRBD-user] StandAlone/WFConnection - drbd 0.7.20
Darren Hoch
darren.hoch at litemail.org
Sun Mar 18 20:32:18 CET 2007
Hello All,
I set up DRBD months ago, using some great recipes from folks like you,
on a two-node cluster running heartbeat. I have not had any problems or
cause for administrative intervention until now. A condition occurred
where the primary node (ha1) logged the following messages:
Mar 16 10:54:36 ha1 kernel: drbd0: drbd0_receiver [2969]: cstate WFConnection --> WFReportParams
Mar 16 10:54:36 ha1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Mar 16 10:54:36 ha1 kernel: drbd0: Connection established.
Mar 16 10:54:36 ha1 kernel: drbd0: I am(P): 1:00000002:00000001:0000018c:00000004:10
Mar 16 10:54:36 ha1 kernel: drbd0: Peer(S): 1:00000002:00000001:0000018d:00000003:10
Mar 16 10:54:36 ha1 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.
Mar 16 10:54:36 ha1 kernel: drbd0: drbd0_receiver [2969]: cstate WFReportParams --> StandAlone
Mar 16 10:54:36 ha1 kernel: drbd0: error receiving ReportParams, l: 72!
Mar 16 10:54:36 ha1 kernel: drbd0: asender terminated
Mar 16 10:54:36 ha1 kernel: drbd0: worker terminated
Mar 16 10:54:36 ha1 kernel: drbd0: drbd0_receiver [2969]: cstate StandAlone --> StandAlone
Mar 16 10:54:36 ha1 kernel: drbd0: Connection lost.
Mar 16 10:54:36 ha1 kernel: drbd0: receiver terminated
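The "I am(P)" and "Peer(S)" lines each carry a colon-separated counter
string, and the two strings in this log are not identical. A minimal
sketch for finding the first field where they differ follows. The helper
name first_diff is hypothetical, and the idea that DRBD weighs these
fields from left to right when picking the sync direction is an
assumption here, not something the log itself states.

```shell
#!/bin/sh
# first_diff LOCAL PEER
# Walk two colon-separated strings field by field and report the first
# field that differs. Hypothetical helper, not part of DRBD.
first_diff() {
    a=$1 b=$2 i=1
    while [ -n "$a" ] || [ -n "$b" ]; do
        fa=${a%%:*}   # leading field of the local string
        fb=${b%%:*}   # leading field of the peer string
        if [ "$fa" != "$fb" ]; then
            echo "field $i: local=$fa peer=$fb"
            return 1
        fi
        # Drop the field just compared; clear the string when no colon is left.
        case "$a" in *:*) a=${a#*:} ;; *) a= ;; esac
        case "$b" in *:*) b=${b#*:} ;; *) b= ;; esac
        i=$((i + 1))
    done
    echo "identical"
    return 0
}
```

Fed the two strings from the log above, first_diff
1:00000002:00000001:0000018c:00000004:10
1:00000002:00000001:0000018d:00000003:10 prints
"field 4: local=0000018c peer=0000018d", i.e. the peer's fourth field is
one higher than the primary's.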
The secondary node went into WFConnection mode:
ha2# drbdadm cstate all
WFConnection
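A disconnected state like this is easy to miss without a periodic check.
As a minimal sketch, the output of "drbdadm cstate all" could be fed to
a small filter such as the one below. The function names are
hypothetical, and treating every state other than Connected as a
problem is an assumption (a node that is legitimately resyncing would
also be flagged).

```shell
#!/bin/sh
# Hypothetical helpers for watching DRBD connection states.

# Succeeds only for the Connected state. Assumption: anything else
# (WFConnection, StandAlone, ...) deserves a human look.
cstate_is_healthy() {
    case "$1" in
        Connected) return 0 ;;
        *) return 1 ;;
    esac
}

# Reads one connection state per line on stdin, prints a message for
# each unhealthy state, and returns 1 if any were found.
check_cstates() {
    rc=0
    while IFS= read -r state; do
        [ -n "$state" ] || continue    # skip blank lines
        if ! cstate_is_healthy "$state"; then
            echo "DRBD not connected: $state"
            rc=1
        fi
    done
    return $rc
}
```

Run from cron on each node as "drbdadm cstate all | check_cstates", a
non-zero exit status could then trigger whatever alerting is already in
place.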
This went on for 6 hours before it was caught, and a good bit of data
was written to the primary disk in StandAlone mode. The only reason it
was caught was that we conducted our monthly failover test with
heartbeat. On the secondary server, we issued:
ha2# ./hb_takeover
It was at this point that we noticed the data on the secondary node was
out of sync and did not contain the recent changes. We flipped the
servers back with:
ha2# ./hb_standby
From what I have read on mailing lists, both drbd devices have been
modified and it is unclear exactly which device should sync to which. In
our case, we wanted the data on the primary (in StandAlone) to sync back
to the secondary. At this point we figured (incorrectly in hindsight;
"Current Primary shall become sync TARGET!" is pretty straightforward)
that we could simply unmount everything on the primary node (/data is
mounted on /dev/drbd0) and restart drbd:
ha1# umount /data
ha1# service drbd restart
This hurt badly, as the secondary node (with the out-of-sync data)
zapped all the changes on the primary. The sync went the wrong way and
we lost a couple of hours of data.
So my questions are:
1) How did this happen?
2) How do we make sure, under all circumstances, that we force the sync
to go the right way? Since we already goofed once on a production
system, we can't afford to have it happen again. I found a posting that
suggested the following commands:
ha2# drbdadm invalidate all
ha1# drbdadm connect all
Will this restart the listener on ha1 and sync ha1 to ha2?
Thanks,
Darren