[DRBD-user] StandAlone/WFConnection - drbd 0.7.20

Darren Hoch darren.hoch at litemail.org
Sun Mar 18 20:32:18 CET 2007

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello All,

I set up DRBD months ago, using some great recipes from folks like you,
on a two-node cluster with heartbeat. I have not had any problems or
cause for administrative intervention until now. Then the primary node
(ha1) generated the following messages:

Mar 16 10:54:36 ha1 kernel: drbd0: drbd0_receiver [2969]: cstate 
WFConnection --> WFReportParams
Mar 16 10:54:36 ha1 kernel: drbd0: Handshake successful: DRBD Network 
Protocol version 74
Mar 16 10:54:36 ha1 kernel: drbd0: Connection established.
Mar 16 10:54:36 ha1 kernel: drbd0: I am(P): 
1:00000002:00000001:0000018c:00000004:10
Mar 16 10:54:36 ha1 kernel: drbd0: Peer(S): 
1:00000002:00000001:0000018d:00000003:10
Mar 16 10:54:36 ha1 kernel: drbd0: Current Primary shall become sync 
TARGET! Aborting to prevent data corruption.
Mar 16 10:54:36 ha1 kernel: drbd0: drbd0_receiver [2969]: cstate 
WFReportParams --> StandAlone
Mar 16 10:54:36 ha1 kernel: drbd0: error receiving ReportParams, l: 72!
Mar 16 10:54:36 ha1 kernel: drbd0: asender terminated
Mar 16 10:54:36 ha1 kernel: drbd0: worker terminated
Mar 16 10:54:36 ha1 kernel: drbd0: drbd0_receiver [2969]: cstate 
StandAlone --> StandAlone
Mar 16 10:54:36 ha1 kernel: drbd0: Connection lost.
Mar 16 10:54:36 ha1 kernel: drbd0: receiver terminated

The secondary node went into WFConnection mode:

ha2# drbdadm cstate all
WFConnection
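
In hindsight, a quick look at both the connection state and the roles
on the primary would have shown the mismatch right away, something
like this (if memory serves, the 0.7 drbdadm also has a "state"
subcommand alongside "cstate"):

ha1# cat /proc/drbd          # overall status, connection state and roles
ha1# drbdadm cstate all      # connection state only (StandAlone here)
ha1# drbdadm state all       # local/peer roles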

This situation went on for six hours before it was caught, and a good
bit of data was written to the primary disk while it sat in StandAlone
mode. The only reason it was caught at all was that we conducted our
monthly failover test with heartbeat. On the secondary server, we
issued:

ha2# ./hb_takeover

It was at this point that we noticed the data on the secondary node was 
out of sync and did not contain the recent changes. We flipped the 
servers back with:

ha2# ./hb_standby
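
(As an aside, so that a StandAlone/WFConnection split does not go
unnoticed for hours again, we are thinking about a dumb check to run
from cron along these lines; it is completely untested, assumes a
single resource, and the script name is just a placeholder:)

#!/bin/sh
# check_drbd.sh - alert if the local drbd resource is not Connected
CSTATE=`drbdadm cstate all`
if [ "$CSTATE" != "Connected" ]; then
    echo "drbd cstate on `hostname` is: $CSTATE" \
        | mail -s "drbd not Connected on `hostname`" root
fi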

From what I have read on the mailing lists, both DRBD devices had been
modified, and it is unclear exactly which device should sync to which.
In our case, we wanted the data on the primary (in StandAlone) to sync
back to the secondary. At that point we figured (incorrectly, in
hindsight; "Current Primary shall become sync TARGET!" is pretty
straightforward) that we could simply unmount everything on the
primary node (/data is mounted on /dev/drbd0) and restart drbd:

ha1# umount /data
ha1# service drbd restart
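
As far as we understand it (and we may well be misreading the init
script), that restart amounts to roughly the following, after which
DRBD reconnects and decides the sync direction on its own:

ha1# drbdadm down all    # tear down the device and drop the connection
ha1# drbdadm up all      # reattach and reconnect; sync direction gets decided here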

That restart hurt badly: the secondary node (with the out-of-sync
data) zapped all the changes on the primary. The sync went the wrong
way and we lost a couple of hours of data.

So my questions are:

1) How did this happen?
2) How do we make sure, under all circumstances, that the sync goes
the right way? I found a posting that suggested the following
commands. Since we have already goofed once on a production system,
we can't afford to have it happen again:

ha2# drbdadm invalidate all
ha1# drbdadm connect all

Will this restart the listener on ha1 and sync ha1 to ha2?
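
For completeness, here is the whole sequence as we currently
understand it, with our (possibly wrong) reading of each step in the
comments. ha1 is the node whose data we want to keep; ha2 is the node
whose data we are willing to throw away:

# On ha2: our understanding is that this marks ha2's copy as
# inconsistent, so ha2 can only ever be the sync TARGET on reconnect.
ha2# drbdadm invalidate all

# On ha1: leave StandAlone and reconnect. ha2 is already sitting in
# WFConnection, so the handshake should happen here.
ha1# drbdadm connect all

# Watch the resync direction before trusting it.
ha1# cat /proc/drbd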

Thanks,

Darren




