Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello All,

I set up DRBD months ago on a 2-node cluster using heartbeat, following some great recipes from folks like you. I have not had any problems or cause for administrative intervention until now.

A condition occurred where the primary node (ha1) generated the following messages:

Mar 16 10:54:36 ha1 kernel: drbd0: drbd0_receiver [2969]: cstate WFConnection --> WFReportParams
Mar 16 10:54:36 ha1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Mar 16 10:54:36 ha1 kernel: drbd0: Connection established.
Mar 16 10:54:36 ha1 kernel: drbd0: I am(P): 1:00000002:00000001:0000018c:00000004:10
Mar 16 10:54:36 ha1 kernel: drbd0: Peer(S): 1:00000002:00000001:0000018d:00000003:10
Mar 16 10:54:36 ha1 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.
Mar 16 10:54:36 ha1 kernel: drbd0: drbd0_receiver [2969]: cstate WFReportParams --> StandAlone
Mar 16 10:54:36 ha1 kernel: drbd0: error receiving ReportParams, l: 72!
Mar 16 10:54:36 ha1 kernel: drbd0: asender terminated
Mar 16 10:54:36 ha1 kernel: drbd0: worker terminated
Mar 16 10:54:36 ha1 kernel: drbd0: drbd0_receiver [2969]: cstate StandAlone --> StandAlone
Mar 16 10:54:36 ha1 kernel: drbd0: Connection lost.
Mar 16 10:54:36 ha1 kernel: drbd0: receiver terminated

The secondary node went into WFConnection mode:

  ha2# drbdadm cstate all
  WFConnection

This went on for six hours before it was caught, and a good bit of data was written to the primary disk while it was in StandAlone mode. The only reason it was caught at all was that we conducted our monthly failover test with heartbeat. On the secondary server we issued:

  ha2# ./hb_takeover

It was at this point that we noticed the data on the secondary node was out of sync and did not contain the recent changes. We flipped the servers back with:

  ha2# ./hb_standby

From what I have read on the mailing lists, both DRBD devices had been modified by this point, so it is unclear exactly which device should sync to which. In our case, we wanted the data on the primary (in StandAlone) to sync back to the secondary. We figured (incorrectly, in hindsight; "Current Primary shall become sync TARGET!" is pretty straightforward) that we could simply unmount everything on the primary node (/data is mounted on /dev/drbd0) and restart DRBD:

  ha1# umount /data
  ha1# service drbd restart

This hurt badly, as the secondary node (with the out-of-sync data) zapped all the changes on the primary. The sync went the wrong way and we lost a couple of hours of data.

So my questions are:

1) How did this happen?
2) How do we make sure, under all circumstances, that the sync is forced to go the right way?

I found a posting that suggested the following commands. Since we have already goofed once on a production system, we can't afford to have it happen again:

  ha2# drbdadm invalidate all
  ha1# drbdadm connect all

Will this restart the listener on ha1 and sync ha1 to ha2?

Thanks,
Darren
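
P.S. In case it helps anyone reading the archives later, here is the full sequence I think we would run next time, pieced together from the posting mentioned above and the drbdadm man page. I have not tested it yet, so please correct anything that is wrong. It assumes we are on DRBD 0.7 (protocol version 74) and that ha2 holds the stale copy we want to throw away:

  # Check both nodes first (Primary/Secondary role, connection state)
  ha1# cat /proc/drbd
  ha2# cat /proc/drbd

  # On the node with the stale data (ha2), mark its copy as inconsistent
  # so it should only ever become the sync TARGET when the nodes reconnect
  ha2# drbdadm invalidate all

  # ha1 is sitting in StandAlone, so tell it to reconnect;
  # ha2 is already in WFConnection and should accept the connection
  ha1# drbdadm connect all

  # Watch the full resync run from ha1 to ha2
  ha1# watch -n1 cat /proc/drbd

If that is not the right order (for example, if invalidate has to be issued while the nodes are already connected), I would appreciate a pointer.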