[DRBD-user] drbd8.3.8 re-sync bug?(fixed some error in grammar and logic...)

Mon Jul 9 09:51:02 CEST 2012

On Thu, Jul 05, 2012 at 06:21:12PM +0800, 郭宇 wrote:
> We have two Red Hat x86_64 virtual servers, used as MySQL database
> servers and called 'master' and 'backup'. Both run drbd8.3.8 and
> heartbeat 2.1.3-3, and each has two DRBD partitions for MySQL's log
> and data.
>
> Our network was unreliable, so heartbeat kept switching the two
> servers' DRBD roles between 'primary' and 'secondary'.
>
> One day a DRBD split brain occurred: both servers considered
> themselves 'primary' at the same time (possibly because of my
> wrong heartbeat configuration). After that, the 'backup' server could
> not be reached by any means (I had been logging the system status
> regularly and the logs looked normal, so this problem still puzzles
> me ...).

So you managed to go into "split brain" once,
and never resolved that.
You had diverging data sets, and no replication.

> Then we rebooted the 'backup' server; heartbeat and DRBD started
> automatically.
>
> Then the problem appeared: the data on the 'master' server, which was
> the DRBD 'primary', was fully synced with the 'backup' server,

I doubt that.
Unless you configured very risky auto split brain recovery policies.

More likely, there was still no replication, and by switching nodes you
simply had your mysql suddenly use a completely different, outdated data set.

> and then
> the 'master' server's DRBD partitions were still mounted and MySQL
> was still running, but the files had completely changed! Lots of my
> data on the 'master' server was lost!!!

> Then I checked /proc/drbd: the 'master' server's status was
> 'Primary/Unknown' and the 'backup' server's status was
> 'Secondary/Unknown' (the 'backup' server's status may have been
> changed by heartbeat).

See.
No replication at all.
So I guess there was no resync either.

You still have two independent versions of your data,
probably dating back to where you had that "split brain".

> So I wonder: is this a bug in drbd8.3.8?

I don't think so.

Rather in your way of using it.

You should have fencing policies and fence-peer handlers, and you need
to help DRBD to resolve data divergence if it should still occur despite
properly configured fencing.

To reiterate:
You need fencing to help avoid data divergence ("split brain").
You need monitoring (beyond the Pacemaker "monitor" action) to recognize
a lost replication link in time.
You may need to help DRBD to resolve data divergence, should it still occur.
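
As an illustration of the monitoring point, here is a minimal Python
sketch that reads the status lines of /proc/drbd and reports every
device whose connection state is not "Connected" or whose disks are not
both "UpToDate". It assumes the 8.3-style line layout with cs:, ro: and
ds: fields. The names parse_proc_drbd, find_problems and SAMPLE are made
up for this example, and SAMPLE is hand-written text, not output
captured from the servers discussed here.

```python
import re

# One status line per DRBD minor, assumed to look like (8.3-style):
#  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
# ro: and ds: are optional so that unconfigured devices still match.
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"(?:\s+ro:(?P<ro>\S+))?(?:\s+ds:(?P<ds>\S+))?"
)


def parse_proc_drbd(text):
    """Return one dict (minor, cs, ro, ds) per status line in the text."""
    resources = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            resources.append(match.groupdict())
    return resources


def find_problems(text):
    """Return a human-readable message for every device needing attention."""
    problems = []
    for res in parse_proc_drbd(text):
        if res["cs"] != "Connected":
            # Deliberately strict: this also flags a running resync
            # (SyncSource/SyncTarget), not only a lost link.
            problems.append(
                "minor %s: connection state is %s (roles: %s)"
                % (res["minor"], res["cs"], res["ro"])
            )
        elif res["ds"] != "UpToDate/UpToDate":
            problems.append(
                "minor %s: disk states are %s" % (res["minor"], res["ds"])
            )
    return problems


# Hand-written sample in the assumed format, for illustration only.
SAMPLE = """\
version: 8.3.8 (api:88/proto:86-94)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----
"""

for message in find_problems(SAMPLE):
    print(message)
```

On a real node you would pass the contents of /proc/drbd instead of
SAMPLE, run the check periodically (from cron, say), and alert on any
message. A state like "Primary/Unknown" would then be noticed when the
link drops, rather than after the next failover.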

	Lars

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com