[DRBD-user] Replication problems constants with DRBD 8.3.10

Lars Ellenberg lars.ellenberg at linbit.com
Fri Jun 28 11:46:53 CEST 2013

On Sat, Jun 15, 2013 at 01:29:40AM -0700, cesar wrote:
> Hello everyone
> *Please Urgent, my servers are in production*
> I am in a serious problem and need help


On Sun, Jun 16, 2013 at 02:13:45PM -0700, cesar wrote:
> About the call to linbit, I don't need to fix this immediately (thanks Digimer
> for your interest :-)

So what is it now? Urgent, or just "curious"? ;-)

> Jun 14 08:50:12 kvm5 kernel: block drbd0:
> Digest mismatch, buffer modified by upper layers during write: 21158352s +4096

There it is. In plain English.
  *upper layers*
  *while in flight*

   App writes

   DRBD sees that write, starts to process it,
        submits to disk,
        calculates first checksum,
        sends to tcp

   App keeps modifying the write buffers

   DRBD calculates second checksum,
        checksum mismatch :-(
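
To make that sequence concrete, here is a minimal sketch in Python.
It is not DRBD code, just a toy model of the race: the function names
(digest, replicate_write) and the choice of SHA-1 are assumptions made
for illustration only.

```python
import hashlib


def digest(buf: bytearray) -> str:
    """Checksum over the current contents of the write buffer."""
    return hashlib.sha1(bytes(buf)).hexdigest()


def replicate_write(buf: bytearray, modified_in_flight: bool) -> bool:
    """Return True if both checksums agree, False on a digest mismatch.

    Hypothetical model: the replication layer checksums the buffer once
    when it starts processing the write and once again later. If the
    "upper layer" touches the buffer in between, the two values differ.
    """
    first = digest(buf)          # checksum taken when the write is processed

    if modified_in_flight:
        buf[0] ^= 0xFF           # the application keeps modifying the buffer

    second = digest(buf)         # checksum taken again later
    return first == second


if __name__ == "__main__":
    print(replicate_write(bytearray(b"x" * 4096), modified_in_flight=False))
    print(replicate_write(bytearray(b"x" * 4096), modified_in_flight=True))
```

Note that nothing in this model involves faulty hardware: the data is
simply a moving target while it is being checksummed.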

So changing hardware is very unlikely to help.
(even though similar/the same symptoms can also be caused by bad hardware).

This has been discussed on the mailing list before
(several times; some of those threads are years old).

Not much DRBD can do here.

If you tell it to be strict with those additional checksums,
it will disconnect, and try to reconnect.
Without fencing, in a dual-primary setup, the disconnect
immediately leads to data divergence, aka split-brain.
With fencing, you typically hard-reset at least one node in this case.

With special purpose built fencing handlers,
we may be able to fix your setup so it will freeze IO during the
disconnected period, reconnect, and replay pending buffers,
without any reset.

With what you call "manual fencing", you need to shell in there
and clean up the mess manually, every time this happens.

Or tell DRBD not to notice what those "upper layers" are doing,
hoping that whatever it is, it is supposed to be valid.

Or just don't do dual-primary.

Better yet:
fix those "upper layers" to not do what they are doing,
enable the checksum if it makes you feel good,
do single-primary anyways,
and still add fencing.

: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
