[DRBD-user] DRBD module crash with KVM and LVM

Thu Feb 19 12:52:47 CET 2015

On Wed, Feb 18, 2015 at 12:19:43PM +0100, Julien Escario wrote:
> Hello,
> We are currently experiencing a strange problem with DRBD:
> A few days ago, DRBD crashed with:
> Jan 21 12:13:40 dedie58 ntpd[926790]: ntp engine ready
> Jan 21 12:13:41 dedie58 kernel: block drbd1: drbd1_receiver[2910]
> Concurrent local write detected!    new: 2253364088s +4096; pending:
> 2253364088s +4096
> Jan 21 12:13:41 dedie58 kernel: block drbd1: Concurrent write! [W
> AFTERWARDS] sec=2253364088s
> 
> It happened on node1 just after I synced time on both nodes (yeah, I
> won't repeat this).

That should only be a coincidence.
DRBD does not care about wall clock time at all.

> At that time, we had had VMs running, spread across both nodes, for
> a few days. No VM was running on both nodes at the same time.

Something wrote to the same sector on both nodes at the same time.
Or you have some sort of memory or other corruption going on
that made it look like that was the case.
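
To illustrate what the log message reports, here is a minimal sketch. It
assumes the "2253364088s +4096" notation means a start sector (in 512-byte
units) plus a length in bytes. The function names are hypothetical and this
is not DRBD's actual code; it only shows the kind of overlap test under
which two in-flight writes would count as conflicting.

```python
SECTOR_SIZE = 512  # assumption: sectors in the log are 512-byte units


def sector_range(start_sector, size_bytes):
    """Return the half-open sector interval [first, end) covered by a write."""
    n_sectors = (size_bytes + SECTOR_SIZE - 1) // SECTOR_SIZE  # round up
    return start_sector, start_sector + n_sectors


def writes_overlap(a, b):
    """True if two (start_sector, size_bytes) writes touch a common sector."""
    a_first, a_end = sector_range(*a)
    b_first, b_end = sector_range(*b)
    return a_first < b_end and b_first < a_end


# Values taken from the quoted log line.
new = (2253364088, 4096)
pending = (2253364088, 4096)
print(writes_overlap(new, pending))  # True
```

With the values from the log, both writes cover the same eight sectors, so
they overlap completely.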

> So we rebooted node1, launched all VMs on node2 and asked for a full
> resync of the DRBD device, which took 5 days (disks are 4 TB, 7.2k rpm).
> 
> So we thought everything was back to normal, and we moved a
> non-important VM back to node1. It ran as expected for about 8 hours and
> finally, the VM crashed around 6:30 PM with the call trace below.
> 
> I checked for fencing: nothing in the logs on either node.
> 
> For the background :
> Two Proxmox 3 hypervisors running Linux KVM VMs with LVM disks over
> a DRBD over a software RAID device.

I have yet to see a properly configured Proxmox "cluster" on top of DRBD.
I've seen a lot of such installations though, which don't care for data
integrity at all, and cannot ever perform.
These are proof-of-concept only, and work as long as nothing breaks,
but don't hold up in real-world failure scenarios.

At least my experience gave me that impression.

> For the versions :
> 
> # cat /proc/drbd
> version: 8.3.13 (api:88/proto:86-96)

Which is very old, and may well contain bugs/races,
or even simply performance bottlenecks that cause you pain.
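
As a small aid for checking this on several nodes, here is a sketch that
extracts the version from the first line of /proc/drbd, assuming the format
shown in the quoted output. The function name and the baseline it is
compared against are hypothetical, not an official support cutoff.

```python
import re


def parse_drbd_version(line):
    """Extract (major, minor, patch) from a /proc/drbd 'version:' line."""
    m = re.search(r"version:\s*(\d+)\.(\d+)\.(\d+)", line)
    if m is None:
        raise ValueError("no DRBD version found in: %r" % line)
    return tuple(int(part) for part in m.groups())


# Line copied from the quoted /proc/drbd output.
line = "version: 8.3.13 (api:88/proto:86-96)"
version = parse_drbd_version(line)

# Hypothetical baseline, chosen only for illustration.
BASELINE = (8, 4, 0)
print(version, "older than baseline:", version < BASELINE)
```

On a live node, the same function could be fed the first line read from
/proc/drbd instead of the hard-coded string.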

Is your overall performance still ok?
Is your raid controller (and its battery) healthy?
Do you even have a raid controller + cache + bbu?

-- 
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA  and  Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.