[DRBD-user] DRBD is passing I/O-error to upper layer, but should not

Matthias Hensler lists-drbd at wspse.de
Mon Jan 28 18:20:05 CET 2013

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi,

On Mon, Jan 28, 2013 at 05:57:43PM +0100, Lars Ellenberg wrote:
> On Mon, Jan 28, 2013 at 04:04:32PM +0100, Matthias Hensler wrote:
> > On Mon, Jan 28, 2013 at 03:26:49PM +0100, Felix Frank wrote:
> > > On 01/26/2013 06:19 PM, Matthias Hensler wrote:
> > > > Jan 26 15:32:21 lisa kernel: block drbd12: IO ERROR: neither local nor remote data, sector 0+0
> > > > Jan 26 15:32:21 lisa kernel: block drbd9: IO ERROR: neither local nor remote data, sector 0+0
> > > 
> > > I'm not at all sure about this, but this does seem to indicate
> > > that the peer either has disk problems of its own (not likely) or
> > > otherwise cannot access this specific (first?) sector.
> > 
> > I do not think that there were any disk problems on the peer. In
> > fact I did reboot all virtual machines on the peer side prior to
> > replacing the failed disk. Everything went smooth from there.
> 
> 
> Did you actually see DRBD pass errors up the stack (file systems
> remounting read-only, guest VMs noticing IO error), or have you "only"
> been scared by above log message.

Yes, these errors were definitely passed up the stack. That particular
disk held 7 DRBD devices (minors 8, 9, 10, 11, 12, 16 and 20). On top
of those DRBD devices we had 5 running Linux VMs and one Windows 2008
VM (the VM with DRBD minor 16 was not active at that moment).

All 5 running Linux machines remounted their filesystems read-only
after logging an I/O error and aborting the journal. The Windows
machine seemed to work fine for a while, but eventually stopped its
services as well.

So yes, the I/O error was passed up for all DRBD devices on that disk.

> If the latter, we may need to do some "cosmetic surgery".
> 
> If the former, and this node still had an established connection to a
> healthy peer device, we'd have a "real" bug.

There was an established connection to the peer. I was able to log in
via SSH to most of the Linux machines and via rdesktop to the Windows
machine. All machines could still read their (now read-only mounted)
filesystems without any issue while running on the primary host with
the failed disk.

The disk state reported by DRBD was "Diskless/UpToDate" on the primary
and "UpToDate/Diskless" on the secondary side. See the log files from
the initial posting. If any more details are needed, just ask.
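
For context, the behaviour I expected relies on DRBD's "detach"
handling of local I/O errors: drop the failed backing device and keep
serving I/O transparently from the peer over the replication link. A
minimal sketch of the relevant drbd.conf disk section (the resource
name "vm8" is purely illustrative, and I am writing this from memory
rather than copying the actual config):

    resource vm8 {
      disk {
        # on a local I/O error: detach from the failed backing
        # device and keep serving reads/writes from the peer's disk
        on-io-error detach;
      }
      ...
    }

Given that the devices did go Diskless while the connection stayed up,
the detach itself clearly happened; the problem is only that the error
was still passed on to the guests.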


To recover, I first live-migrated all machines to the secondary host.
Since the filesystems were read-only for all guests, I then rebooted
all VMs on the secondary side (on boot, the Linux VMs reported orphaned
inodes on the root filesystem, which I think is expected, but otherwise
showed no problems). The failed hard disk on the primary side was then
replaced, DRBD resynced, and the VMs were migrated back again.
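
In case the exact steps matter: re-integrating the new disk on the
primary was the standard create-md/attach sequence, roughly as follows
(again, "vm8" stands in for the real resource names):

    # recreate the DRBD meta data on the new backing device
    drbdadm create-md vm8
    # re-attach the local disk; DRBD then resyncs from the peer
    drbdadm attach vm8
    # watch the resync progress
    cat /proc/drbd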


My expectation, of course, would have been that I would not need to
reboot the VMs :)

Essentially, I used the same setup on the affected cluster as on an
older production system. As far as I remember, the older system ran
DRBD 8.3.8 and did not show this behaviour on a disk failure. So it
might well be a regression in DRBD somewhere along the way.

Please let me know if I can help to track this down any further, thanks.

Regards,
Matthias