Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

On Mon, Jan 28, 2013 at 05:57:43PM +0100, Lars Ellenberg wrote:
> On Mon, Jan 28, 2013 at 04:04:32PM +0100, Matthias Hensler wrote:
> > On Mon, Jan 28, 2013 at 03:26:49PM +0100, Felix Frank wrote:
> > > On 01/26/2013 06:19 PM, Matthias Hensler wrote:
> > > > Jan 26 15:32:21 lisa kernel: block drbd12: IO ERROR: neither local nor remote data, sector 0+0
> > > > Jan 26 15:32:21 lisa kernel: block drbd9: IO ERROR: neither local nor remote data, sector 0+0
> > >
> > > I'm not at all sure about this, but this does seem to indicate
> > > that the peer either has disk problems of its own (not likely) or
> > > otherwise cannot access this specific (first?) sector.
> >
> > I do not think that there were any disk problems on the peer. In
> > fact I did reboot all virtual machines on the peer side prior to
> > replacing the failed disk. Everything went smooth from there.
>
> Did you actually see DRBD pass errors up the stack (file systems
> remounting read-only, guest VMs noticing IO error), or have you "only"
> been scared by above log message.

Yes, these errors were definitely passed up the stack. That particular disk
held 7 DRBD devices (minors 8, 9, 10, 11, 12, 16 and 20). On top of those
DRBD devices we had 5 running Linux VMs and one Windows 2008 VM (the VM
using DRBD minor 16 was not active at that moment).

All 5 running Linux machines remounted their filesystems read-only, logging
an I/O error and aborting the journal. The Windows machine seemed to work
fine for a while, but in the end it also stopped running its services. So
yes, the I/O error was passed up for all DRBD devices on that disk.

> If the latter, we may need to do some "cosmetic surgery".
>
> If the former, and this node still had an established connection to a
> healthy peer device, we'd have a "real" bug.

There was an established connection to the peer. I was able to log in with
SSH to most of the Linux machines and with rdesktop to the Windows machine.
All machines were still able to read their (now read-only mounted)
filesystems without any issue while running on the primary host with the
failed disk.

The DRBD disk state was "Diskless/UpToDate" on the primary and
"UpToDate/Diskless" on the secondary side. See the logfiles in the initial
posting. If any more details are needed, just ask.

To recover, I first live-migrated all machines to the secondary host. Since
the filesystems were read-only for all guests, I then rebooted all VMs on
the secondary side (on boot the Linux VMs reported orphaned inodes on the
root filesystem, which I think is expected, but otherwise showed no
problems). The failed hard disk on the primary side was then replaced, DRBD
resynced, and the VMs were migrated back again. My expectation would of
course have been that I did not need to reboot the VMs :)

Essentially, I used the same setup on the affected cluster as on an older
production system. As far as I remember, the older system ran DRBD 8.3.8 and
did not show this behaviour on a disk failure. So it might well be a
regression in DRBD somewhere along the way.

Please let me know if I can help to track this down any further, thanks.

Regards,
Matthias
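For readers following along: the disk and connection states described above can be
inspected with the standard DRBD 8.x tooling, and whether an I/O error on the backing
device is passed up the stack or leads to a transparent detach is governed by the
on-io-error handler in the resource's disk section. This is only a sketch; the resource
name "vm12" is made up, but the commands and the "detach" option are standard DRBD.

    # Query state for all configured resources (DRBD 8.x)
    cat /proc/drbd          # per-minor cs:/ro:/ds: fields
    drbdadm dstate all      # disk states, e.g. "Diskless/UpToDate" on the detached primary
    drbdadm cstate all      # connection state, should stay "Connected" to the healthy peer

    # Excerpt from a resource definition ("vm12" is a hypothetical name):
    resource vm12 {
      disk {
        on-io-error detach;   # drop the failing backing device and go Diskless,
                              # serving I/O from the peer instead of passing errors up
      }
      ...
    }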
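The recovery described above (replace the disk, reattach, resync, migrate the guests
back) would look roughly like this with the standard tools. Resource and domain names
are placeholders, and the libvirt/virsh live migration is an assumption about the setup.

    # On the node where the disk was replaced, for each affected resource:
    drbdadm create-md vm12     # recreate DRBD metadata on the new backing device
    drbdadm attach vm12        # reattach; DRBD resyncs from the UpToDate peer
    drbdadm dstate vm12        # Inconsistent/UpToDate while syncing, then UpToDate/UpToDate

    # Migrate the guests back once the resources are UpToDate again
    # (assuming a libvirt/KVM setup; "guest12" and the peer URI are placeholders):
    virsh migrate --live guest12 qemu+ssh://primary-host/system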