[DRBD-user] DRBD is passing I/O-error to upper layer, but should not

Matthias Hensler lists-drbd at wspse.de
Sat Jan 26 18:19:44 CET 2013

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi,

given the following setup:

* two identical systems running CentOS 6.3 with kernel
  2.6.32-279.19.1.el6.x86_64 and DRBD 8.3.15 (from elrepo.org)
* dedicated harddisks with only one primary partition used for a LVM
  physical volume, which is the base for a volume group (called "Disk01"
  on the primary, and "Disk02" on the secondary host).
* Each volumegroup has several logical volumes, with each lv connected
  to a different DRBD-device.
* Each DRBD-device is then connected to a Qemu/KVM virtual machine as
  primary storage device.

DRBD is configured, or should be at least, to detach on I/O-errors. From
my understanding in that scenario a failed disk should be completly
transparent to the virtual machines. However, it turns out that this is
not the case. Please help me to understand why, or what is configured
wrong.


What I was seeing today: the disk on the primary host failed completly.
According to the logfiles DRBD did indeed switch to diskless-mode.
However, all virtual-machines that had DRBD-devices on this disk got an
I/O-error as well, resulting in r/o-mount of the filesystems inside the
VMs. All VMs had to be rebooted, and I am still wondering why.


So, lets recap the configuration for one of the affected virtual-machine.

We have:
 * "lisa": hostname of primary-host
 * "bart": hostname of secondary-host
 * "sk_aw_devel": name of the virtualmachine, the DRBD-resource and the
                  logical LVM volume
 * "Disk01": Volumegroup on the primary-host
 * "Disk02": Volumegroup on the secondary-host

So: on lisa we have:
  /dev/sdc1 -> LVM VG "Disk01" -> LVM LV "sk_aw_devel" -> /dev/drbd20 -> Qemu

  on bart, very similar:
  /dev/sdd1 -> LVM VG "Disk02" -> LVM LV "sk_aw_devel" -> /dev/drbd20 -> not connected


/etc/drbd.conf  (on primary and secondary host):
----- Start of /etc/drbd.conf -----
include "drbd.d/global_common.conf";
include "drbd.d/*.res";
----- End of /etc/drbd.conf -----


/etc/drbd.d/global_common.conf  (on primary and secondary):
----- Start of /etc/drbd.d/global_common.conf -----
global {
  usage-count yes;
  minor-count 128;
}

common {
  protocol C;		# always use protocol C

  startup {
    wfc-timeout 90;
    degr-wfc-timeout 30;
    outdated-wfc-timeout 30;
  }

  disk {
    on-io-error detach;
  }

  net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    sndbuf-size 0;
    max-epoch-size 8000;
    max-buffers 8000;

    cram-hmac-alg sha1;
    shared-secret "<- secret ->";

    # Cubbi, 2012/06/30: causes high CPU-load and is not recommended on a
    #                    production system. Better regulary run online-verify
    #data-integrity-alg crc32c;
  }

  syncer {
    verify-alg sha1;
    rate 40M;
    al-extents 3389;   # default is 127, should be a prime number
  }
}
----- End of /etc/drbd.d/global_common.conf -----


/etc/drbd.d/sk_aw_devel.res  (on both sides):
----- Start of /etc/drbd.d/sk_aw_devel.res -----
resource sk_aw_devel {
  meta-disk internal;
  device /dev/drbd20;


  net {
    allow-two-primaries;
  }


  on lisa {
    disk /dev/mapper/Disk01-sk_aw_devel;
    address 10.0.0.1:7820;
  }

  on bart {
    disk /dev/mapper/Disk02-sk_aw_devel;
    address 10.0.1.1:7820;
  }
}
----- End of /etc/drbd.d/sk_aw_devel.res -----


This is the output (shortend) of "drbdadm dump" on the primary host:
----- Start of drbdadm dump -----
# /etc/drbd.conf
global {
    minor-count 128;
}

common {
    protocol               C;
    net {
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    disconnect;
        sndbuf-size        0;
        max-epoch-size   8000;
        max-buffers      8000;
        cram-hmac-alg    sha1;
        shared-secret    <cleared>;
    }
    disk {
        on-io-error      detach;
    }
    syncer {
        verify-alg       sha1;
        rate             40M;
        al-extents       3389;
    }
    startup {
        wfc-timeout       90;
        degr-wfc-timeout  30;
        outdated-wfc-timeout  30;
    }
}

# resource sk_aw_devel on lisa: not ignored, not stacked
resource sk_aw_devel {
    on lisa {
        device           /dev/drbd20 minor 20;
        disk             /dev/mapper/Disk01-sk_aw_devel;
        address          ipv4 10.0.0.1:7820;
        meta-disk        internal;
    }
    on bart {
        device           /dev/drbd20 minor 20;
        disk             /dev/mapper/Disk02-sk_aw_devel;
        address          ipv4 10.0.1.1:7820;
        meta-disk        internal;
    }
    net {
        allow-two-primaries;
    }
}
[...]
----- End of drbdadm dump -----


The corresponding virtual-machine uses /dev/drbd20 for its storage:
----- Start of part of /etc/libvirt/qemu/sk_aw_devel.xml -----
[...]
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/drbd20'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
[...]
----- End of part of /etc/libvirt/qemu/sk_aw_devel.xml -----


Now the logfiles. The problem started as 15:31:21 on the primary host:
Jan 26 15:31:21 lisa kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 26 15:31:21 lisa kernel: ata3.00: failed command: FLUSH CACHE EXT
Jan 26 15:31:21 lisa kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jan 26 15:31:21 lisa kernel:         res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Jan 26 15:31:21 lisa kernel: ata3.00: status: { DRDY }
Jan 26 15:31:21 lisa kernel: ata3: hard resetting link
Jan 26 15:31:26 lisa kernel: ata3: link is slow to respond, please be patient (ready=0)
Jan 26 15:31:31 lisa kernel: ata3: COMRESET failed (errno=-16)
Jan 26 15:31:31 lisa kernel: ata3: hard resetting link
Jan 26 15:31:36 lisa kernel: ata3: link is slow to respond, please be patient (ready=0)
Jan 26 15:31:41 lisa kernel: ata3: COMRESET failed (errno=-16)
Jan 26 15:31:41 lisa kernel: ata3: hard resetting link
Jan 26 15:31:46 lisa kernel: ata3: link is slow to respond, please be patient (ready=0)
Jan 26 15:32:16 lisa kernel: ata3: COMRESET failed (errno=-16)
Jan 26 15:32:16 lisa kernel: ata3: limiting SATA link speed to 1.5 Gbps
Jan 26 15:32:16 lisa kernel: ata3: hard resetting link
Jan 26 15:32:21 lisa kernel: ata3: COMRESET failed (errno=-16)
Jan 26 15:32:21 lisa kernel: ata3: reset failed, giving up
Jan 26 15:32:21 lisa kernel: ata3.00: disabled
Jan 26 15:32:21 lisa kernel: ata3.00: device reported invalid CHS sector 0
Jan 26 15:32:21 lisa kernel: ata3: EH complete
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 60 57 aa b8 00 00 08 00
Jan 26 15:32:21 lisa kernel: block drbd10: local READ IO error sector 751278776+8 on dm-11
Jan 26 15:32:21 lisa kernel: block drbd10: disk( UpToDate -> Failed ) 
Jan 26 15:32:21 lisa kernel: block drbd10: Local IO failed in __req_mod. Detaching...
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 ed 87 9c b0 00 01 00 00
Jan 26 15:32:21 lisa kernel: block drbd20: local WRITE IO error sector 5737648+80 on dm-15
Jan 26 15:32:21 lisa kernel: block drbd20: disk( UpToDate -> Failed ) 
Jan 26 15:32:21 lisa kernel: block drbd20: Local IO failed in __req_mod. Detaching...
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code
Jan 26 15:32:21 lisa kernel: 00 60
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc]  5aResult: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:32:21 lisa kernel: 4d
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB:  20Write(10) 00: 00 2a 70 00 00 4c
Jan 26 15:32:21 lisa kernel: c6 f4 5c 00
Jan 26 15:32:21 lisa kernel: block drbd10: local READ IO error sector 751451424+112 on dm-11
Jan 26 15:32:21 lisa kernel: 03 c0 00
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 60 5a 5a 10 00 00 10 00
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] 
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 ed 93 c0 00 00 00 08 00
Jan 26 15:32:21 lisa kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 64 7f 15 30 00 00 08 00
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 ed c7 d5 b8 00 00 88 00
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 66 64 0a 80 00 00 80 00
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 23 5b 35 10 00 00 08 00
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 593179920
Jan 26 15:32:21 lisa kernel: block drbd12: disk( UpToDate -> Failed ) 
Jan 26 15:32:21 lisa kernel: block drbd12: Local IO failed in __req_mod. Detaching...
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 2981127976
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1288108060
Jan 26 15:32:21 lisa kernel: block drbd9: disk( UpToDate -> Failed ) 
Jan 26 15:32:21 lisa kernel: block drbd9: Local IO failed in __req_mod. Detaching...
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 362774536
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 3992591872
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 4005401800
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 354019696
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 358813736
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 358813888
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1288109060
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 368273424
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 394919584
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1288110084
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 3979594368
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1309972064
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1318585072
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1337646128
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1687792736
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1728323368
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1967019240
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 3985083456
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 3985089856
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 3985957176
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 574321536
Jan 26 15:32:21 lisa kernel: block drbd12: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1286082789
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 4005310816
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 4006853736
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 358813704
Jan 26 15:32:21 lisa kernel: block drbd20: bitmap WRITE of 0 pages took 0 jiffies
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 358813776
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 358813800
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 358813856
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 366970208
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 366971320
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 374926224
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340784640
Jan 26 15:32:21 lisa kernel: block drbd8: disk( UpToDate -> Failed ) 
Jan 26 15:32:21 lisa kernel: block drbd8: Local IO failed in __req_mod. Detaching...
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340784640
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340784640
Jan 26 15:32:21 lisa kernel: block drbd20: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 26 15:32:21 lisa kernel: block drbd20: disk( Failed -> Diskless ) 
Jan 26 15:32:21 lisa kernel: block drbd12: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:32:21 lisa kernel: block drbd9: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 362774560
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 2971150376
Jan 26 15:32:21 lisa kernel: block drbd9: bitmap WRITE of 0 pages took 0 jiffies
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340784640
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340784640
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 157279776
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 158624536
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 160645136
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 185811256
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 185820912
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 185820968
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 206782472
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 269974632
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 273891744
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 273892432
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 273892472
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 274216616
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 274226968
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 274256640
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 274262008
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 275472408
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340784640
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 152584968
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 152584992
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 152585008
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 275472520
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 282806328
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 283335816
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 283335832
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 283335864
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 283336032
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 284498672
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 284498848
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 284498888
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 285716864
Jan 26 15:32:21 lisa kernel: block drbd9: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 26 15:32:21 lisa kernel: block drbd9: disk( Failed -> Diskless ) 
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 366167312
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 285716976
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 285716992
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352796664
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352861736
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352862352
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352862368
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 157246600
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352862632
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352862648
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352870224
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352871128
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352894944
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 353002288
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 353004776
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 353184616
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 353184744
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 2987606016
Jan 26 15:32:21 lisa kernel: block drbd11: disk( UpToDate -> Failed ) 
Jan 26 15:32:21 lisa kernel: block drbd11: Local IO failed in __req_mod. Detaching...
[...]
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 143946712
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 143949520
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 143953480
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340791216
Jan 26 15:32:21 lisa kernel: block drbd8: IO ERROR -5 on bitmap page idx 799
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 2962190144
Jan 26 15:32:21 lisa kernel: block drbd10: IO ERROR -5 on bitmap page idx 2865
Jan 26 15:32:21 lisa kernel: block drbd11: bitmap WRITE of 0 pages took 0 jiffies
Jan 26 15:32:21 lisa kernel: block drbd10: bitmap WRITE of 1 pages took 0 jiffies
Jan 26 15:32:21 lisa kernel: block drbd10: we had at least one MD IO ERROR during bitmap IO
Jan 26 15:32:21 lisa kernel: block drbd11: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 26 15:32:21 lisa kernel: block drbd11: disk( Failed -> Diskless ) 
Jan 26 15:32:21 lisa kernel: block drbd10: 4 KB (1 bits) marked out-of-sync by on disk bit-map.
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 2962231288
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 2962231288
Jan 26 15:32:21 lisa kernel: block drbd10: drbd_md_sync_page_io(,2097151992s,WRITE) failed!
Jan 26 15:32:21 lisa kernel: block drbd10: meta data update failed!
Jan 26 15:32:21 lisa kernel: block drbd10: disk( Failed -> Diskless ) 
Jan 26 15:32:21 lisa kernel: block drbd8: bitmap WRITE of 1 pages took 1 jiffies
Jan 26 15:32:21 lisa kernel: block drbd8: we had at least one MD IO ERROR during bitmap IO
Jan 26 15:32:21 lisa kernel: block drbd8: 4 KB (1 bits) marked out-of-sync by on disk bit-map.
Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340791288
Jan 26 15:32:21 lisa kernel: block drbd8: drbd_md_sync_page_io(,209715192s,WRITE) failed!
Jan 26 15:32:21 lisa kernel: block drbd8: meta data update failed!
Jan 26 15:32:21 lisa kernel: block drbd8: disk( Failed -> Diskless ) 
Jan 26 15:32:21 lisa kernel: block drbd12: bitmap WRITE of 0 pages took 0 jiffies
Jan 26 15:32:21 lisa kernel: block drbd12: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 26 15:32:21 lisa kernel: block drbd12: disk( Failed -> Diskless ) 
Jan 26 15:32:21 lisa kernel: block drbd10: Should have called drbd_al_complete_io(, 423028524), but my Disk seems to have failed :(
Jan 26 15:32:21 lisa kernel: block drbd10: Should have called drbd_al_complete_io(, 423028644), but my Disk seems to have failed :(
Jan 26 15:32:21 lisa kernel: block drbd10: Should have called drbd_al_complete_io(, 423028732), but my Disk seems to have failed :(
Jan 26 15:32:21 lisa kernel: block drbd8: receiver updated UUIDs to effective data uuid: 1A84D3096787250A
Jan 26 15:32:21 lisa kernel: block drbd10: receiver updated UUIDs to effective data uuid: D30492119F43C658
Jan 26 15:32:21 lisa kernel: block drbd12: receiver updated UUIDs to effective data uuid: 7127576F45252CCE
Jan 26 15:32:21 lisa kernel: block drbd20: receiver updated UUIDs to effective data uuid: FCD845AF26021866
Jan 26 15:32:21 lisa kernel: block drbd11: receiver updated UUIDs to effective data uuid: B1491E6E049252D6
Jan 26 15:32:21 lisa kernel: block drbd9: receiver updated UUIDs to effective data uuid: 5FBD52748069ACE8
Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 340791168
Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 340791280
Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 131076096
Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 131076104
Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 131076096
Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 340791168
Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 340791168
Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 340791168
[... continues for a while ...]
Jan 26 15:32:48 lisa kernel: end_request: I/O error, dev sdc, sector 5860532208
Jan 26 15:32:48 lisa kernel: end_request: I/O error, dev sdc, sector 2048
Jan 26 15:32:48 lisa kernel: end_request: I/O error, dev sdc, sector 2056
Jan 26 15:32:48 lisa kernel: end_request: I/O error, dev sdc, sector 2048
Jan 26 15:32:51 lisa kernel: block drbd10: 1 messages suppressed in /home/phil/rpmbuild/BUILD/drbd-8.3.15/drbd/drbd_req.c:959.
Jan 26 15:32:51 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:32:51 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:32:51 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:33:13 lisa kernel: block drbd10: 9 messages suppressed in /home/phil/rpmbuild/BUILD/drbd-8.3.15/drbd/drbd_req.c:959.
Jan 26 15:33:13 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:33:13 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:33:13 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:33:20 lisa kernel: block drbd10: 9 messages suppressed in /home/phil/rpmbuild/BUILD/drbd-8.3.15/drbd/drbd_req.c:959.
Jan 26 15:33:20 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:33:48 lisa kernel: __ratelimit: 82 callbacks suppressed
Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code
Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 14 50 0f 80 00 00 08 00
Jan 26 15:33:48 lisa kernel: __ratelimit: 82 callbacks suppressed
Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code
Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 14 50 0f f0 00 00 08 00
Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code
Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 07 d0 10 00 00 00 08 00
Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code
Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[...]


Messages above will continue, I think you get the idea. At this point
the DRBD is "diskless" on the primary host and "uptodate" on the
secondary host.

The secondary host has logged much less:
Jan 26 15:32:21 bart kernel: block drbd10: pdsk( UpToDate -> Failed ) 
Jan 26 15:32:21 bart kernel: block drbd10: new current UUID D30492119F43C658:2EE09789BCC1DBCC:8085F400424C1059:8084F400424C1059
Jan 26 15:32:21 bart kernel: block drbd12: pdsk( UpToDate -> Failed ) 
Jan 26 15:32:21 bart kernel: block drbd12: new current UUID 7127576F45252CCE:BD7FD0BA1A43FE42:ADAC298F388B40A3:ADAB298F388B40A3
Jan 26 15:32:21 bart kernel: block drbd9: pdsk( UpToDate -> Failed ) 
Jan 26 15:32:21 bart kernel: block drbd9: new current UUID 5FBD52748069ACE8:5B0C1B6E2A3B9B9C:78E9A478FB9F1C15:78E8A478FB9F1C15
Jan 26 15:32:21 bart kernel: block drbd20: pdsk( UpToDate -> Failed ) 
Jan 26 15:32:21 bart kernel: block drbd20: new current UUID FCD845AF26021866:9D3F0E7D50FC1B10:A380AAB511DEC817:A37FAAB511DEC817
Jan 26 15:32:21 bart kernel: block drbd8: pdsk( UpToDate -> Failed ) 
Jan 26 15:32:21 bart kernel: block drbd8: new current UUID 1A84D3096787250A:4D4373FABBD8A3B8:11274BE647BFD1FB:11264BE647BFD1FB
Jan 26 15:32:21 bart kernel: block drbd11: pdsk( UpToDate -> Failed ) 
Jan 26 15:32:21 bart kernel: block drbd11: new current UUID B1491E6E049252D6:AA40938D0A56F594:F0AA427AEE833AB1:F0A9427AEE833AB1
Jan 26 15:32:21 bart kernel: block drbd10: pdsk( Failed -> Diskless ) 
Jan 26 15:32:21 bart kernel: block drbd11: pdsk( Failed -> Diskless ) 
Jan 26 15:32:21 bart kernel: block drbd8: pdsk( Failed -> Diskless ) 
Jan 26 15:32:21 bart kernel: block drbd20: pdsk( Failed -> Diskless ) 
Jan 26 15:32:21 bart kernel: block drbd9: pdsk( Failed -> Diskless ) 
Jan 26 15:32:21 bart kernel: block drbd12: pdsk( Failed -> Diskless ) 


So, DRBD changed from "UpToDate" -> "Failed" -> "Diskless" on the
primary host and continued to serve data from the secondary host. That
is what I expect.

What I did not expect, is that the virtual machine actually noticed this
problem. On the console of one machine I could read "aborting journal"
and all filesystems inside the VM were mounted readonly.

From my understanding that behaviour should only occur on "on-io-error
pass-on", but not for "detach".

And even in the diskless-mode there were a lot of DRBD messages logged
on the primary side, much later:
Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 0
Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 8
Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 0
Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 5860532096
Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 5860532208
Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 2048
Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 2056
Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 2048
Jan 26 15:43:05 lisa kernel: sd 2:0:0:0: [sdc] Synchronizing SCSI cache
Jan 26 15:43:05 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:43:05 lisa kernel: sd 2:0:0:0: [sdc] Stopping disk
Jan 26 15:43:05 lisa kernel: sd 2:0:0:0: [sdc] START_STOP FAILED
Jan 26 15:43:05 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jan 26 15:43:05 lisa kernel: ata3: hard resetting link
Jan 26 15:43:10 lisa kernel: ata3: link is slow to respond, please be patient (ready=0)
Jan 26 15:43:13 lisa kernel: block drbd10: 17 messages suppressed in /home/phil/rpmbuild/BUILD/drbd-8.3.15/drbd/drbd_req.c:959.
Jan 26 15:43:13 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:43:13 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:43:13 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:43:15 lisa kernel: ata3: COMRESET failed (errno=-16)
Jan 26 15:43:15 lisa kernel: ata3: hard resetting link
Jan 26 15:43:20 lisa kernel: ata3: link is slow to respond, please be patient (ready=0)
Jan 26 15:43:21 lisa kernel: block drbd10: 15 messages suppressed in /home/phil/rpmbuild/BUILD/drbd-8.3.15/drbd/drbd_req.c:959.
Jan 26 15:43:21 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:43:25 lisa kernel: ata3: COMRESET failed (errno=-16)
Jan 26 15:43:25 lisa kernel: ata3: hard resetting link
Jan 26 15:43:30 lisa kernel: block drbd10: 17 messages suppressed in /home/phil/rpmbuild/BUILD/drbd-8.3.15/drbd/drbd_req.c:959.
Jan 26 15:43:30 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:43:30 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0
Jan 26 15:43:31 lisa kernel: ata3: link is slow to respond, please be patient (ready=0)
Jan 26 15:44:00 lisa kernel: ata3: COMRESET failed (errno=-16)
Jan 26 15:44:00 lisa kernel: ata3: hard resetting link
Jan 26 15:44:05 lisa kernel: ata3: COMRESET failed (errno=-16)
Jan 26 15:44:05 lisa kernel: ata3: reset failed, giving up
Jan 26 15:44:05 lisa kernel: ata3: EH complete


In summary: why did DRBD pass the local I/O-error to the upper-layer?
What can be done to prevent this?

Of course if any additional information are needed, I am happy to
provide them.

Regards,
Matthias
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 308 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20130126/e90428c2/attachment.pgp>


More information about the drbd-user mailing list