Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, given the following setup: * two identical systems running CentOS 6.3 with kernel 2.6.32-279.19.1.el6.x86_64 and DRBD 8.3.15 (from elrepo.org) * dedicated harddisks with only one primary partition used for a LVM physical volume, which is the base for a volume group (called "Disk01" on the primary, and "Disk02" on the secondary host). * Each volumegroup has several logical volumes, with each lv connected to a different DRBD-device. * Each DRBD-device is then connected to a Qemu/KVM virtual machine as primary storage device. DRBD is configured, or should be at least, to detach on I/O-errors. From my understanding in that scenario a failed disk should be completly transparent to the virtual machines. However, it turns out that this is not the case. Please help me to understand why, or what is configured wrong. What I was seeing today: the disk on the primary host failed completly. According to the logfiles DRBD did indeed switch to diskless-mode. However, all virtual-machines that had DRBD-devices on this disk got an I/O-error as well, resulting in r/o-mount of the filesystems inside the VMs. All VMs had to be rebooted, and I am still wondering why. So, lets recap the configuration for one of the affected virtual-machine. We have: * "lisa": hostname of primary-host * "bart": hostname of secondary-host * "sk_aw_devel": name of the virtualmachine, the DRBD-resource and the logical LVM volume * "Disk01": Volumegroup on the primary-host * "Disk02": Volumegroup on the secondary-host So: on lisa we have: /dev/sdc1 -> LVM VG "Disk01" -> LVM LV "sk_aw_devel" -> /dev/drbd20 -> Qemu on bart, very similar: /dev/sdd1 -> LVM VG "Disk02" -> LVM LV "sk_aw_devel" -> /dev/drbd20 -> not connected /etc/drbd.conf (on primary and secondary host): ----- Start of /etc/drbd.conf ----- include "drbd.d/global_common.conf"; include "drbd.d/*.res"; ----- End of /etc/drbd.conf ----- /etc/drbd.d/global_common.conf (on primary and secondary): ----- Start of /etc/drbd.d/global_common.conf ----- global { usage-count yes; minor-count 128; } common { protocol C; # always use protocol C startup { wfc-timeout 90; degr-wfc-timeout 30; outdated-wfc-timeout 30; } disk { on-io-error detach; } net { after-sb-0pri discard-zero-changes; after-sb-1pri discard-secondary; after-sb-2pri disconnect; sndbuf-size 0; max-epoch-size 8000; max-buffers 8000; cram-hmac-alg sha1; shared-secret "<- secret ->"; # Cubbi, 2012/06/30: causes high CPU-load and is not recommended on a # production system. Better regulary run online-verify #data-integrity-alg crc32c; } syncer { verify-alg sha1; rate 40M; al-extents 3389; # default is 127, should be a prime number } } ----- End of /etc/drbd.d/global_common.conf ----- /etc/drbd.d/sk_aw_devel.res (on both sides): ----- Start of /etc/drbd.d/sk_aw_devel.res ----- resource sk_aw_devel { meta-disk internal; device /dev/drbd20; net { allow-two-primaries; } on lisa { disk /dev/mapper/Disk01-sk_aw_devel; address 10.0.0.1:7820; } on bart { disk /dev/mapper/Disk02-sk_aw_devel; address 10.0.1.1:7820; } } ----- End of /etc/drbd.d/sk_aw_devel.res ----- This is the output (shortend) of "drbdadm dump" on the primary host: ----- Start of drbdadm dump ----- # /etc/drbd.conf global { minor-count 128; } common { protocol C; net { after-sb-0pri discard-zero-changes; after-sb-1pri discard-secondary; after-sb-2pri disconnect; sndbuf-size 0; max-epoch-size 8000; max-buffers 8000; cram-hmac-alg sha1; shared-secret <cleared>; } disk { on-io-error detach; } syncer { verify-alg sha1; rate 40M; al-extents 3389; } startup { wfc-timeout 90; degr-wfc-timeout 30; outdated-wfc-timeout 30; } } # resource sk_aw_devel on lisa: not ignored, not stacked resource sk_aw_devel { on lisa { device /dev/drbd20 minor 20; disk /dev/mapper/Disk01-sk_aw_devel; address ipv4 10.0.0.1:7820; meta-disk internal; } on bart { device /dev/drbd20 minor 20; disk /dev/mapper/Disk02-sk_aw_devel; address ipv4 10.0.1.1:7820; meta-disk internal; } net { allow-two-primaries; } } [...] ----- End of drbdadm dump ----- The corresponding virtual-machine uses /dev/drbd20 for its storage: ----- Start of part of /etc/libvirt/qemu/sk_aw_devel.xml ----- [...] <disk type='block' device='disk'> <driver name='qemu' type='raw' cache='none' io='native'/> <source dev='/dev/drbd20'/> <target dev='vda' bus='virtio'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> </disk> [...] ----- End of part of /etc/libvirt/qemu/sk_aw_devel.xml ----- Now the logfiles. The problem started as 15:31:21 on the primary host: Jan 26 15:31:21 lisa kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen Jan 26 15:31:21 lisa kernel: ata3.00: failed command: FLUSH CACHE EXT Jan 26 15:31:21 lisa kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 Jan 26 15:31:21 lisa kernel: res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout) Jan 26 15:31:21 lisa kernel: ata3.00: status: { DRDY } Jan 26 15:31:21 lisa kernel: ata3: hard resetting link Jan 26 15:31:26 lisa kernel: ata3: link is slow to respond, please be patient (ready=0) Jan 26 15:31:31 lisa kernel: ata3: COMRESET failed (errno=-16) Jan 26 15:31:31 lisa kernel: ata3: hard resetting link Jan 26 15:31:36 lisa kernel: ata3: link is slow to respond, please be patient (ready=0) Jan 26 15:31:41 lisa kernel: ata3: COMRESET failed (errno=-16) Jan 26 15:31:41 lisa kernel: ata3: hard resetting link Jan 26 15:31:46 lisa kernel: ata3: link is slow to respond, please be patient (ready=0) Jan 26 15:32:16 lisa kernel: ata3: COMRESET failed (errno=-16) Jan 26 15:32:16 lisa kernel: ata3: limiting SATA link speed to 1.5 Gbps Jan 26 15:32:16 lisa kernel: ata3: hard resetting link Jan 26 15:32:21 lisa kernel: ata3: COMRESET failed (errno=-16) Jan 26 15:32:21 lisa kernel: ata3: reset failed, giving up Jan 26 15:32:21 lisa kernel: ata3.00: disabled Jan 26 15:32:21 lisa kernel: ata3.00: device reported invalid CHS sector 0 Jan 26 15:32:21 lisa kernel: ata3: EH complete Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 60 57 aa b8 00 00 08 00 Jan 26 15:32:21 lisa kernel: block drbd10: local READ IO error sector 751278776+8 on dm-11 Jan 26 15:32:21 lisa kernel: block drbd10: disk( UpToDate -> Failed ) Jan 26 15:32:21 lisa kernel: block drbd10: Local IO failed in __req_mod. Detaching... Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 ed 87 9c b0 00 01 00 00 Jan 26 15:32:21 lisa kernel: block drbd20: local WRITE IO error sector 5737648+80 on dm-15 Jan 26 15:32:21 lisa kernel: block drbd20: disk( UpToDate -> Failed ) Jan 26 15:32:21 lisa kernel: block drbd20: Local IO failed in __req_mod. Detaching... Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code Jan 26 15:32:21 lisa kernel: 00 60 Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] 5aResult: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:32:21 lisa kernel: 4d Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: 20Write(10) 00: 00 2a 70 00 00 4c Jan 26 15:32:21 lisa kernel: c6 f4 5c 00 Jan 26 15:32:21 lisa kernel: block drbd10: local READ IO error sector 751451424+112 on dm-11 Jan 26 15:32:21 lisa kernel: 03 c0 00 Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 60 5a 5a 10 00 00 10 00 Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 ed 93 c0 00 00 00 08 00 Jan 26 15:32:21 lisa kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 64 7f 15 30 00 00 08 00 Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 ed c7 d5 b8 00 00 88 00 Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 66 64 0a 80 00 00 80 00 Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:32:21 lisa kernel: sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 23 5b 35 10 00 00 08 00 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 593179920 Jan 26 15:32:21 lisa kernel: block drbd12: disk( UpToDate -> Failed ) Jan 26 15:32:21 lisa kernel: block drbd12: Local IO failed in __req_mod. Detaching... Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 2981127976 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1288108060 Jan 26 15:32:21 lisa kernel: block drbd9: disk( UpToDate -> Failed ) Jan 26 15:32:21 lisa kernel: block drbd9: Local IO failed in __req_mod. Detaching... Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 362774536 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 3992591872 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 4005401800 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 354019696 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 358813736 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 358813888 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1288109060 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 368273424 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 394919584 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1288110084 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 3979594368 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1309972064 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1318585072 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1337646128 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1687792736 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1728323368 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1967019240 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 3985083456 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 3985089856 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 3985957176 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 574321536 Jan 26 15:32:21 lisa kernel: block drbd12: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 1286082789 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 4005310816 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 4006853736 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 358813704 Jan 26 15:32:21 lisa kernel: block drbd20: bitmap WRITE of 0 pages took 0 jiffies Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 358813776 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 358813800 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 358813856 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 366970208 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 366971320 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 374926224 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340784640 Jan 26 15:32:21 lisa kernel: block drbd8: disk( UpToDate -> Failed ) Jan 26 15:32:21 lisa kernel: block drbd8: Local IO failed in __req_mod. Detaching... Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340784640 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340784640 Jan 26 15:32:21 lisa kernel: block drbd20: 0 KB (0 bits) marked out-of-sync by on disk bit-map. Jan 26 15:32:21 lisa kernel: block drbd20: disk( Failed -> Diskless ) Jan 26 15:32:21 lisa kernel: block drbd12: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:32:21 lisa kernel: block drbd9: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 362774560 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 2971150376 Jan 26 15:32:21 lisa kernel: block drbd9: bitmap WRITE of 0 pages took 0 jiffies Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340784640 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340784640 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 157279776 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 158624536 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 160645136 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 185811256 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 185820912 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 185820968 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 206782472 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 269974632 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 273891744 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 273892432 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 273892472 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 274216616 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 274226968 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 274256640 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 274262008 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 275472408 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340784640 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 152584968 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 152584992 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 152585008 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 275472520 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 282806328 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 283335816 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 283335832 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 283335864 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 283336032 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 284498672 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 284498848 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 284498888 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 285716864 Jan 26 15:32:21 lisa kernel: block drbd9: 0 KB (0 bits) marked out-of-sync by on disk bit-map. Jan 26 15:32:21 lisa kernel: block drbd9: disk( Failed -> Diskless ) Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 366167312 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 285716976 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 285716992 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352796664 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352861736 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352862352 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352862368 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 157246600 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352862632 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352862648 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352870224 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352871128 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 352894944 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 353002288 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 353004776 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 353184616 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 353184744 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 2987606016 Jan 26 15:32:21 lisa kernel: block drbd11: disk( UpToDate -> Failed ) Jan 26 15:32:21 lisa kernel: block drbd11: Local IO failed in __req_mod. Detaching... [...] Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 143946712 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 143949520 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 143953480 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340791216 Jan 26 15:32:21 lisa kernel: block drbd8: IO ERROR -5 on bitmap page idx 799 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 2962190144 Jan 26 15:32:21 lisa kernel: block drbd10: IO ERROR -5 on bitmap page idx 2865 Jan 26 15:32:21 lisa kernel: block drbd11: bitmap WRITE of 0 pages took 0 jiffies Jan 26 15:32:21 lisa kernel: block drbd10: bitmap WRITE of 1 pages took 0 jiffies Jan 26 15:32:21 lisa kernel: block drbd10: we had at least one MD IO ERROR during bitmap IO Jan 26 15:32:21 lisa kernel: block drbd11: 0 KB (0 bits) marked out-of-sync by on disk bit-map. Jan 26 15:32:21 lisa kernel: block drbd11: disk( Failed -> Diskless ) Jan 26 15:32:21 lisa kernel: block drbd10: 4 KB (1 bits) marked out-of-sync by on disk bit-map. Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 2962231288 Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 2962231288 Jan 26 15:32:21 lisa kernel: block drbd10: drbd_md_sync_page_io(,2097151992s,WRITE) failed! Jan 26 15:32:21 lisa kernel: block drbd10: meta data update failed! Jan 26 15:32:21 lisa kernel: block drbd10: disk( Failed -> Diskless ) Jan 26 15:32:21 lisa kernel: block drbd8: bitmap WRITE of 1 pages took 1 jiffies Jan 26 15:32:21 lisa kernel: block drbd8: we had at least one MD IO ERROR during bitmap IO Jan 26 15:32:21 lisa kernel: block drbd8: 4 KB (1 bits) marked out-of-sync by on disk bit-map. Jan 26 15:32:21 lisa kernel: end_request: I/O error, dev sdc, sector 340791288 Jan 26 15:32:21 lisa kernel: block drbd8: drbd_md_sync_page_io(,209715192s,WRITE) failed! Jan 26 15:32:21 lisa kernel: block drbd8: meta data update failed! Jan 26 15:32:21 lisa kernel: block drbd8: disk( Failed -> Diskless ) Jan 26 15:32:21 lisa kernel: block drbd12: bitmap WRITE of 0 pages took 0 jiffies Jan 26 15:32:21 lisa kernel: block drbd12: 0 KB (0 bits) marked out-of-sync by on disk bit-map. Jan 26 15:32:21 lisa kernel: block drbd12: disk( Failed -> Diskless ) Jan 26 15:32:21 lisa kernel: block drbd10: Should have called drbd_al_complete_io(, 423028524), but my Disk seems to have failed :( Jan 26 15:32:21 lisa kernel: block drbd10: Should have called drbd_al_complete_io(, 423028644), but my Disk seems to have failed :( Jan 26 15:32:21 lisa kernel: block drbd10: Should have called drbd_al_complete_io(, 423028732), but my Disk seems to have failed :( Jan 26 15:32:21 lisa kernel: block drbd8: receiver updated UUIDs to effective data uuid: 1A84D3096787250A Jan 26 15:32:21 lisa kernel: block drbd10: receiver updated UUIDs to effective data uuid: D30492119F43C658 Jan 26 15:32:21 lisa kernel: block drbd12: receiver updated UUIDs to effective data uuid: 7127576F45252CCE Jan 26 15:32:21 lisa kernel: block drbd20: receiver updated UUIDs to effective data uuid: FCD845AF26021866 Jan 26 15:32:21 lisa kernel: block drbd11: receiver updated UUIDs to effective data uuid: B1491E6E049252D6 Jan 26 15:32:21 lisa kernel: block drbd9: receiver updated UUIDs to effective data uuid: 5FBD52748069ACE8 Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 340791168 Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 340791280 Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 131076096 Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 131076104 Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 131076096 Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 340791168 Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 340791168 Jan 26 15:32:22 lisa kernel: end_request: I/O error, dev sdc, sector 340791168 [... continues for a while ...] Jan 26 15:32:48 lisa kernel: end_request: I/O error, dev sdc, sector 5860532208 Jan 26 15:32:48 lisa kernel: end_request: I/O error, dev sdc, sector 2048 Jan 26 15:32:48 lisa kernel: end_request: I/O error, dev sdc, sector 2056 Jan 26 15:32:48 lisa kernel: end_request: I/O error, dev sdc, sector 2048 Jan 26 15:32:51 lisa kernel: block drbd10: 1 messages suppressed in /home/phil/rpmbuild/BUILD/drbd-8.3.15/drbd/drbd_req.c:959. Jan 26 15:32:51 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:32:51 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:32:51 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:33:13 lisa kernel: block drbd10: 9 messages suppressed in /home/phil/rpmbuild/BUILD/drbd-8.3.15/drbd/drbd_req.c:959. Jan 26 15:33:13 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:33:13 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:33:13 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:33:20 lisa kernel: block drbd10: 9 messages suppressed in /home/phil/rpmbuild/BUILD/drbd-8.3.15/drbd/drbd_req.c:959. Jan 26 15:33:20 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:33:48 lisa kernel: __ratelimit: 82 callbacks suppressed Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 14 50 0f 80 00 00 08 00 Jan 26 15:33:48 lisa kernel: __ratelimit: 82 callbacks suppressed Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 14 50 0f f0 00 00 08 00 Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 07 d0 10 00 00 00 08 00 Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Unhandled error code Jan 26 15:33:48 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [...] Messages above will continue, I think you get the idea. At this point the DRBD is "diskless" on the primary host and "uptodate" on the secondary host. The secondary host has logged much less: Jan 26 15:32:21 bart kernel: block drbd10: pdsk( UpToDate -> Failed ) Jan 26 15:32:21 bart kernel: block drbd10: new current UUID D30492119F43C658:2EE09789BCC1DBCC:8085F400424C1059:8084F400424C1059 Jan 26 15:32:21 bart kernel: block drbd12: pdsk( UpToDate -> Failed ) Jan 26 15:32:21 bart kernel: block drbd12: new current UUID 7127576F45252CCE:BD7FD0BA1A43FE42:ADAC298F388B40A3:ADAB298F388B40A3 Jan 26 15:32:21 bart kernel: block drbd9: pdsk( UpToDate -> Failed ) Jan 26 15:32:21 bart kernel: block drbd9: new current UUID 5FBD52748069ACE8:5B0C1B6E2A3B9B9C:78E9A478FB9F1C15:78E8A478FB9F1C15 Jan 26 15:32:21 bart kernel: block drbd20: pdsk( UpToDate -> Failed ) Jan 26 15:32:21 bart kernel: block drbd20: new current UUID FCD845AF26021866:9D3F0E7D50FC1B10:A380AAB511DEC817:A37FAAB511DEC817 Jan 26 15:32:21 bart kernel: block drbd8: pdsk( UpToDate -> Failed ) Jan 26 15:32:21 bart kernel: block drbd8: new current UUID 1A84D3096787250A:4D4373FABBD8A3B8:11274BE647BFD1FB:11264BE647BFD1FB Jan 26 15:32:21 bart kernel: block drbd11: pdsk( UpToDate -> Failed ) Jan 26 15:32:21 bart kernel: block drbd11: new current UUID B1491E6E049252D6:AA40938D0A56F594:F0AA427AEE833AB1:F0A9427AEE833AB1 Jan 26 15:32:21 bart kernel: block drbd10: pdsk( Failed -> Diskless ) Jan 26 15:32:21 bart kernel: block drbd11: pdsk( Failed -> Diskless ) Jan 26 15:32:21 bart kernel: block drbd8: pdsk( Failed -> Diskless ) Jan 26 15:32:21 bart kernel: block drbd20: pdsk( Failed -> Diskless ) Jan 26 15:32:21 bart kernel: block drbd9: pdsk( Failed -> Diskless ) Jan 26 15:32:21 bart kernel: block drbd12: pdsk( Failed -> Diskless ) So, DRBD changed from "UpToDate" -> "Failed" -> "Diskless" on the primary host and continued to serve data from the secondary host. That is what I expect. What I did not expect, is that the virtual machine actually noticed this problem. On the console of one machine I could read "aborting journal" and all filesystems inside the VM were mounted readonly. From my understanding that behaviour should only occur on "on-io-error pass-on", but not for "detach". And even in the diskless-mode there were a lot of DRBD messages logged on the primary side, much later: Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 0 Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 8 Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 0 Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 5860532096 Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 5860532208 Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 2048 Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 2056 Jan 26 15:42:35 lisa kernel: end_request: I/O error, dev sdc, sector 2048 Jan 26 15:43:05 lisa kernel: sd 2:0:0:0: [sdc] Synchronizing SCSI cache Jan 26 15:43:05 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:43:05 lisa kernel: sd 2:0:0:0: [sdc] Stopping disk Jan 26 15:43:05 lisa kernel: sd 2:0:0:0: [sdc] START_STOP FAILED Jan 26 15:43:05 lisa kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jan 26 15:43:05 lisa kernel: ata3: hard resetting link Jan 26 15:43:10 lisa kernel: ata3: link is slow to respond, please be patient (ready=0) Jan 26 15:43:13 lisa kernel: block drbd10: 17 messages suppressed in /home/phil/rpmbuild/BUILD/drbd-8.3.15/drbd/drbd_req.c:959. Jan 26 15:43:13 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:43:13 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:43:13 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:43:15 lisa kernel: ata3: COMRESET failed (errno=-16) Jan 26 15:43:15 lisa kernel: ata3: hard resetting link Jan 26 15:43:20 lisa kernel: ata3: link is slow to respond, please be patient (ready=0) Jan 26 15:43:21 lisa kernel: block drbd10: 15 messages suppressed in /home/phil/rpmbuild/BUILD/drbd-8.3.15/drbd/drbd_req.c:959. Jan 26 15:43:21 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:43:25 lisa kernel: ata3: COMRESET failed (errno=-16) Jan 26 15:43:25 lisa kernel: ata3: hard resetting link Jan 26 15:43:30 lisa kernel: block drbd10: 17 messages suppressed in /home/phil/rpmbuild/BUILD/drbd-8.3.15/drbd/drbd_req.c:959. Jan 26 15:43:30 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:43:30 lisa kernel: block drbd10: IO ERROR: neither local nor remote data, sector 0+0 Jan 26 15:43:31 lisa kernel: ata3: link is slow to respond, please be patient (ready=0) Jan 26 15:44:00 lisa kernel: ata3: COMRESET failed (errno=-16) Jan 26 15:44:00 lisa kernel: ata3: hard resetting link Jan 26 15:44:05 lisa kernel: ata3: COMRESET failed (errno=-16) Jan 26 15:44:05 lisa kernel: ata3: reset failed, giving up Jan 26 15:44:05 lisa kernel: ata3: EH complete In summary: why did DRBD pass the local I/O-error to the upper-layer? What can be done to prevent this? Of course if any additional information are needed, I am happy to provide them. Regards, Matthias -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 308 bytes Desc: not available URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20130126/e90428c2/attachment.pgp>