[DRBD-user] Possible bug in drbd after an IO error

Lars Ellenberg Lars.Ellenberg at linbit.com
Wed Sep 21 22:23:12 CEST 2005

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


/ 2005-09-21 16:14:05 +0200
\ Francois Morris:
> Hi,
> Using drbd 0.7.10 on kernel 2.6.9 (Red Hat Enterprise Linux AS v4) I
> experienced what seems to be a bug. Here is the scenario:
> 1) An I/O error occurred on the disk on the primary node
> 2) The disk is detached and the primary node is set diskless
> 3) All is running without problem for 2 days

hm. "detached" or diskless is a degraded mode,
and diskless primary should be switched over asap.

we allow you to configure this (not hardcoded a halt/panic/die)
just because it might be more "friendly" for the operator to be able to
notify the users prior to the switch.

> 4) The kernel on the primary node run out off memory. I don't know why.
> Perhaps drbd is requiring to much memory (due to a memory leak ?).

unlikely (there are long uptimes with drbd; someone should habe noticed)
but we can have an eye on that one, too.

> But I don't think it is the most important issue.
> 5) The primary froze and had to be rebooted
> 6) drbd on the primary restarted and a synchronisation from primary to
> secondary is then performed. Obviously it was the wrong thing to do and
> some data  were lost. Before rebooting the primary was diskless so the
> consistent data was on the secondary and the synchronisation should have
> been from secondary to primary.

so to be exact, we talk about at least a "double failure" here,
and we don't pretend to be able to cope with those.

though we should handle them as gracefully as possible,
so there is probably room for improvement.

> Analysing the log I have found:
> 1) Never the secondary wrote a message saying the primary host switched
> to diskless because of an error on the disk. Probably it never received
> the information.
> 2) The data on the primary is tagged as consistent after the restart.

hm.
seemingly this information has never made it to the meta data area,
nor to the peer.

> After looking at the source I am asking some questions:
> The message in the log file "Local IO failed. Detaching" is written by
> the function drbd_chk_io_error. The bits MD_IO_ALLOWED, MDF_FullSync,
> DISKLESS and indirectly MD_DIRTY are set and the bit MDF_Consistent is
> cleared in this function.  But contrary to the function drbd_io_error
> there is no call to drbd_send_param or to drbd_md_write.

right. we are (possibly) in irq context when calling drbd_chk_io_error,
we must not do anything that might sleep...

> Often in the source the call to drbd_chk_io_error is followed by one
> to drbd_io_error but not everywhere.

not when we are (possibly) in irq context.

> I suppose the problem was due to the fact the inconsistent status  was
> never written to the disk and never sent to the other node. So after a
> restart no information in the meta data or from the other node can
> tell the disk was inconsistent. Am I wrong ?

seems like you are right.
we'll have a look.

> I attach the configuration and log files.

thanks.

> ##### quartz is the primary node, turquoise the secondary one
> ##### I/O error occured on the disk corresponding to drbd1 
> ##### while we are performing daily indexing of large amount of data on drbd1
> 
> Aug 28 07:37:49 quartz kernel: SCSI error : <0 0 0 0> return code = 0x8000002 
> Aug 28 07:37:49 quartz kernel: Info fld=0x61d49b5, Current sda: sense key Hardware Error 
> Aug 28 07:37:49 quartz kernel: ASC=40 ASCQ=8c 
> Aug 28 07:37:49 quartz kernel: end_request: I/O error, dev sda, sector 102582695 
> 
> #### quartz changed to diskless. Seems OK. 
> Aug 28 07:37:49 quartz kernel: drbd1: Local IO failed. Detaching... 
> Aug 28 07:37:49 quartz kernel: drbd1: local read failed, retrying remotely 
> #### No other message from drbd. No message on turquoise, strange.

we should have something that (tries to) write the changed meta
data flags to stable storage, and we should notify the peer.
something seems to be missing here.

> #### Two days later while performing again indexing the kernel run out of memory
> .... a lot of messages suppressed ....
> 
> #### quartz is now rebooted ####
> 
> Aug 30 08:50:08 turquoise kernel: drbd1: PingAck did not arrive in time. 
> Aug 30 08:50:08 turquoise kernel: drbd1: drbd1_asender [2750]: cstate Connected --> NetworkFailure 
> Aug 30 08:50:08 turquoise kernel: drbd1: asender terminated 
> Aug 30 08:50:08 turquoise kernel: drbd1: drbd1_receiver [2732]: cstate NetworkFailure --> BrokenPipe 
> Aug 30 08:50:08 turquoise kernel: drbd1: short read expecting header on sock: r=-512 
> Aug 30 08:50:08 turquoise kernel: drbd1: worker terminated 
> Aug 30 08:50:08 turquoise kernel: drbd1: drbd1_receiver [2732]: cstate BrokenPipe --> Unconnected 
> Aug 30 08:50:08 turquoise kernel: drbd1: Connection lost. 
> Aug 30 08:50:08 turquoise kernel: drbd1: drbd1_receiver [2732]: cstate Unconnected --> WFConnection 
> Aug 30 09:41:31 quartz kernel: klogd 1.4.1, log source = /proc/kmsg started. 
> Aug 30 09:41:31 quartz kernel: Linux version 2.6.9-11.ELsmp (bhcompile at decompose.build.redhat.com) (gcc version 3.4.3 20050227 (Red Hat 3.4.3-22)) #1 SMP Fri May 20 18:26:27 EDT 2005 
> Aug 30 09:41:31 quartz kernel: BIOS-provided physical RAM map: 
> Aug 30 09:41:31 quartz kernel:  BIOS-e820: 0000000000000000 - 00000000000a0000 (usable) 
> 
> ... messages suppressed ...
> 
> Aug 30 09:41:49 quartz kernel: drbd: initialised. Version: 0.7.10 (api:77/proto:74) 
> Aug 30 09:41:49 quartz kernel: drbd: SVN Revision: 1743 build by root at quartz.lmcp.jussieu.fr, 2005-06-29 14:44:43 
> Aug 30 09:41:49 quartz kernel: drbd: registered as block device major 147 
> Aug 30 09:41:57 quartz kernel: drbd1: resync bitmap: bits=11272192 words=352256 
> Aug 30 09:41:58 quartz kernel: drbd1: size = 43 GB (45088768 KB) 
> Aug 30 09:42:00 quartz kernel: drbd1: 0 KB marked out-of-sync by on disk bit-map. 
> Aug 30 09:42:00 quartz kernel: drbd1: Found 6 transactions (276 active extents) in activity log. 
> Aug 30 09:42:00 quartz kernel: drbd1: Marked additional 128 MB as out-of-sync based on AL. 
> Aug 30 09:42:03 quartz kernel: drbd1: drbdsetup [2961]: cstate Unconfigured --> StandAlone 
> Aug 30 09:42:03 turquoise kernel: drbd1: drbd1_receiver [2732]: cstate WFConnection --> WFReportParams 
> Aug 30 09:42:03 turquoise kernel: drbd1: Handshake successful: DRBD Network Protocol version 74 
> Aug 30 09:42:03 turquoise kernel: drbd1: Connection established. 
> Aug 30 09:42:03 turquoise kernel: drbd1: I am(S): 1:00000002:00000001:00000019:00000003:01 
> Aug 30 09:42:03 turquoise kernel: drbd1: Peer(S): 1:00000002:00000001:00000019:00000003:11 
                                                    ^                                     ^
   both think they are "consistent", and quartz still knows it has been primary before.
   the "diskless" information obviously never made it to disk.

> #### Why is turquoise SyncTarget ? The date are inconsistent on quartz.
> ##### The data on drbd1 are obviously not consistent but MDF_Consistent bit is set.
> Aug 30 09:42:06 quartz kernel: drbd1: I am(S): 1:00000002:00000001:00000019:00000003:11 
> Aug 30 09:42:07 quartz kernel: drbd1: Peer(S): 1:00000002:00000001:00000019:00000003:01 


so, we seem to have at least one code path within the
local io error => detach code that "forgets" to notify the peer,
and equally does not "remember" this event itself.

that is bad.
will be fixed asap.

Thanks,

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :
__
please use the "List-Reply" function of your email client.



More information about the drbd-user mailing list