[DRBD-user] Possible bug in drbd after an IO error

Francois Morris Francois.Morris at lmcp.jussieu.fr
Wed Sep 21 16:14:05 CEST 2005


Hi,
Using drbd 0.7.10 on kernel 2.6.9 (Red Hat Enterprise Linux AS v4) I
experienced what seems to be a bug. Here is the scenario:
1) An I/O error occurred on the disk of the primary node.
2) The disk was detached and the primary node became diskless.
3) Everything ran without problems for 2 days.
4) The kernel on the primary node ran out of memory. I don't know why.
Perhaps drbd requires too much memory (due to a memory leak?), but I
don't think that is the most important issue.
5) The primary froze and had to be rebooted.
6) drbd on the primary restarted and a synchronisation from primary to
secondary was then performed. Obviously this was the wrong thing to do,
and some data was lost. Before the reboot the primary was diskless, so
the only consistent data was on the secondary and the synchronisation
should have gone from secondary to primary.
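
To make the expected behaviour in step 6 concrete, here is a minimal
Python sketch of choosing a synchronisation source from per-node
consistency flags. It is only a toy model: NodeState and sync_source
are hypothetical names, not part of drbd, and the real drbd decision
logic is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical view of one node at reconnect time."""
    name: str
    consistent: bool  # consistency flag as read from that node's meta data


def sync_source(a: NodeState, b: NodeState):
    """Return the name of the node whose data should be copied to the peer.

    Returns None when the flags alone cannot decide (both nodes claim
    the same state), in which case some other criterion would be needed.
    """
    if a.consistent and not b.consistent:
        return a.name
    if b.consistent and not a.consistent:
        return b.name
    return None


# Expected case: the primary's disk is recorded as inconsistent,
# so the secondary must be the source.
expected = sync_source(NodeState("primary", False), NodeState("secondary", True))

# Observed case: the primary came back flagged consistent, so the flags
# no longer identify the secondary as the only good copy.
observed = sync_source(NodeState("primary", True), NodeState("secondary", True))
```

In this model the wrong direction becomes possible as soon as the
primary's flag reads "consistent" after the reboot, which matches what
the log shows.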
 
Analysing the log, I found:
1) The secondary never logged a message saying that the primary had
switched to diskless because of a disk error. Probably it never received
the information.
2) The data on the primary was tagged as consistent after the restart.
 
After looking at the source, I have some questions.
The log message "Local IO failed. Detaching" is written by the function
drbd_chk_io_error. In this function the bits MD_IO_ALLOWED, MDF_FullSync,
DISKLESS and, indirectly, MD_DIRTY are set, and the bit MDF_Consistent is
cleared. But unlike drbd_io_error, it makes no call to drbd_send_param or
to drbd_md_write. In many places in the source a call to
drbd_chk_io_error is followed by one to drbd_io_error, but not
everywhere. I suppose the problem arose because the inconsistent status
was never written to the disk and never sent to the other node. So after
a restart, nothing in the meta data or from the other node indicates
that the disk was inconsistent. Am I wrong?

I attach the configuration and log files.
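
The hypothesis above can be illustrated with a small Python model. It
is a sketch under stated assumptions, not the drbd code: the class Node
and its attributes are hypothetical, and the two methods only mimic the
behaviour described in this message (chk_io_error changes the in-memory
flag only; io_error additionally writes the meta data and informs the
peer).

```python
class Node:
    """Toy model of one node's consistency flag (all names hypothetical)."""

    def __init__(self):
        self.mem_consistent = True        # flag held in memory
        self.md_consistent = True         # flag stored in the on-disk meta data
        self.peer_sees_consistent = True  # what the peer was last told

    def chk_io_error(self):
        # Models drbd_chk_io_error as described: in-memory flags only.
        self.mem_consistent = False

    def io_error(self):
        # Models the extra steps attributed to drbd_io_error:
        # persist the flag (drbd_md_write) and tell the peer (drbd_send_param).
        self.md_consistent = self.mem_consistent
        self.peer_sees_consistent = self.mem_consistent

    def reboot(self):
        # After a reboot only the on-disk meta data survives.
        self.mem_consistent = self.md_consistent


def flags_after_reboot(call_io_error: bool):
    """Return (local flag, peer's view) after an I/O error and a reboot."""
    node = Node()
    node.chk_io_error()
    if call_io_error:
        node.io_error()
    node.reboot()
    return node.mem_consistent, node.peer_sees_consistent


without_follow_up = flags_after_reboot(call_io_error=False)
with_follow_up = flags_after_reboot(call_io_error=True)
```

In this model, skipping the follow-up call leaves both the rebooted node
and its peer believing the disk is consistent, which is the situation I
think I observed.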


-- 
François Morris Francois.Morris at impmc.jussieu.fr
IMPMC, Université P. et M. Curie, CNRS
case 115 - 4, place Jussieu - 75252 PARIS CEDEX 05 - FRANCE
Tel: +33144275073 Fax: +33144273785 http://www.impmc.jussieu.fr/~morris
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: log.txt
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20050921/9bf88e3a/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: drbd.conf
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20050921/9bf88e3a/attachment-0001.txt>