[DRBD-user] drbdadm verify all seems to produce false positives on ext3 and crash the server

Mon Jun 23 20:18:58 CEST 2008

On Mon, Jun 23, 2008 at 04:32:25PM +0200, Eric Marin wrote:
> Hello and sorry about the length of this report,

I have a few comments, below.

>
> I'm the administrator of a cluster which consists of two nodes : ldap-a  
> and ldap-b.
>
> Both nodes are identical and run on :
> -Debian Etch (up to date)
> -DRBD 8.2.6
> -Heartbeat 2.1.3-5~bpo40+1 from etch-backports, patched to fix this bug  
> : http://hg.linux-ha.org/dev/rev/47f60bebe7b2
> -EXT3 filesystem
>
> I have edited /etc/crontab on ldap-a (which is usually the primary for  
> drbd0) to check for out-of-sync errors every sunday :
> --------8<----------------------------------------------------------------------
> 20 05   * * 0   root    /sbin/drbdadm verify all
> -----------------------------------------------------------------------8<-------
> ('all' means drbd0 and drbd1)
>
> This system has been in production for about three weeks.
>
>
>
> On the first sunday, out-of-sync sectors were detected nd one node  
> completely crashed at the end of the check, but I found out that the  
> system had been running with a corrupted filesystem for a few days... So  
> I repaired the filesystem, restored the files from a backup,  
> synchronized the nodes, and verified that 'drbdadm verify all' did not  
> complain.
>
> On the second sunday, nothing unusual was found.
>
> On the third sunday, I get this in /var/log/kern.log on ldap-a :
> --------8<-----------------------------------------------------------------------
> Jun 22 05:20:01 ldap-a kernel: drbd0: conn( Connected -> VerifyS )
> Jun 22 05:20:01 ldap-a kernel: drbd1: conn( Connected -> VerifyS )
> Jun 22 05:21:49 ldap-a kernel: drbd0: Out of sync: start=7094280, size=8  
> (sectors)
> Jun 22 05:21:58 ldap-a kernel: drbd0: Out of sync: start=7667800, size=8  
> (sectors)
> Jun 22 05:25:03 ldap-a kernel: drbd1: Online verify  done (total 301  
> sec; paused 0 sec; 32448 K/sec)
> Jun 22 05:25:03 ldap-a kernel: drbd1: conn( VerifyS -> Connected )
> Jun 22 05:25:03 ldap-a kernel: drbd0: Online verify  done (total 301  
> sec; paused 0 sec; 32448 K/sec)
> Jun 22 05:25:03 ldap-a kernel: drbd0: Online verify found 2 4k block out  
> of sync!
> Jun 22 05:25:03 ldap-a kernel: drbd0: helper command: /sbin/drbdadm  
> out-of-sync
> Jun 22 05:25:03 ldap-a kernel: drbd0: conn( VerifyS -> Connected )
> Jun 22 05:25:03 ldap-a kernel: drbd0: Writing the whole bitmap, due to  
> failed kmalloc
> Jun 22 05:25:03 ldap-a kernel: drbd0: writing of bitmap took 0 jiffies
> Jun 22 05:25:03 ldap-a kernel: drbd0: 8 KB (2 bits) marked out-of-sync  
> by on disk bit-map.
> -------------------------------------------------------------------------8<-------
>
> drbd0 is corrupted !

well.
it is not exactly in sync.
no need to panic just yet.

> From reading the manpage and the forums, it seems that false positives  
> could occur with a swap or reiserfs filesystem, unfortunately drbd0 and  
> drbd1 use ext3, so I'm not sure what to make of this...
>
>
> => So here is what I did, and what I'd like you to comment on :
>
> First, I make sure the sectors are indeed desynchronized :
> ==========================================================
> ldap-a:~# dd iflag=direct bs=512 skip=7094280 count=8 if=/dev/sda7 | xxd  -a > ldap-a-drbd0-7094280.dump
> ldap-a:~# dd iflag=direct bs=512 skip=7667800 count=8 if=/dev/sda7 | xxd  -a > ldap-a-drbd0-7667800.dump
> ldap-b:~# dd iflag=direct bs=512 skip=7094280 count=8 if=/dev/sda7 | xxd  -a > ldap-b-drbd0-7094280.dump
> ldap-b:~# dd iflag=direct bs=512 skip=7667800 count=8 if=/dev/sda7 | xxd  -a > ldap-b-drbd0-7667800.dump
> ldap-b:~# scp ldap-b-drbd0-7094280.dump ldap-a:/root/
> ldap-b:~# scp ldap-b-drbd0-7667800.dump ldap-a:/root/
> ldap-a:~# diff ldap-a-drbd0-7094280.dump ldap-b-drbd0-7094280.dump

good job!
thanks.

I personally find "diff -U0" easier to read, but thanks, this does help.

> 8c8
> < 0000f10: 1c01 0000 0d36 9d02 0000 0000 0000 0000  .....6..........
> ---
> > 0000f10: 1c01 0000 5d36 9d02 0000 0000 0000 0000  ....]6..........

so, we have a two bit flip there,
one side 5d36, the other 0d36

>
> ldap-a:~# diff ldap-a-drbd0-7667800.dump ldap-b-drbd0-7667800.dump
> 248c248
> < 0000f70: 0380 0000 c01e ea13 0000 0000 259c 9902  ............%...
> ---
> > 0000f70: 0380 0000 c01e ea13 0000 0000 3f9c 9902  ............?...

three bits this time.

> 253c253
> < 0000fc0: 0380 0000 55f8 9902 0400 0000 a657 ef03  ....U........W..
> ---
> > 0000fc0: 0380 0000 6ff8 9902 0400 0000 c857 ef03  ....o........W..

five bits and an other five bits

> 256c256
> < 0000ff0: 0000 0000 0000 0000 11be 9902 0000 0000  ................
> ---
> > 0000ff0: 0000 0000 0000 0000 2bbe 9902 0000 0000  ........+.......

four bit this time.

> Then I check whether these sectors are in use or if they are unallocated :

this does not really help "after the fact".
it would be of interesst to see wether this is in file "payload data"
area, or in file system "meta data" area (allocation bitmaps and such).

I don't know from the top of my head how to find out for ext3,
but I remember it was not too difficult.

> Should I worry about these out-of-sync alerts, or are they a kind of  
> false-positive ?

there are a few basic ways how data can end up being different on the
two replicas. let us leave out additional modes that could be seen when
hard node crashes/power cycles/etc are involved, but focus only on what
could happen during "normal" operation.
  - hardware problems, like broken RAM
  - a bug in drbd
  - a bug in other drivers
  - someone fiddling around with direct access or otherwise bypassing
    drbd

for now, let us assume we can exclude the above here.
that leaves us with

  - transfer errors
  - modification of in flight data buffers

transfer errors:
data being altered while it is transfered from one in RAM buffer to the
other in RAM buffer. tcp-checksums should help here, but cannot when
calculated in the NIC (offloading), and the bit flips occur on the bus.
end-to-end checksumming (data-integrity-alg setting in drbd.conf)
can catch such problems better.

modifications of in flight data buffers:
some file systems or usage patterns aparently do
  submit a data buffer to a block device, and then
  change the content of that data buffer while that block is in flight
  without waiting for the write request to be completed.

I think I have seen this happen with ext3, too.
this may have been a generic race in the kernel write-out path,
I don't know. I have not been able to reproduce this lately,
but we have where it would submit a certain

here the problem for DRBD is that the local harddisk may get
a different version than the remote one, depending on the timing of
which version of that data buffer is send over the wire, and which
version hits the local disk.

given the multiple bit errors (17 bits! changed in block 7667800)
I'd say it is more likely we see "modification of in flight buffers"
here, unless your bus is _very_ noisy at times, or your chipset behaves
funny under certain conditions. but you never know.

> In that case, isn't there a way to make it more accurate?

I don't know of any.
well, yes. use a single cpu box,
do a memcopy into a private buffer first,
checksum that, and submit/send that private buffer.

if anyone has better suggestions, this is the time to let me know.

> Should I disable TCP checksum offloadingl? Should I enable  
> 'replication traffic integrity checking'?

maybe first enable the "integrity checking",
see if that detects something,
then disable the checksum offloading,
see if it still occurs,
if not leave offloading disabled,
but disable again the "integrity checking" for performance reasons.

> ldap-a has crashed twice in three weeks,

what exactly does "crashed" mean?
just "unresponsive"? panic? oops? BUG()?
any logs? anything from the console?

oh, and in that case (crashed a few times),
we need to consider more failure modes:
is there any volatile disk cache involved?
is no-disk-flushes set?
is no-md-flushes set?

> and both times it was while  
> executing 'drbdadm verify all'. Couldn't that be a bug in DRBD ?

possibly, but I'd say unlikely, until proven otherwise (by an oops stack
trace for example).  I'd say the additional load of the verify caused
some unexpected memory pressure/io-load, and you box was unable to
handle that. serialize your resources, reduce the syncer rate, maybe
reduce "max-buffers".

-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
__
please don't Cc me, but send to list -- I'm subscribed