[DRBD-user] RE: drbd 0.6.12 + heartbeat + synchronization + machine load

Zanen, Sietse van svzanen at emea.att.com
Thu May 20 12:34:49 CEST 2004

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


 

	> Hello,
	> 
	> I am testing drbd + heartbeat for an HA setup consisting of two
	> cluster members. The first is a Dell 2400, 256MB, dual PIII 500, HW
	> RAID. The second is a Dell 2300, 128MB, single PIII 500, soft RAID.
	> Both systems are running RedHat 9 with the 2.4.20-31.9smp kernel
	> (the single-proc box because of a bug in the 440GX chipset: APIC
	> only works when running an SMP kernel). I am using 0.6.12, as 0.7
	> seemed hell on my machines (loads of kernel oopses, panics, hangs
	> etc.). So far I've been having good results. I tested failover
	> between nodes, which all worked well. Until I decided to test the
	> all-out disaster scenario.
	 
	I may be wrong, but some of those 2.4.20 kernels have scheduler
	problems.
	Could be, but I haven't seen any evidence pointing in that direction.
	 
	> First I took down my primary cluster node (I did this by
	> disconnecting all NICs). Failover went well, as expected. Then I
	> decided to go for all-out by gracefully shutting down the secondary
	> node. In this scenario you would boot up the secondary cluster node
	> first, as that would have the latest data set. And as I want HA, I
	> decided not to wait for the other side of drbd to show up before
	> making the disks primary. Up until this point still no problem:
	> disks would be mounted and data served from the secondary cluster
	> node.
	>  
	> But when I booted my primary cluster node, shit really did hit the
	> fan (you should see my office, it smells terrible ;-). As soon as it
	> started replicating data off the secondary cluster node, problems
	> started. Immediately both nodes showed lock-up problems (e.g. not
	> being able to log in on the console / via ssh etc.). Already
	> logged-in sessions kept working, except that doing su would lock up
	> too. A 'cat /proc/drbd' would initially show acceptable speeds
	> (around 5MB/s, my sync-min; syncing from the primary node to the
	> secondary would reach 10MB/s+). The system load would also slowly
	> increase up to the point where heartbeat triggered a failover (if I
	> ran softdog, it would even just reset the machine):
	>  11:09:37  up 10:23,  1 user,  load average: 3.58, 3.00, 2.41
	> 85 processes: 75 sleeping, 7 running, 3 zombie, 0 stopped
	> CPU states:  70.9% user  29.0% system  0.0% nice  0.0% iowait  0.0% idle
	> Mem:   125412k av,  122820k used,    2592k free,      0k shrd,  36628k buff
	>                     78112k actv,     796k in_d,    1624k in_c
	> Swap:  787064k av,    1184k used,  785880k free                 54192k cached
	> (The CPU was usually not at 100%, more like 25 to 30%.) A load of
	> 3+ on a single-CPU machine, while not using that much memory and
	> CPU time: that's weird.
	>  
	> Also at this point sync speeds would drop to under 1MB/s. Plus the
	> console got flooded with these messages:
	> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
	> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
	> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
	> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
	> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
	> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
	> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
	> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
	> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
	> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
	> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
	> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
	> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
	> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
	> 
	> I've tried fiddling with the sync parameters (sync-nice, sync-group,
	> tl-size, etc.). Nothing helped, although the symptoms did vary (time
	> before the systems locked up, time before HB failed over, fewer or
	> more of these sock_sendmsg messages).
	>  
	> As soon as Heartbeat had shut itself down, the sync speed would
	> sometimes go up again, but other times it remained low. Same thing
	> with the load: sometimes it went down to normal values, sometimes
	> not. System lock-ups too. Stopping the sync by disconnecting the
	> secondary cluster node always brought the systems back to normal.
	>  
	> The only way the systems remained stable was doing the sync in
	> single-user mode. But as it's 70GB of data we're talking about, and
	> a 5MB/s sync would take 3hrs+, this would be unacceptable downtime.
	> I will now start with a new dataset and see if I can reproduce the
	> problem. I am not going to wait for the sync to finish in single-user
	> mode. I would not mind if, in a situation like this, syncing the
	> data back to the primary node takes a day, but it has to be stable,
	> and the secondary node has to serve the data in the meantime.
	>  
	> My drbd.conf:
	> resource drbd0 {
	>   protocol = C
	>   fsckcmd = /bin/true
	>  
	>   disk {
	>     disk-size = 4890000k
	>     do-panic
	>   }
	>  
	>   net {
	>   sync-group = 0
	>   sync-rate = 8M
	>   sync-min = 5M
	>   sync-max = 10M
	 
	If you read the example config file, you'll find that -rate is a
	deprecated synonym for -max, so don't use it.

	sync-nice will have no effect when you set -min to something which is
	way above what you typically see, since as long as the actual sync
	throughput is below -min, the syncer sets itself to the highest
	possible priority.
	 
	>   sync-nice = 0
	>   tl-size = 5000
	>  
	>   ping-int = 10
	>   timeout = 9
	 
	NO!
	Of course you may *wish* that DRBD recognises a dropped connection
	within 0.9 seconds. But do not do this on a 100MBit link, and not
	with a 500MHz CPU. That will cause a packet storm whenever the
	connection gets congested, congesting it even more.

	timeout is in tenths of a second! (Read the example config file!)
	The default is 60, so why did you set it to 9?

	Because if I'd set it to 60, it would display the error that timeout
	needs to be lower than ping-int and connect-int. I discovered that if
	you fiddle with these settings, you need to include all three in your
	config. I now set ping-int and connect-int to 10 and timeout to 90.
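
	For the record, the net section now looks roughly like this. This is
	only a sketch: the sync values are still guesswork for my hardware,
	and the comments are just my notes here, not part of the real file.

	  net {
	    # -rate dropped, it is a deprecated synonym for -max
	    sync-min    = 1M   # keep below typical throughput, or sync-nice has no effect
	    sync-max    = 10M
	    sync-nice   = 19
	    tl-size     = 5000
	    ping-int    = 10   # seconds
	    connect-int = 10   # seconds
	    timeout     = 90   # tenths of a second = 9s, stays below ping-int/connect-int
	  }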

	>   }
	>  
	>   on syslogcs-cla {
	>     device = /dev/nb0
	>     disk = /dev/sdb2
	>     address = 10.0.0.1
	>     port = 7788
	>   }
	>  
	>   on syslogcs-clb {
	>     device = /dev/nb0
	>     disk = /dev/md14
	>     address = 10.0.0.2
	>     port = 7788
	>   }
	> }
	>  
	> resource drbd1 {
	>   protocol = C
	>   fsckcmd = /bin/true
	>  
	>   disk {
	>     disk-size = 64700000k
	>     do-panic
	>   }
	>  
	>   net {
	>   sync-group = 1
	>   sync-rate = 8M
	>   sync-min = 5M
	>   sync-max = 10M
	>   sync-nice = 19
	 
	Again, see the comments for the sync section above!
	 
	>   tl-size = 5000
	>  
	>   ping-int = 10
	>   timeout = 9
	>   }
	>  
	>   on syslogcs-cla {
	>     device = /dev/nb1
	>     disk = /dev/sdb3
	>     address = 10.0.0.1
	>     port = 7789
	>   }
	>  
	>   on syslogcs-clb {
	>     device = /dev/nb1
	>     disk = /dev/md15
	>     address = 10.0.0.2
	>     port = 7789
	>   }
	> }
	> /dev/md14 is RAID0 made of two RAID1 pairs (md9 & md10)
	> /dev/md15 is RAID0 made of two RAID1 pairs (md11 & md12)
	 
	hm.
	DRBD on MD on MD on SCSI? cute.
	I was more like 'sweeet' that it worked. But then again, a block
	device = a block device = a block device = a...
	 
	> 
	> Output of the mount commands:
	> drbd1: blksize=1024 B
	> drbd1: blksize=4096 B
	> kjournald starting.  Commit interval 5 seconds
	> EXT3 FS 2.4-0.9.19, 19 August 2002 on drbd(43,1), internal journal
	> EXT3-fs: mounted filesystem with ordered data mode.
	> Why the different block sizes? Both disks show this when mounting.
	 
	Because the superblock access of the ext fs on mount/umount is done
	with a block size of 1024. It then reads the fs parameters and resets
	the block size to 4096.
	 
	> Sometimes I get the message that an md device used an obsolete
	> ioctl, but this should only be cosmetic.
	 
	yes.
	 
	> Sometimes I got the message on the SW RAID system that the block
	> size couldn't be determined and 512b was assumed.
	 
	yes.
	 
	> The SW RAID seems to outperform the HW RAID by 100%
	 
	yes. :) 
	Now, over the past few days it has seemed more and more like this was
	my problem. I did some speed testing on the HW and SW RAID and
	discovered that write rates on the HW RAID just didn't get any better
	than 5-6MB/s. As the sync rate was higher than that, it overloaded my
	system. I tried fiddling with sync-max, but the only thing I seemed
	to achieve was to delay the locking up of both systems (i.e. even
	with sync-max at 1MB/s, they would lock up after about 30 minutes);
	see the excerpt below for what I was trying.
	Luckily the Dell 2400 comes with a BIOS option to switch from HW RAID
	to plain SCSI. I did that, and now both systems are on SW RAID: no
	more lock-ups, but the sync speed now seems to drop to around 700KB/s
	after 15-20 minutes.
	That leaves the question, though, of why under- or over-performance
	of one side of the link causes both machines to lock up. This
	shouldn't happen. Never.
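
	For reference, the throttling I had been trying on the big resource
	(drbd1) looked roughly like this; the numbers are from memory and
	only illustrate the idea of keeping the resync below what the HW RAID
	could actually write:

	  net {
	    sync-group = 1
	    sync-min  = 1M   # per the advice above: keep -min below the disks' real write rate
	    sync-max  = 2M   # tried values down to 1M; even that only delayed the lock-ups
	    sync-nice = 19
	  }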
	 
	> On rare occasions I saw lock-ups of fsck or mount during heartbeat
	> start-up, one time even causing the entire system to hang during
	> reboot (killall was not able to kill a hanging mount process).
	> Maybe also important info: some md devices were syncing at the same
	> time the drbd devices were syncing. This, too, was not achieving
	> high speeds. You would expect that when the drbd sync uses 5MB/s,
	> but not when that drops. You would then expect the md sync to go
	> faster, but it didn't; it would stay at 100-300KB/s.
	 
	So don't expect performance when you thrash your harddisks on an SMP
	box with several kernel threads *plus* applications. No wonder your
	box goes south.

	On a very different scale of course, but what you are doing is
	similar to running a high-performance database with its backing
	storage being a floppy. If your hardware cannot cope with what your
	software demands, then that's what you get: unresponsive systems at
	best.
	I disagree; syncing should NEVER thrash the harddisks. MD sync
	doesn't: its sync prio just lowers as IO increases. Also, the apps I
	was running weren't even generating IO at the time. And, as the
	config might suggest, I'm setting up a syslog collector cluster,
	which needs to process around 128Kbit/s of log data. I would actually
	expect to be able to write that to a floppy, wouldn't I? So if the MD
	sync is doing 100KB/s, the drbd sync should be able to do 10MB/s
	easily, and vice versa.
	 
	>  
	> Lots of information, but probably more is needed. I will let you
	> know if I can reproduce the problem once I have created new datasets
	> to test with.
	 
	BTW,
	ext3 is known to have performance problems on SMP under heavy IO load
	anyway. That is because writes which were contiguous on UP (for the
	respective applications) become scattered between multiple threads,
	and all applications will have to seek ... you know what happens.

	That would be entirely beside the point here, as the SMP box was just
	receiving sync data and did not have the FS mounted. And as stated
	above, the IO load is not high; it is very low indeed.
	 
	 Lars Ellenberg
	 
	 
