>> Hello,
>>
>> I am testing drbd + heartbeat for an HA setup consisting of two cluster
>> members. The first is a Dell 2400, 256MB, dual PIII 500, HW RAID. The
>> second is a Dell 2300, 128MB, single PIII 500, soft RAID. Both systems
>> are running RedHat 9 with the 2.4.20-31.9smp kernel (the single-proc box
>> too, because of a bug in the 440GX chipset: APIC only works when running
>> an SMP kernel). I am using 0.6.12, as 0.7 seemed hell on my machines
>> (loads of kernel oopses, panics, hangs etc.). So far I've been having
>> good results. I tested failover between nodes, which all worked well.
>> Until I decided to test the all-out disaster scenario.

> I may be wrong, but some of those 2.4.20 kernels have scheduler problems.

Could be, but haven't seen any evidence in that direction.

>> First I took down my primary cluster node (I did this by disconnecting
>> all NICs). Failover went well, as expected. Then I decided to go all-out
>> by gracefully shutting down the secondary node as well. In this scenario
>> you would boot the secondary cluster node first, as that would have the
>> latest data set. And as I want HA, I decided not to wait for the other
>> side of drbd to show up before making the disks primary. Up until this
>> point, still no problem: the disks would be mounted and data served from
>> the secondary cluster node.
>>
>> But when I booted my primary cluster node, the shit really did hit the
>> fan (you should see my office, it smells terrible ;-). As soon as it
>> started replicating data off the secondary cluster node, problems
>> started. Immediately both nodes were showing lock-up problems (e.g. not
>> being able to log in on the console, via ssh, etc.). Already logged-in
>> sessions kept working, except that running su would lock up as well. A
>> 'cat /proc/drbd' would initially show acceptable speeds (around 5MB/s, my
>> sync-min; syncing from the primary node to the secondary would reach
>> 10MB/s+). Also, the system load would slowly increase up to the point
>> where heartbeat triggered a failover (if I ran softdog, it would even
>> just reset the machine):
>>
>>  11:09:37  up 10:23,  1 user,  load average: 3.58, 3.00, 2.41
>> 85 processes: 75 sleeping, 7 running, 3 zombie, 0 stopped
>> CPU states: 70.9% user  29.0% system  0.0% nice  0.0% iowait  0.0% idle
>> Mem:  125412k av,  122820k used,   2592k free,     0k shrd,  36628k buff
>>                     78112k actv,    796k in_d,   1624k in_c
>> Swap: 787064k av,    1184k used, 785880k free               54192k cached
>>
>> (The CPU was usually not at 100%, more like 25 to 30%.) A load of 3+ on
>> a single-CPU machine, while not using that much memory and CPU time;
>> that is weird.
>>
>> Also, at this point the sync speed would drop to under 1MB/s. Plus the
>> console got flooded with these messages:
>> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
>> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
>> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
>> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
>> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
>> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
>> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
>> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
>> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
>> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
>> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
>> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
>> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
>> drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
>>
>> I've tried fiddling with the sync parameters (sync-nice, sync-group,
>> tl-size, etc.); nothing helped, although the symptoms did vary (time
>> before the systems locked up, time before heartbeat failed over, fewer
>> or more of these sock_sendmsg messages).
>>
>> As soon as heartbeat had shut itself down, the sync speed would
>> sometimes go up again, but other times it remained low. Same with the
>> load: sometimes it went down to normal values, sometimes not. Likewise
>> for the system lock-ups. Stopping the sync by disconnecting the
>> secondary cluster node always brought the systems back to normal.
>>
>> The only way the systems remained stable was doing the sync in single
>> user mode. But as it's 70GB of data we're talking about, and a 5MB/s
>> sync would take close to 4 hours, that would be unacceptable downtime.
>> I will now start with a new dataset and see if I can reproduce the
>> problem. I am not going to wait for the sync to finish in single user
>> mode. I would not mind if, in a situation like this, syncing the data
>> back to the primary node takes a day, but it has to be stable and the
>> secondary node has to serve the data in the meantime.
>>
>> My drbd.conf:
>> resource drbd0 {
>>   protocol   = C
>>   fsckcmd    = /bin/true
>>
>>   disk {
>>     disk-size = 4890000k
>>     do-panic
>>   }
>>
>>   net {
>>     sync-group = 0
>>     sync-rate  = 8M
>>     sync-min   = 5M
>>     sync-max   = 10M

> if you read the example config file, you will find that -rate is a
> deprecated synonym for -max. so don't use it.

> sync-nice will have no effect when you set -min to something which is
> way above what you typically see, since as long as the actual sync
> throughput is below -min, the syncer sets itself to the highest possible
> priority.

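Taken together, a net section along those lines might look something like
this (0.6-style syntax as in the config above; the values and the # notes
are only illustrative annotations, not tested settings):

    net {
        sync-group = 0
        sync-min   = 1M    # keep this below the sync throughput you
                           # typically see, otherwise the syncer stays at
                           # highest priority and sync-nice never applies
        sync-max   = 10M   # sync-rate dropped: it is only a deprecated
                           # synonym for sync-max
        sync-nice  = 19
    }
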
>>     sync-nice  = 0
>>     tl-size    = 5000
>>
>>     ping-int   = 10
>>     timeout    = 9

> NO!
> of course you may *wish* that DRBD recognises a drop of the connection
> in 0.9 seconds. but do not do this on a 100MBit link, and not with a
> 500MHz CPU. that will cause a packet storm whenever the connection gets
> congested, congesting it even more.

> timeout is in tenths of a second! (read the example config file!)
> the default is 60, so why did you set it to 9?

Because if I had set it to 60, it would display the error that timeout
needs to be lower than ping-int and connect-int. I discovered that if you
fiddle with these settings, you need to include all three in your config.
I have now set ping-int and connect-int to 10 and timeout to 90.

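With the units spelled out, that combination works out roughly as follows
(assuming ping-int and connect-int are in seconds; the # notes are only
annotations for this sketch, not tested settings):

    net {
        ping-int    = 10    # seconds
        connect-int = 10    # seconds
        timeout     = 90    # tenths of a second, i.e. 9s, which stays
                            # below both ping-int and connect-int
    }
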
>>   }
>>
>>   on syslogcs-cla {
>>     device  = /dev/nb0
>>     disk    = /dev/sdb2
>>     address = 10.0.0.1
>>     port    = 7788
>>   }
>>
>>   on syslogcs-clb {
>>     device  = /dev/nb0
>>     disk    = /dev/md14
>>     address = 10.0.0.2
>>     port    = 7788
>>   }
>> }
>>
>> resource drbd1 {
>>   protocol   = C
>>   fsckcmd    = /bin/true
>>
>>   disk {
>>     disk-size = 64700000k
>>     do-panic
>>   }
>>
>>   net {
>>     sync-group = 1
>>     sync-rate  = 8M
>>     sync-min   = 5M
>>     sync-max   = 10M
>>     sync-nice  = 19

> again, see the comments for the sync section above!

>>     tl-size    = 5000
>>
>>     ping-int   = 10
>>     timeout    = 9
>>   }
>>
>>   on syslogcs-cla {
>>     device  = /dev/nb1
>>     disk    = /dev/sdb3
>>     address = 10.0.0.1
>>     port    = 7789
>>   }
>>
>>   on syslogcs-clb {
>>     device  = /dev/nb1
>>     disk    = /dev/md15
>>     address = 10.0.0.2
>>     port    = 7789
>>   }
>> }
>>
>> /dev/md14 is a RAID0 made of two RAID1 pairs (md9 & md10)
>> /dev/md15 is a RAID0 made of two RAID1 pairs (md11 & md12)

> hm.
> DRBD on MD on MD on SCSI? cute.

I was more like "sweet" that it worked. But then again, a block device =
a block device = a block device = a.......

>> Output of the mount commands:
>> drbd1: blksize=1024 B
>> drbd1: blksize=4096 B
>> kjournald starting.  Commit interval 5 seconds
>> EXT3 FS 2.4-0.9.19, 19 August 2002 on drbd(43,1), internal journal
>> EXT3-fs: mounted filesystem with ordered data mode.
>> Why the different block size? Both disks show this when mounting.

> because the superblock access of the ext filesystem on mount/umount is
> done with a block size of 1024. it then reads the fs parameters and
> resets the block size to 4096.

>> Sometimes I get the message that an md device used an obsolete ioctl,
>> but this should only be cosmetic.

> yes.

>> Sometimes I got the message on the SW RAID system that the block size
>> couldn't be determined and 512 bytes was assumed.

> yes.

>> The SW RAID seems to outperform the HW RAID by 100%.

> yes. :)

Now, in the past few days it has looked more and more like this was my
problem. I did some speed testing on the HW and SW RAID and discovered
that write rates on the HW RAID just didn't get any better than 5-6MB/s.
As the sync rate was higher than that, it overloaded my system. I tried
fiddling with sync-max, but the only thing I seemed to achieve was to
delay the lock-up of both systems (i.e. even with sync-max at 1MB/s, it
would lock up after about 30 minutes).

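Concretely, the "fiddling with sync-max" above just means capping the
syncer below what the backing device can actually write; a sketch with
purely illustrative values:

    net {
        sync-min = 1M
        sync-max = 4M    # below the ~5-6MB/s the HW RAID could sustain
    }

...which, as said, only delayed the lock-ups here rather than preventing
them.
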
Luckily the Dell 2400 comes with a BIOS option to switch from HW RAID to
plain SCSI. I did that, and now both systems are on SW RAID. No more
lock-ups, but the sync speed now seems to drop to around 700KB/s after
15-20 minutes.

That still leaves the question of why under- or over-performance on one
side of the link causes both machines to lock up. This shouldn't happen.
Never.

>> On rare occasions I saw lock-ups of fsck or mount during heartbeat
>> start-up, one time even causing the entire system to hang during reboot
>> (killall was not able to kill a hanging mount process).
>> Maybe also important info: some md devices were syncing at the same
>> time the drbd devices were syncing. This too was not achieving high
>> speeds. You would expect that while the drbd sync is using 5MB/s, but
>> not when that drops. You would then expect the md sync to go faster,
>> but it didn't; it stayed at 100-300KB/s.

> so don't expect performance when you thrash your harddisks on an SMP
> box with several kernel threads *plus* applications.
> no wonder your box goes south.

> on a very different scale of course, but what you are doing is similar
> to running a high performance database with its backing storage being a
> floppy. if your hardware cannot cope with what your software demands,
> then that's what you get: unresponsive systems at best.

I disagree; syncing should NEVER thrash the harddisks. MD sync doesn't:
its sync priority simply drops as other IO increases. Also, the apps I
was running weren't even generating IO at the time. And, as the config
might suggest, I'm setting up a syslog collector cluster, which needs to
process around 128Kbit/s of log data. I would actually expect to be able
to write that to a floppy, wouldn't I? So if MD sync is doing 100KB/s,
drbd sync should be able to do 10MB/s easily, and vice versa.

>> Lots of information, but probably more is needed. I will let you know
>> whether I can reproduce the problem once I have created new datasets
>> to test with.

> BTW, ext3 is known to have performance problems on SMP under heavy io
> load anyways, because writes which were contiguous on UP (for the
> respective applications) become scattered between multiple threads, and
> all applications will have to seek ... you know what happens.

This is entirely beside the point and irrelevant here, as the SMP box was
just receiving sync data and did not have the FS mounted. And as stated
above, the IO load is not high; it is very low indeed.

>     Lars Ellenberg