[DRBD-user] drbd 0.6.12 + heartbeat + synchronization + machine load

Zanen, Sietse van svzanen at emea.att.com
Mon May 17 12:20:14 CEST 2004

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello,
 
I am testing drbd + heartbeat for an HA setup consisting of two cluster
members. The first is a Dell 2400, 256MB, dual PIII 500, HW RAID. The
second is a Dell 2300, 128MB, single PIII 500, SW RAID. Both systems
run RedHat 9 with the 2.4.20-31.9smp kernel (the single-processor box
as well, because of a bug in the 440GX chipset: APIC only works when
running an SMP kernel). I am using 0.6.12, as 0.7 was hell on my
machines (loads of kernel oopses, panics, hangs etc.). So far I have
been having good results: failover tests between the nodes all worked
well. Until I decided to test the all-out disaster scenario.
 
First I took down my primary cluster node (by disconnecting all its
NICs). Failover went as expected. Then I went for the all-out scenario
by also gracefully shutting down the secondary node. In this scenario
you would boot the secondary cluster node first, as it holds the latest
data set, and since I want HA, I decided not to wait for the other drbd
side to show up before making the disks primary. Up to this point there
was still no problem: the disks were mounted and data was served from
the secondary cluster node.
 
But when I booted my primary cluster node, the shit really hit the fan
(you should see my office, it smells terrible ;-). As soon as it started
replicating data from the secondary cluster node, the problems started.
Both nodes immediately showed lock-up symptoms (e.g. not being able to
log in on the console or via ssh). Sessions that were already logged in
kept working, except that running su would lock up as well. A
'cat /proc/drbd' would initially show acceptable speeds (around 5MB/s,
my sync-min; syncing from the primary node to the secondary reaches
10MB/s+). The system load would also slowly increase, up to the point
where heartbeat triggered a failover (with softdog running, it would
even just reset the machine):
 11:09:37  up 10:23,  1 user,  load average: 3.58, 3.00, 2.41
85 processes: 75 sleeping, 7 running, 3 zombie, 0 stopped
CPU states:  70.9% user  29.0% system   0.0% nice   0.0% iowait   0.0% idle
Mem:   125412k av,  122820k used,    2592k free,      0k shrd,  36628k buff
                     78112k actv,     796k in_d,   1624k in_c
Swap:  787064k av,    1184k used,  785880k free                  54192k cached
(The CPU was usually not at 100%, more like 25 to 30%.) A load of 3+ on
a single-CPU machine that is not using that much memory or CPU time:
that's weird.
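
For reference, something as simple as the loop below is enough to watch
the sync speed and the load side by side (just a sketch, any monitoring
will do):

#!/bin/sh
# Print drbd sync progress and the load average every 10 seconds.
while true; do
    date
    cat /proc/drbd   # resync speed / progress
    uptime           # load average
    sleep 10
done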
 
At this point the sync speed would also drop to under 1MB/s, and the
console got flooded with these messages:
drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967294
drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295
drbd1: [drbd_syncer_1/4321] sock_sendmsg time expired, ko = 4294967295

I've tried fiddling with the sync parameters (sync-nice, sync-group,
tl-size, etc.); nothing helped, although the symptoms did vary (time
before the system locked up, time before heartbeat failed over, more or
fewer of these sock_sendmsg messages).
 
As soon as heartbeat had shut itself down, the sync speed would
sometimes go up again, but at other times it remained low. Same with the
load: sometimes it went down to normal values, sometimes not, and
likewise for the lock-ups. Stopping the sync by disconnecting the
secondary cluster node always brought the systems back to normal.
 
The only way the systems remained stable was to do the sync in
single-user mode. But since we're talking about 70GB of data, a sync at
5MB/s would take 3+ hours, which would be unacceptable downtime. I will
now start with a new data set and see if I can reproduce the problem; I
am not going to wait for the sync to finish in single-user mode. I would
not mind if, in a situation like this, syncing the data back to the
primary node takes a day, but it has to be stable and the secondary node
has to serve the data in the meantime.
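
The back-of-the-envelope math behind that estimate (rough integer
arithmetic, ignoring that the actual sync rate varied):

# 70GB at the 5MB/s sync-min floor: 70 * 1024 MB / 5 MB/s = 14336 s, just under 4 hours.
echo $(( 70 * 1024 / 5 / 3600 ))   # prints 3 (whole hours), hence "3+ hours"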
 
My drbd.conf:
resource drbd0 {
  protocol = C
  fsckcmd = /bin/true
 
  disk {
    disk-size = 4890000k
    do-panic
  }
 
  net {
    sync-group = 0
    sync-rate = 8M
    sync-min = 5M
    sync-max = 10M
    sync-nice = 0
    tl-size = 5000

    ping-int = 10
    timeout = 9
  }
 
  on syslogcs-cla {
    device = /dev/nb0
    disk = /dev/sdb2
    address = 10.0.0.1
    port = 7788
  }
 
  on syslogcs-clb {
    device = /dev/nb0
    disk = /dev/md14
    address = 10.0.0.2
    port = 7788
  }
}
 
resource drbd1 {
  protocol = C
  fsckcmd = /bin/true
 
  disk {
    disk-size = 64700000k
    do-panic
  }
 
  net {
    sync-group = 1
    sync-rate = 8M
    sync-min = 5M
    sync-max = 10M
    sync-nice = 19
    tl-size = 5000

    ping-int = 10
    timeout = 9
  }
 
  on syslogcs-cla {
    device = /dev/nb1
    disk = /dev/sdb3
    address = 10.0.0.1
    port = 7789
  }
 
  on syslogcs-clb {
    device = /dev/nb1
    disk = /dev/md15
    address = 10.0.0.2
    port = 7789
  }
}
/dev/md14 is a RAID0 made of two RAID1 pairs (md9 & md10)
/dev/md15 is a RAID0 made of two RAID1 pairs (md11 & md12)
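
For completeness, the layering on the SW RAID node can be double-checked
with the usual tools (assuming mdadm is installed):

cat /proc/mdstat            # shows md14/md15 and the underlying RAID1 pairs
mdadm --detail /dev/md14    # RAID0 built from md9 and md10
mdadm --detail /dev/md15    # RAID0 built from md11 and md12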

Kernel messages printed during the mount commands:
drbd1: blksize=1024 B
drbd1: blksize=4096 B
kjournald starting.  Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on drbd(43,1), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Why the different block sizes? Both disks show this when mounting.
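
A possible explanation (just a guess): 1024 is the block size the drbd
device starts out with, and ext3 switches the device to its 4096-byte
filesystem block size at mount time. The filesystem block size itself
can be checked with e2fsprogs:

# Run on the node that currently has the devices primary:
tune2fs -l /dev/nb0 | grep -i 'block size'
tune2fs -l /dev/nb1 | grep -i 'block size'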

Sometimes I get a message that an md device used an obsolete ioctl, but
this should only be cosmetic.
Sometimes, on the SW RAID system, I got a message that the block size
could not be determined and 512 bytes was assumed.
The SW RAID seems to outperform the HW RAID by 100%.
On rare occasions I saw fsck or mount lock up during heartbeat start-up,
one time even causing the entire system to hang during a reboot (killall
was not able to kill the hanging mount process).
Maybe also important: some md devices were resyncing at the same time as
the drbd devices were syncing. The md resync was not achieving high
speeds either. You would expect that while the drbd sync is using its
5MB/s, but not once the drbd sync speed drops; you would then expect the
md resync to go faster, but it didn't, it stayed at 100-300KB/s.
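
For completeness: the md resync speed can be capped via the kernel's
raid speed limits, which might keep it from competing with the drbd sync
for the disks (untested here, values only an example):

# Current limits, in KB/s per device:
cat /proc/sys/dev/raid/speed_limit_min
cat /proc/sys/dev/raid/speed_limit_max

# Throttle the md resync while the drbd sync runs:
echo 100  > /proc/sys/dev/raid/speed_limit_min
echo 1000 > /proc/sys/dev/raid/speed_limit_max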
 
Lots of information, but probably more is needed. I will let you know
whether I can reproduce the problem once I have created new data sets to
test with.
 
Sietse