Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I'm having a serious problem with drbd hard-locking my primary node. I'm using drbd along with heartbeat to create a fail-over cluster for an Apache/PHP/MySQL server. Data replication is done over a Realtek gigabit network adaptor. The nodes come up fine, initial replication takes place across the adaptor, and all is well. I can take the nodes down manually, the other takes over, and everything appears fine.

Then, at some seemingly random time, the primary node simply "hard-locks". Nothing is displayed on the console and nothing is logged. At some lower level the machine still appears to be functioning, because both interfaces keep responding to ping. The external interface responds normally, but the internal interface (the one used for replication) drops a very high percentage of packets (around 60%).

The setup is software RAID 0 with drbd on top of it, and an ext3 file system mounted on top of that. The kernel is 2.4.22.

A couple more quick notes. I believe this only ever happens when running Primary/Secondary; that is, when one machine is down the other never crashes (though my testing in this area is limited). Also, when the machine is rebooted, replication starts again, and while it is running, pinging across the interface still gives about 50% packet loss. Could this be a cable problem, and would a bad cable cause lock-ups?

Here is cat /proc/drbd during syncing:

0: cs:SyncingAll st:Secondary/Primary ns:0 nr:81069096 dw:81069096 dr:0 pe:0 ua:17
        [==============>.....] sync'ed: 71.7% (31375/110540)M
        finish: 0:10:39h speed: 55,406 (50,666) K/sec

Can we really have a cable problem if we are getting ~50 MB/second? There is absolutely no packet loss when the sync isn't running.

The good news in all this is that the failover works flawlessly. I just don't want to be "testing" it so much ;)

--
John Lange
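P.S. To make the layering concrete, here is roughly how the stack is put together, bottom to top. The device names and mount point below are examples, not necessarily my literal values:

    /dev/md0    software RAID 0 set across the local disks
    /dev/nb0    drbd device layered on top of /dev/md0
    ext3        filesystem created on and mounted from /dev/nb0

    # on the primary only, once:
    mkfs -t ext3 /dev/nb0
    mount /dev/nb0 /var/www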
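P.P.S. In case anyone asks how I'm measuring the loss: nothing fancy, roughly the following, assuming eth1 is the replication interface and 10.0.0.2 is the peer (both placeholders):

    # 100 echo requests across the replication link; the summary
    # line at the end reports the packet-loss percentage
    ping -c 100 10.0.0.2

    # NIC-level error/drop counters for the replication interface
    ifconfig eth1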