Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I'm having a serious problem with drbd hard-locking my primary node. I'm using drbd along with heartbeat to create a fail-over cluster for an Apache/PHP/MySQL server. Data replication is done over a Realtek gigabit network adaptor. The nodes come up fine, initial replication takes place across the adaptor, and all is well. I can take the nodes down manually, the other takes over, and everything appears fine.

Then, at some seemingly random time, the primary node simply "hard-locks". Nothing is displayed on the console and nothing is logged. At some lower level the machine still appears to be functioning, because both interfaces keep responding to ping. The external interface responds normally, but the internal interface (the one used for replication) drops a very high percentage of packets (around 60%).

The setup is software RAID 0 with drbd on top of it, and an ext3 file system mounted on top of that. The kernel is 2.4.22.

A couple more quick notes. I believe this only ever happens when running Primary/Secondary; that is, when one machine is down the other never crashes (though my testing in this area is limited). Also, when the machine is rebooted, replication starts again, and while it is running, pinging across the interface still gives about 50% packet loss. Could this be a cable problem, and would a bad cable cause lock-ups?

Here is cat /proc/drbd during syncing:

0: cs:SyncingAll st:Secondary/Primary ns:0 nr:81069096 dw:81069096 dr:0 pe:0 ua:17
        [==============>.....] sync'ed: 71.7% (31375/110540)M
        finish: 0:10:39h speed: 55,406 (50,666) K/sec

Can we really have a cable problem if we are getting ~50 MB/second? There is absolutely no packet loss when the sync isn't running.

The good news in all this is that the failover works flawlessly. I just don't want to be "testing" it so much ;)

--
John Lange
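P.S. To make the layering concrete, here is roughly how the stack is put together, bottom to top. The device names and mount point below are examples, not necessarily my literal values:

    /dev/md0    software RAID 0 set across the local disks
    /dev/nb0    drbd device layered on top of /dev/md0
    ext3        filesystem created on and mounted from /dev/nb0

    # on the primary only, once:
    mkfs -t ext3 /dev/nb0
    mount /dev/nb0 /var/www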
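P.P.S. In case anyone asks how I'm measuring the loss: nothing fancy, roughly the following, assuming eth1 is the replication interface and 10.0.0.2 is the peer (both placeholders):

    # 100 echo requests across the replication link; the summary
    # line at the end reports the packet-loss percentage
    ping -c 100 10.0.0.2

    # NIC-level error/drop counters for the replication interface
    ifconfig eth1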