I installed heartbeat v1.1.5 on the same two HP ProLiant DL380 machines. The node names are cl1 and cl2, and cl1 is the active (and preferred) member.

I powered off the disks on cl1 to simulate a disaster, and after a few seconds I got a serious kernel error (of course) about the update process. After 2 (!) minutes I got a "low level" disk error, the system locked up, and cl2 became the new active member. This is OK, but it only happened the first time! All the other tests produced the update error and not the "low level" error. So cl1 stays up and answers all eth* and serial requests, with no failover!

This is a simple but critical condition. Another could be a process killed with signal 11 (SIGSEGV) because of a RAM problem: the service is down, yet everything seems to be fine.

The question is: how can I check that the server, or better the "services", are "really" up and running, and inform heartbeat? Is there a good respawn tool? And can I write a monitor plugin at application level and put it into heartbeat (for example, a Perl script checking e-mail, web and so on)?

Sorry for my bad English.

Thank you,
Nicola Ranaldo
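
To illustrate the kind of application-level monitor asked about above, here is a minimal sketch in Python (rather than Perl). It is not part of heartbeat. The service names, hosts, ports and the `FAILOVER_CMD` command are all assumptions made for the example; in particular, it assumes that stopping heartbeat on the active node is an acceptable way to make the peer take over, which should be verified for the installed version.

```python
import socket
import subprocess
import time

# Hypothetical service list: names, hosts and ports are placeholders.
SERVICES = {
    "smtp": ("127.0.0.1", 25),
    "http": ("127.0.0.1", 80),
}

# Assumption: stopping heartbeat on this node lets the peer take over.
# Replace with whatever your installation actually provides.
FAILOVER_CMD = ["/etc/init.d/heartbeat", "stop"]


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failed_services(services, check=check_tcp):
    """Return the sorted names of the services whose check fails."""
    return sorted(name for name, (host, port) in services.items()
                  if not check(host, port))


def monitor(services, interval=10.0, max_failures=3, rounds=None,
            on_failure=None, check=check_tcp):
    """Poll the services every `interval` seconds.

    After `max_failures` consecutive bad rounds, call `on_failure` with
    the list of failed service names and return True. Return False if
    `rounds` polls complete without reaching that threshold
    (rounds=None means poll forever).
    """
    if on_failure is None:
        on_failure = lambda failed: subprocess.call(FAILOVER_CMD)
    bad = 0
    done = 0
    while rounds is None or done < rounds:
        failed = failed_services(services, check)
        bad = bad + 1 if failed else 0  # reset on any fully healthy round
        if bad >= max_failures:
            on_failure(failed)
            return True
        done += 1
        time.sleep(interval)
    return False
```

Calling `monitor(SERVICES)` from a script started at boot would poll forever and run the failover command after three bad rounds in a row. Note that a plain TCP connect only shows that the port still accepts connections; in the failed-disk case described above a daemon might keep accepting connections while being unable to do any work, so a real check would probably need to go further (read the SMTP banner, fetch a page over HTTP, write a test file) by passing a different `check` function.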