First off, let me say that this is the first time I have been on a mailing list like this, so if I am not following procedures correctly please bear with me and let me know. I am going to provide as much information about my problem as I can, so I apologize in advance for the length of this post.

I am running two Intel SC5200 servers with Fedora Core 1. The RPM I used to install drbd was drbd-0.6.8-1.i386.rpm (I don't know if there is a newer one, but this is the one we have tested on identical servers with the same OS). We have many sites running this configuration with no problem. However, I have one site that has been having problems for about 6 months.

The problem is that several times a day the users get locked up momentarily. It usually lasts only a few seconds and then their session is restored. In the past, whenever this happened I consistently saw the following messages in /var/log/messages:

Dec 5 13:11:04 state1 kernel: drbd0: sock_sendmsg returned 0
Dec 5 13:11:04 state1 kernel: drbd0: Connection lost.
Dec 5 13:11:04 state1 kernel: drbd0: Connection established. size=71673996 KB / blksize=4096 B
Dec 5 13:11:04 state1 kernel: drbd0: Synchronisation started blks=15
Dec 5 13:11:38 state1 kernel: drbd0: Synchronisation done.

In addition, I quite often see the following in /var/log/ha-log, although I don't get this message every time I get the ones above:

heartbeat: 2005/12/05_13:11:04 WARN: Late heartbeat: Node state1: interval 18050 ms

I have tried several things in an effort to alleviate this problem, and so far none of my attempts have yielded any positive results. I have reinstalled the OS and drbd on both machines. We tried a new network card, but at the time we were only able to do that on one server at a time (one server was always using the original NIC).
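
To put a number on how often the drbd "Connection lost" messages line up with the "Late heartbeat" warnings, something along these lines could be run against the two logs. This is only a rough sketch: it assumes the two line formats shown in the excerpts above, the function names and the 5-second window are made up for illustration, and because syslog timestamps carry no year the caller has to supply one.

```python
import re
from datetime import datetime, timedelta

# syslog style: "Dec 5 13:11:04 state1 kernel: drbd0: Connection lost."
DRBD_RE = re.compile(
    r"^(\w{3})\s+(\d{1,2}) (\d{2}:\d{2}:\d{2}) \S+ kernel: drbd\d+: Connection lost"
)
# heartbeat style: "heartbeat: 2005/12/05_13:11:04 WARN: Late heartbeat: ..."
HB_RE = re.compile(
    r"^heartbeat: (\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) WARN: Late heartbeat"
)


def drbd_losses(lines, year):
    """Return timestamps of drbd 'Connection lost' lines (year is assumed)."""
    found = []
    for line in lines:
        m = DRBD_RE.match(line)
        if m:
            month, day, clock = m.groups()
            stamp = f"{year} {month} {day} {clock}"
            found.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return found


def late_heartbeats(lines):
    """Return timestamps of heartbeat 'Late heartbeat' warnings."""
    found = []
    for line in lines:
        m = HB_RE.match(line)
        if m:
            found.append(datetime.strptime(m.group(1), "%Y/%m/%d_%H:%M:%S"))
    return found


def correlate(losses, lates, window_s=5):
    """Pair each connection loss with True if a late heartbeat is within window_s seconds."""
    window = timedelta(seconds=window_s)
    return [(t, any(abs(t - h) <= window for h in lates)) for t in losses]
```

It would be fed the lines of /var/log/messages and /var/log/ha-log (for example via open(path).readlines()), and the result lists each connection loss together with whether a late heartbeat was logged close to it.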

Recently someone pointed me to the NICs as the most likely cause of this problem, and suggested I take a look at those as well as checking the IRQs. There was a serious problem (IMHO) with the interrupts: both the PCI and the onboard SCSI controllers, both onboard NICs, and the usb-ohci were all on the same interrupt. I thought for sure that this would be what was causing the problem. However, after I was able to force all these devices onto their own separate IRQs (with the exception of one SCSI controller and the usb-ohci, which still share an IRQ), we are still having the problem. One thing that I find strange is that I am not seeing the drbd messages in the logs as frequently, but my users are still getting several lockups a day.

And then this morning I saw some things in the logs that didn't make any sense to me. Last night the mirroring and clustering on the backup server were turned off at around midnight so that the users would not continue to get locked up throughout the night. When I checked the logs this morning I saw this:

heartbeat: 2005/12/08_00:05:44 info: Received shutdown notice from 'state2'.
heartbeat: 2005/12/08_00:05:44 info: Resources being acquired from state2.
heartbeat: 2005/12/08_00:05:44 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2005/12/08_00:05:44 info: Taking over resource group 14.84.14.149
heartbeat: 2005/12/08_00:05:44 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys state1]
heartbeat: 2005/12/08_00:05:44 info: Resource acquisition completed.
heartbeat: 2005/12/08_00:05:45 info: /usr/lib/heartbeat/mach_down: nice_failback: acquiring foreign resources
heartbeat: 2005/12/08_00:05:45 info: mach_down takeover complete.
heartbeat: 2005/12/08_00:05:45 info: mach_down takeover complete for node state2.
heartbeat: 2005/12/08_00:06:30 WARN: node state2: is dead
heartbeat: 2005/12/08_00:06:30 info: Dead node state2 held no resources.
heartbeat: 2005/12/08_00:06:30 info: Resources being acquired from state2.
heartbeat: 2005/12/08_00:06:30 info: Link state2:eth1 10.1.13.3 dead.
heartbeat: 2005/12/08_00:06:30 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2005/12/08_00:06:30 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys state1]
heartbeat: 2005/12/08_00:06:30 info: Resource acquisition completed.
heartbeat: 2005/12/08_00:06:30 info: Taking over resource group 14.84.14.149
heartbeat: 2005/12/08_00:06:30 info: /usr/lib/heartbeat/mach_down: nice_failback: acquiring foreign resources
heartbeat: 2005/12/08_00:06:30 info: mach_down takeover complete.
heartbeat: 2005/12/08_00:06:30 info: mach_down takeover complete for node state2.

All of this was logged even though the mirroring and clustering were turned off last night at midnight. This may or may not be relevant; I'm not really sure, because I haven't seen it before when the backup server isn't mirroring or clustering.

If any additional information is needed (which wouldn't surprise me), please don't hesitate to let me know. I will reply as quickly as possible. Any assistance in getting this matter resolved will be greatly appreciated.

Thanks
Alex Kerr
Dice Corp.