>First off, let me say that this is the first time I have been on a
>mailing list like this, so if I am not following procedures correctly
>please bear with me and let me know. I am going to provide as much
>information about my problem as I can, so I apologize in advance for
>the length of this post. I am running two Intel SC5200 servers
>with Fedora Core 1. The RPM I used to install drbd was
>drbd-0.6.8-1.i386.rpm (I don't know if there is a newer one, but this
>is the one we have tested on identical servers with the same OS). We
>have many sites running this configuration with no problem. However, I
>have one site that has been having problems for about 6 months.
>
>The problem we are having is that several times a day the users get
>locked up momentarily. It usually only lasts a few seconds and their
>session is restored. When this happened in the past I was
>consistently seeing the following messages in /var/log/messages when
>it happened.
>
>Dec 5 13:11:04 state1 kernel: drbd0: sock_sendmsg returned 0
>Dec 5 13:11:04 state1 kernel: drbd0: Connection lost.
>Dec 5 13:11:04 state1 kernel: drbd0: Connection established. size=71673996 KB / blksize=4096 B
>Dec 5 13:11:04 state1 kernel: drbd0: Synchronisation started blks=15
>Dec 5 13:11:38 state1 kernel: drbd0: Synchronisation done.
>
>In addition to that, I am also quite often seeing this in
>/var/log/ha-log; however, I don't get this message every time I get the
>ones above.
>
>heartbeat: 2005/12/05_13:11:04 WARN: Late heartbeat: Node state1: interval 18050 ms
>
>I have tried several things in an effort to alleviate this problem,
>and so far none of my attempts have yielded any positive results. I
>have reinstalled the OS and drbd on both machines. We tried a new
>network card, but unfortunately at the time we were only able to do
>that on one server at a time (one server was always using the
>original NIC).
>Recently someone pointed me to the NICs as more than
>likely being the cause of this problem, and suggested I take a look
>at those as well as checking the IRQs. There was a serious problem
>(IMHO) with the interrupts. Both the PCI and the onboard SCSI
>controllers, both onboard NICs, and the usb-ohci were all on the
>same interrupt. I thought for sure that this would be what was
>causing the problem. However, after I was able to force all these
>devices onto their own separate IRQs (with the exception of one SCSI
>controller and the usb-ohci, which still share an IRQ), we are still
>having the problem.
>
>One thing that I find strange is that I am not seeing the drbd
>messages in the logs as frequently, but my users are still getting
>several lockups a day. And then this morning I saw some things in the
>logs that didn't make any sense to me. Last night the mirroring and
>clustering on the backup server were turned off at around midnight so
>that the users would not continue to get locked up throughout the night.
>Then when I checked the logs this morning I was seeing this.
>
>heartbeat: 2005/12/08_00:05:44 info: Received shutdown notice from 'state2'.
>heartbeat: 2005/12/08_00:05:44 info: Resources being acquired from state2.
>heartbeat: 2005/12/08_00:05:44 info: Running /etc/ha.d/rc.d/status status
>heartbeat: 2005/12/08_00:05:44 info: Taking over resource group 14.84.14.149
>heartbeat: 2005/12/08_00:05:44 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys state1]
>heartbeat: 2005/12/08_00:05:44 info: Resource acquisition completed.
>heartbeat: 2005/12/08_00:05:45 info: /usr/lib/heartbeat/mach_down: nice_failback: acquiring foreign resources
>heartbeat: 2005/12/08_00:05:45 info: mach_down takeover complete.
>heartbeat: 2005/12/08_00:05:45 info: mach_down takeover complete for node state2.
>heartbeat: 2005/12/08_00:06:30 WARN: node state2: is dead
>heartbeat: 2005/12/08_00:06:30 info: Dead node state2 held no resources.
>heartbeat: 2005/12/08_00:06:30 info: Resources being acquired from state2.
>heartbeat: 2005/12/08_00:06:30 info: Link state2:eth1 10.1.13.3 dead.
>heartbeat: 2005/12/08_00:06:30 info: Running /etc/ha.d/rc.d/status status
>heartbeat: 2005/12/08_00:06:30 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys state1]
>heartbeat: 2005/12/08_00:06:30 info: Resource acquisition completed.
>heartbeat: 2005/12/08_00:06:30 info: Taking over resource group 14.84.14.149
>heartbeat: 2005/12/08_00:06:30 info: /usr/lib/heartbeat/mach_down: nice_failback: acquiring foreign resources
>heartbeat: 2005/12/08_00:06:30 info: mach_down takeover complete.
>heartbeat: 2005/12/08_00:06:30 info: mach_down takeover complete for node state2.
>
>This was logged even though the mirroring and clustering were turned
>off last night at midnight. This may or may not be relevant; I'm not
>really sure, because I haven't seen it before when the backup server
>isn't mirroring or clustering.
>
>If there is any additional information needed (which wouldn't surprise
>me) please don't hesitate to let me know. I will reply as quickly as
>possible. Any assistance in getting this matter resolved will be more
>than greatly appreciated.
>
>Thanks
>Alex Kerr
>Dice Corp.

OK, so I noticed something strange with the partitioning on my backup
system, so I decided to reinstall the OS and drbd. It took a little work
to get the primary to replicate back to the newly installed secondary,
but I do have it replicating now. However, I am concerned about this
message I am now getting in /var/log/ha-log on the primary. It comes in
every 5 seconds. Does anyone know what this is?

heartbeat: 2005/12/08_20:42:13 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:18 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:23 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:28 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:33 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:38 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:43 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:48 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:53 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:58 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]

Thanks
Alex Kerr
Dice Corp.
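
Since the drbd "Connection lost" entries and the heartbeat "Late heartbeat" warnings do not always show up together, it may help to line the two logs up by timestamp. Below is a minimal Python sketch of that idea, assuming the line formats are exactly those in the excerpts quoted in this thread. The function names, the `year` argument (the /var/log/messages lines carry no year) and the 2-second tolerance are assumptions made for illustration; none of this is part of drbd or heartbeat.

```python
import re
from datetime import datetime, timedelta

# Patterns modelled on the log excerpts quoted in this thread (assumed formats).
DRBD_LOST = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: (drbd\d+): Connection lost\."
)
LATE_HB = re.compile(
    r"^heartbeat: (\d{4}/\d\d/\d\d_\d\d:\d\d:\d\d) WARN: Late heartbeat: "
    r"Node (\S+): interval\s+(\d+)"
)


def drbd_drops(lines, year):
    """Timestamps of drbd 'Connection lost' entries from /var/log/messages.

    Syslog lines omit the year, so the caller has to supply it.
    """
    drops = []
    for line in lines:
        m = DRBD_LOST.match(line)
        if m:
            stamp = f"{year} {m.group(1)}"
            drops.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return drops


def late_heartbeats(lines):
    """(timestamp, node, interval in ms) for each 'Late heartbeat' warning in ha-log."""
    late = []
    for line in lines:
        m = LATE_HB.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y/%m/%d_%H:%M:%S")
            late.append((ts, m.group(2), int(m.group(3))))
    return late


def correlate(drops, late, tolerance_s=2):
    """Pair each drbd drop with the late heartbeats logged within tolerance_s seconds."""
    window = timedelta(seconds=tolerance_s)
    return [(d, [hb for hb in late if abs(hb[0] - d) <= window]) for d in drops]
```

To use it, read both files into lists of lines (with any wrapped log lines rejoined first), pass them to `drbd_drops` and `late_heartbeats`, and then look at which entries returned by `correlate` have an empty list. Those are drbd drops with no late heartbeat logged nearby, which might point away from the heartbeat link as the common cause.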