[DRBD-user] New error message

Fri Dec 9 02:54:13 CET 2005

>First off let me say that this is the first time I have been on a
>mailing list like this. So if I am not following procedures correctly
>please bear with me and let me know. I am going to provide as much
>information about my problem as I can so I apologize in advance for
>the length of this post. I am running on two Intel SC5200 servers
>running Fedora Core 1. The RPM I used to install drbd was
>drbd-0.6.8-1.i386.rpm (I don't know if there is a newer one, but this
>is the one we have tested on identical servers with the same OS). We
>have many sites running this configuration with no problem. However I
>have one site that has been having problems for about 6 months.
>
>     The problem we are having is that several times a day the users get
>locked up momentarily. It usually only lasts a few seconds and their
>session is restored. When this happened in the past I was
>consistently seeing the following message in /var/log/messages when
>it happened.
>
>Dec  5 13:11:04 state1 kernel: drbd0: sock_sendmsg returned 0
>Dec  5 13:11:04 state1 kernel: drbd0: Connection lost.
>Dec  5 13:11:04 state1 kernel: drbd0: Connection established. size=71673996 KB / blksize=4096 B
>Dec  5 13:11:04 state1 kernel: drbd0: Synchronisation started blks=15
>Dec  5 13:11:38 state1 kernel: drbd0: Synchronisation done.
>
>In addition, I quite often see this in /var/log/ha-log, although I
>don't get this message every time I get the one above.
>
>heartbeat: 2005/12/05_13:11:04 WARN: Late heartbeat: Node state1: interval 18050 ms
>
>     I have tried several things in an effort to alleviate this problem
>and so far none of my attempts have yielded any positive results. I
>have reinstalled the OS and drbd on both machines. We tried a new
>network card but unfortunately at the time we were only able to do
>that on one server at a time (one server was always using the
>original NIC). Recently someone pointed me to the NICs as more than
>likely being the cause of this problem, and suggested I take a look
>at those as well as checking the IRQs. There was a serious problem
>(IMHO) with the interrupts. Both the PCI and the onboard SCSI
>controllers, both onboard NICs, and the usb-ohci were all on the
>same interrupt. I thought for sure that this would be what was
>causing the problem. However, after I was able to force all these
>devices onto their own separate IRQs (with the exception of one SCSI
>controller and the usb-ohci, which still share an IRQ), we are still
>having the problem.
>
>     One thing that I find strange is that I am not seeing the drbd
>messages in the logs as frequently, but my users are still getting
>several lockups a day. And then this morning I saw some things in the
>logs that didn't make any sense to me. Last night the mirroring and
>clustering on the backup server were turned off at around midnight so
>that the users would not continue to get locked up throughout the night.
>Then when I checked the logs this morning I was seeing this.
>
>heartbeat: 2005/12/08_00:05:44 info: Received shutdown notice from 'state2'.
>heartbeat: 2005/12/08_00:05:44 info: Resources being acquired from state2.
>heartbeat: 2005/12/08_00:05:44 info: Running /etc/ha.d/rc.d/status status
>heartbeat: 2005/12/08_00:05:44 info: Taking over resource group 14.84.14.149
>heartbeat: 2005/12/08_00:05:44 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys state1]
>heartbeat: 2005/12/08_00:05:44 info: Resource acquisition completed.
>heartbeat: 2005/12/08_00:05:45 info: /usr/lib/heartbeat/mach_down: nice_failback: acquiring foreign resources
>heartbeat: 2005/12/08_00:05:45 info: mach_down takeover complete.
>heartbeat: 2005/12/08_00:05:45 info: mach_down takeover complete for node state2.
>heartbeat: 2005/12/08_00:06:30 WARN: node state2: is dead
>heartbeat: 2005/12/08_00:06:30 info: Dead node state2 held no resources.
>heartbeat: 2005/12/08_00:06:30 info: Resources being acquired from state2.
>heartbeat: 2005/12/08_00:06:30 info: Link state2:eth1 10.1.13.3 dead.
>heartbeat: 2005/12/08_00:06:30 info: Running /etc/ha.d/rc.d/status status
>heartbeat: 2005/12/08_00:06:30 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys state1]
>heartbeat: 2005/12/08_00:06:30 info: Resource acquisition completed.
>heartbeat: 2005/12/08_00:06:30 info: Taking over resource group 14.84.14.149
>heartbeat: 2005/12/08_00:06:30 info: /usr/lib/heartbeat/mach_down: nice_failback: acquiring foreign resources
>heartbeat: 2005/12/08_00:06:30 info: mach_down takeover complete.
>heartbeat: 2005/12/08_00:06:30 info: mach_down takeover complete for node state2.
>
>     All of this was logged even though the mirroring and clustering were
>turned off last night at midnight. This may or may not be relevant; I'm
>not really sure, because I haven't seen it before when the backup server
>isn't mirroring or clustering.
>
>     If there is any additional information needed (which wouldn't surprise
>me) please don't hesitate to let me know. I will reply as quickly as
>possible. Any assistance in getting this matter resolved will be more
>than greatly appreciated.
>
>Thanks
>Alex Kerr
>Dice Corp.

    OK, so I noticed something strange with the partitioning on my
backup system, so I decided to reinstall the OS and drbd. It took a
little work to get the primary to replicate back to the newly
installed secondary, but I do have it replicating now. However, I am
concerned about this message I am now getting in /var/log/ha-log on
the primary. It comes in every 5 seconds. Does anyone know what this
is?

heartbeat: 2005/12/08_20:42:13 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:18 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:23 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:28 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:33 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:38 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:43 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:48 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:53 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
heartbeat: 2005/12/08_20:42:58 ERROR: should_drop_message: attempted replay attack [state2]? [curgen = 22]
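
In case anyone wants to check the 5 second interval against their own
ha-log, here is a minimal sketch in Python. It assumes each log entry sits
on a single line (not wrapped at 80 columns as in a mail client) and uses
the timestamp format shown above. The names parse_line and replay_intervals
are my own for illustration; they are not part of heartbeat.

```python
import re
from datetime import datetime

# Matches lines such as:
# heartbeat: 2005/12/08_20:42:13 ERROR: should_drop_message: ...
LINE_RE = re.compile(
    r"^heartbeat: (\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) (\w+): (.*)$"
)


def parse_line(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(m.group(1), "%Y/%m/%d_%H:%M:%S")
    return stamp, m.group(2), m.group(3)


def replay_intervals(lines):
    """Seconds between consecutive 'attempted replay attack' errors."""
    stamps = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, level, message = parsed
        if level == "ERROR" and "attempted replay attack" in message:
            stamps.append(stamp)
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


# First three entries from the log excerpt above, unwrapped.
sample = [
    "heartbeat: 2005/12/08_20:42:13 ERROR: should_drop_message: "
    "attempted replay attack [state2]? [curgen = 22]",
    "heartbeat: 2005/12/08_20:42:18 ERROR: should_drop_message: "
    "attempted replay attack [state2]? [curgen = 22]",
    "heartbeat: 2005/12/08_20:42:23 ERROR: should_drop_message: "
    "attempted replay attack [state2]? [curgen = 22]",
]

print(replay_intervals(sample))  # prints [5, 5]
```

To run it on a real log, pass the open file instead of the sample list,
e.g. replay_intervals(open("/var/log/ha-log")).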

Thanks
Alex Kerr
Dice Corp.