[DRBD-user] On-going problem with drbd locking up users.

Thu Dec 8 15:22:47 CET 2005

     First off, let me say that this is the first time I have been on a
mailing list like this, so if I am not following procedures correctly
please bear with me and let me know. I am going to provide as much
information about my problem as I can, so I apologize in advance for
the length of this post. My setup is two Intel SC5200 servers running
Fedora Core 1. The RPM I used to install drbd was
drbd-0.6.8-1.i386.rpm (I don't know if there is a newer one, but this
is the one we have tested on identical servers with the same OS). We
have many sites running this configuration with no problem. However, I
have one site that has been having problems for about 6 months.

     The problem we are having is that several times a day the users get
locked up momentarily. It usually lasts only a few seconds, and then
their session is restored. In the past, whenever this happened, I
consistently saw the following messages in /var/log/messages.

Dec  5 13:11:04 state1 kernel: drbd0: sock_sendmsg returned 0
Dec  5 13:11:04 state1 kernel: drbd0: Connection lost.
Dec  5 13:11:04 state1 kernel: drbd0: Connection established. size=71673996 KB / blksize=4096 B
Dec  5 13:11:04 state1 kernel: drbd0: Synchronisation started blks=15
Dec  5 13:11:38 state1 kernel: drbd0: Synchronisation done.
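
For anyone who wants to see how long each of these resyncs takes, here is a minimal sketch in Python. It assumes the kernel lines look exactly like the ones above, and the name resync_durations is only illustrative. It pairs each "Synchronisation started" line with the next "Synchronisation done" line for the same device and ignores resyncs that cross midnight.

```python
import re
from datetime import datetime

# Matches the drbd 0.6-style kernel lines quoted in this post (assumed format).
EVENT_RE = re.compile(
    r"^\w{3}\s+\d{1,2} (\d{2}:\d{2}:\d{2}) \S+ kernel: (drbd\d+): "
    r"Synchronisation (started|done)"
)

def resync_durations(lines):
    """Return (device, seconds) for each started/done pair found in lines."""
    started = {}    # device name -> time the current resync started
    durations = []
    for line in lines:
        m = EVENT_RE.match(line)
        if not m:
            continue
        clock, device, what = m.groups()
        t = datetime.strptime(clock, "%H:%M:%S")
        if what == "started":
            started[device] = t
        elif device in started:
            # Only the time of day is compared, so a resync that spans
            # midnight would be reported incorrectly.
            durations.append((device, (t - started.pop(device)).total_seconds()))
    return durations
```

Fed the two Synchronisation lines above, this reports 34 seconds for drbd0.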

In addition to that, I am quite often seeing the following in
/var/log/ha-log. However, I don't get this message every time I get
the ones above.

heartbeat: 2005/12/05_13:11:04 WARN: Late heartbeat: Node state1: interval 18050 ms
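
To check how often the two messages line up, something like the following Python sketch could be run over both logs. It is only a rough sketch: it assumes each log entry is on a single line in the formats shown above, the year has to be passed in because syslog does not record it, and the function names and the 30-second window are my own choices rather than anything from drbd or heartbeat.

```python
import re
from datetime import datetime, timedelta

# Both patterns assume the exact line formats quoted in this post.
LOST_RE = re.compile(
    r"^(\w{3})\s+(\d{1,2}) (\d{2}:\d{2}:\d{2}) \S+ kernel: drbd\d+: Connection lost\."
)
LATE_RE = re.compile(
    r"^heartbeat: (\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) WARN: "
    r"Late heartbeat: Node (\S+): interval\s+(\d+)\s+ms"
)

def drbd_disconnects(lines, year):
    """Timestamps of drbd 'Connection lost' lines (year supplied by caller)."""
    found = []
    for line in lines:
        m = LOST_RE.match(line)
        if m:
            month, day, clock = m.groups()
            stamp = f"{year} {month} {day} {clock}"
            found.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return found

def late_heartbeats(lines):
    """(timestamp, node, interval_ms) for each 'Late heartbeat' warning."""
    found = []
    for line in lines:
        m = LATE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y/%m/%d_%H:%M:%S")
            found.append((ts, m.group(2), int(m.group(3))))
    return found

def correlate(disconnects, heartbeats, window_s=30):
    """Pair each disconnect with the late heartbeats within window_s seconds."""
    window = timedelta(seconds=window_s)
    return [
        (d, [hb for hb in heartbeats if abs(hb[0] - d) <= window])
        for d in disconnects
    ]
```

A disconnect paired with an empty list is one of the cases where drbd lost the connection but heartbeat did not log a late heartbeat within the window.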

     I have tried several things to alleviate this problem, and so far
none of them has helped. I have reinstalled the OS and drbd on both
machines. We tried a new network card, but at the time we were only
able to do that on one server at a time (one server was always using
the original NIC). Recently someone suggested that the NICs were the
most likely cause and that I should check them along with the IRQs.
There was a serious problem (IMHO) with the interrupts: both the PCI
and the onboard SCSI controllers, both onboard NICs, and the usb-ohci
were all on the same interrupt. I thought for sure that this was what
was causing the problem. However, after I forced all of these devices
onto their own separate IRQs (with the exception of one SCSI
controller and the usb-ohci, which still share an IRQ), we are still
having the problem.
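
For checking the IRQ assignments on other machines, a small Python sketch along these lines lists every interrupt that has more than one device on it. It assumes the older /proc/interrupts layout (one count column per CPU, then one controller column, then a comma-separated device list); newer kernels add columns and would need a different parser. The name shared_irqs is just illustrative.

```python
def shared_irqs(text):
    """Map IRQ number -> device list for IRQs used by more than one device.

    Assumes the layout: header row of CPU names, then rows of
    'IRQ: count-per-CPU... controller dev1, dev2, ...'.
    """
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())    # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue    # skip NMI, LOC, ERR and similar summary rows
        # Split off the per-CPU counts and the controller name; whatever
        # remains is the comma-separated device list.
        fields = rest.split(None, ncpu + 1)
        if len(fields) < ncpu + 2:
            continue
        devices = [d.strip() for d in fields[ncpu + 1].split(",") if d.strip()]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared
```

It can be run on a live system by passing it the contents of /proc/interrupts, for example shared_irqs(open("/proc/interrupts").read()).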

     One thing that I find strange is that I am no longer seeing the drbd
messages in the logs as frequently, but my users are still getting
several lockups a day. Then this morning I saw some things in the
logs that didn't make any sense to me. Last night the mirroring and
clustering on the backup server was turned off at around midnight so
that the users would not continue to get locked up throughout the
night. When I checked the logs this morning, I saw this.

heartbeat: 2005/12/08_00:05:44 info: Received shutdown notice from 'state2'.
heartbeat: 2005/12/08_00:05:44 info: Resources being acquired from state2.
heartbeat: 2005/12/08_00:05:44 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2005/12/08_00:05:44 info: Taking over resource group 14.84.14.149
heartbeat: 2005/12/08_00:05:44 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys state1]
heartbeat: 2005/12/08_00:05:44 info: Resource acquisition completed.
heartbeat: 2005/12/08_00:05:45 info: /usr/lib/heartbeat/mach_down: nice_failback: acquiring foreign resources
heartbeat: 2005/12/08_00:05:45 info: mach_down takeover complete.
heartbeat: 2005/12/08_00:05:45 info: mach_down takeover complete for node state2.
heartbeat: 2005/12/08_00:06:30 WARN: node state2: is dead
heartbeat: 2005/12/08_00:06:30 info: Dead node state2 held no resources.
heartbeat: 2005/12/08_00:06:30 info: Resources being acquired from state2.
heartbeat: 2005/12/08_00:06:30 info: Link state2:eth1 10.1.13.3 dead.
heartbeat: 2005/12/08_00:06:30 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2005/12/08_00:06:30 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys state1]
heartbeat: 2005/12/08_00:06:30 info: Resource acquisition completed.
heartbeat: 2005/12/08_00:06:30 info: Taking over resource group 14.84.14.149
heartbeat: 2005/12/08_00:06:30 info: /usr/lib/heartbeat/mach_down: nice_failback: acquiring foreign resources
heartbeat: 2005/12/08_00:06:30 info: mach_down takeover complete.
heartbeat: 2005/12/08_00:06:30 info: mach_down takeover complete for node state2.

     These entries appeared even though the mirroring and clustering had
been turned off last night at around midnight. This may or may not be
relevant. I'm not really sure, because I haven't seen it before when
the backup server isn't mirroring or clustering.

     If any additional information is needed (which wouldn't surprise
me), please don't hesitate to let me know. I will reply as quickly as
possible. Any assistance in getting this matter resolved will be
greatly appreciated.

Thanks
Alex Kerr
Dice Corp.