[DRBD-user] Possible issue with DRBD and KNFSD

John Frisk john_a_frisk at yahoo.com
Fri May 25 09:41:15 CEST 2007


Team,
I have been working with the NFS folks for a while now
on a strange problem: NFS clients hang when using
heartbeat/drbd/knfsd as an HA NFS server.  I no longer
believe this is an NFS issue.

Setup:
Machine A: primary drbd and HA machine (SMP machine)
Machine B: secondary drbd and HA machine (UP machine)
Machine C: NFS client running bonnie++ to generate I/O
activity (parameters: -f -s 100 -n 1 -r 0)
All machines run Debian 4.0 (etch) with a vanilla
2.6.22-rc2 kernel plus patches for known NFS issues
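
For anyone who wants to reproduce the client load, here is a minimal sketch that builds the bonnie++ command line from the parameters listed above. The -d option and the /mnt/nfs mount point are assumptions added for illustration; the test directory used on machine C is not stated in this message.

```python
import shlex

# Parameters exactly as quoted in the setup above.
BONNIE_PARAMS = ["-f", "-s", "100", "-n", "1", "-r", "0"]


def bonnie_command(test_dir):
    """Return the bonnie++ argument list for a run inside test_dir.

    test_dir is hypothetical: it should be the NFS mount on the client.
    """
    return ["bonnie++", "-d", test_dir] + BONNIE_PARAMS


def bonnie_shell_line(test_dir):
    """Return the same command as a single, safely quoted shell line."""
    return shlex.join(bonnie_command(test_dir))
```

Calling bonnie_shell_line("/mnt/nfs") gives a line that can be pasted into a shell on the client; the argument list form can be passed to subprocess.run instead.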

When using machine A as a standalone server without
drbd, just a flat disk partition with ext3 filesystem
exported out via KNFSD, everything is wonderful. 
Machine C can use it just fine.

When using machine B as a standalone server without
drbd, a flat disk partition with ext3 filesystem
exported out via KNFSD, it works as well.  Machine C
can use it just fine.

When Machine A and B are set up with drbd (version
8.0.3) and serve the drbd-backed ext3 filesystem over
KNFSD, machine C hangs waiting for a read from the NFS
server, while nothing at all is happening on machine A
(the primary).  The interesting thing is that as soon
as you create local I/O via a local bonnie++ run, the
hang goes away.  It looks to me like something in the
drbd kernel module is falling asleep and thus not
letting KNFSD perform the requested action.
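
One way to test the "falling asleep" theory would be to compare two snapshots of /proc/drbd taken a few seconds apart while the client is hung. The sketch below is only an illustration: it assumes a single drbd resource and the counter names that DRBD 8.0 prints in /proc/drbd (ns, nr, dw, dr, al, bm, lo, pe, ua, ap), and the "stuck" rule is my own heuristic, not something taken from the drbd documentation.

```python
import re

# Counter names as printed for a device in /proc/drbd by DRBD 8.0
# (assumption: other versions may print a different set).
COUNTERS = ("ns", "nr", "dw", "dr", "al", "bm", "lo", "pe", "ua", "ap")


def parse_counters(text):
    """Collect 'name:number' tokens for the known counters into a dict."""
    found = re.findall(r"\b([a-z]+):(\d+)\b", text)
    return {name: int(value) for name, value in found if name in COUNTERS}


def read_counters(path="/proc/drbd"):
    """Read and parse the counters from the primary (single resource assumed)."""
    with open(path) as f:
        return parse_counters(f.read())


def stuck(before, after):
    """Heuristic: requests are pending, yet nothing was sent or written."""
    pending = after.get("pe", 0) > 0 or after.get("ap", 0) > 0
    idle = all(before.get(k) == after.get(k) for k in ("ns", "dw"))
    return pending and idle
```

On machine A this would be used as: take read_counters(), wait a few seconds during the hang, take it again, and pass both to stuck(). A True result would suggest requests are sitting inside drbd rather than in KNFSD.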

Now here's the best (worst) part of the whole exercise:
when you stop heartbeat on machine A and fail over to
machine B, the failover succeeds and machine C (the
client) also fails over OK.  At that point, starting
the bonnie++ test program works!

So I have been asking myself, "What is the difference
between machine A and B that would cause the issue
only on machine A?"  The network adapters on both A &
B are Realtek r8169-style adapters.  The biggest
difference I can think of is that machine A is an SMP
machine, an Athlon 64 X2 (running an i386 kernel due to
support issues with the 64-bit kernel), while machine B
is an older Slot A Athlon.

I have attached sysrq output captured on machine A
while machine C was hanging.  I'm hoping someone on
the list can tell me what is going wrong with the drbd
process and how to proceed to fix this issue.  I have
these machines available for any relevant tests.
Please help.
Thanks!
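
P.S. For anyone who wants to capture a comparable dump on their own server, a minimal sketch follows. It assumes the task-list dump (sysrq key "t"); which key was used for the attached file is not stated here. It must run as root on a kernel built with magic sysrq support, and the output lands in the kernel log.

```python
def trigger_sysrq(key="t", path="/proc/sysrq-trigger"):
    """Write a single sysrq command character to the trigger file.

    key="t" asks the kernel to dump the state of all tasks to the
    kernel log. The default key is an assumption for illustration.
    """
    if len(key) != 1:
        raise ValueError("sysrq key must be a single character")
    with open(path, "w") as f:
        f.write(key)
```

After calling trigger_sysrq() during a hang, the dump can be read back with dmesg or from the syslog files.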

-------------- next part --------------
A non-text attachment was scrubbed...
Name: server-sysrq.txt.gz
Type: application/gzip
Size: 11318 bytes
Desc: pat960968380
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20070525/63d26479/attachment.bin>

