Team,

I have been working with the NFS folks for a bit now and am experiencing something strange which causes NFS clients to hang when attempting to use heartbeat/drbd/knfsd as an HA NFS server. I no longer believe this is an NFS issue.

Setup:
  Machine A: primary drbd and HA machine (SMP machine)
  Machine B: secondary drbd and HA machine (UP machine)
  Machine C: NFS client using bonnie++ as a test for I/O activity (parameters: -f -s 100 -n 1 -r 0)

All machines are Debian 4.0 etch with a vanilla 2.6.22-rc2 kernel plus patches for known NFS issues.

When using machine A as a standalone server without drbd, just a flat disk partition with an ext3 filesystem exported via KNFSD, everything is wonderful. Machine C can use it just fine.

When using machine B as a standalone server without drbd, a flat disk partition with an ext3 filesystem exported via KNFSD, it works as well. Machine C can use it just fine.

When setting up machines A and B with drbd (version 8.0.3) and serving the drbd ext3 filesystem over KNFSD, strange hanging behavior is observed: machine C hangs waiting on a read from the NFS server, while nothing is happening on machine A (the primary). The interesting thing is that when you create local I/O on machine A with a local bonnie++ run, the hanging suddenly goes away. It looks to me like something in the drbd kernel module is falling asleep and thus not letting KNFSD perform the requested action.

Now here's the best (worst) part of the whole exercise: when you stop heartbeat on machine A and fail over to machine B, the failover is successful and machine C (the client) also fails over OK. At this point, starting the bonnie++ test program works!

So I have been asking myself, "What is the difference between machines A and B that would cause the issue only on machine A?" The network adapters on both A and B are Realtek r8169-style adapters.
The biggest difference I can think of is that machine A is an SMP machine, an Athlon 64 X2 (running an i386 kernel due to support issues with the 64-bit kernel), while machine B is an older Athlon Slot A processor.

I have attached a triggered sysrq from machine A, taken while machine C was hanging. I'm hoping someone on the list can tell me what is going wrong with the drbd process and how to proceed to fix this issue. I have these machines available for any relevant tests.

Please help. Thanks!

-------------- next part --------------
A non-text attachment was scrubbed...
Name: server-sysrq.txt.gz
Type: application/gzip
Size: 11318 bytes
Desc: pat960968380
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20070525/63d26479/attachment.bin>