Note: "permalinks" may not be as permanent as we would like;
direct links to old sources may well be a few messages off.

/ 2004-02-01 14:36:47 -0500
\ Ward Horner:
> I have DRBD setup with heartbeat doing an NFS export. When I transfer files
> to the primary DRBD device I get the error (after about 1-2 minutes):
>
> drbd0: [kjournal/521] sock_sendmsg time expired, ko=42949672

This means that kjournal got stuck trying to send a block.
It is NOT an error, but a hint that you have high latency, and
obviously the primary writes faster than the secondary can cope.

You can configure a "ko-count", which defaults to 0 (and then wraps
around to 2**32-1 on the first count down, thus the high value).
As soon as ko-count reaches zero (it counts down once every ping-interval
while block communication is stalled), the connection is torn down,
and the primary goes "StandAlone".

Maybe you have some sort of distributed resource deadlock.
I understand that it is actually the secondary writing to the nfs share,
which is the drbd exported by the primary?

Is the drbd_asender_0 kernel thread still alive?
Does the system recover if you do a sync on the Secondary?
Or an emergency sync via SysRq?
(read about it in /usr/src/linux/Documentation/sysrq.txt)
Or on the Primary?

> This error then occurs continuously until I reboot the system. Also the
> system has locked out all user control. I can telnet in, but once telnet
> has established the connection, I cannot log in. Funny thing though is that
> heartbeat does not transfer over to the secondary system, until I reboot
> the primary. When I halt the primary and have the secondary takeover, I do
> NOT get the error. I only get the error on the primary.

The above paragraph is confusing. Please do not mix up node names,
preferred roles, and actual drbd states!

> I am using drbd-0.6.10+cvs. The latest file is drbd_main 1.88, Dec 22,
> 2003. I looked at the more current release, but did not find any changes
> that addressed this issue.
>
> I saw an email on Jan 8 from Andreas Schultz where he talks about a similar
> problem during synchronization, but I have not seen a resolution. However,
> there was mention to a "write_hint_dont_wait" patch.

which is in cvs since drbd_main.c 1.88.

> I am running DRBD on a Timesys Real-time Linux 4.0 system. This distro was
> based on Debian and uses the 2.4.18 kernel.

This might well be a distributed deadlock scenario related to your
"real-time" kernel.

> Any ideas what is going on here?
> I used drbd-0.6.6 and never saw this problem.

drbd 0.6.6 just ignored this and eventually got stalled completely in
the same situation WITHOUT logging it.

	Lars Ellenberg
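
P.S. To illustrate the ko-count wraparound described above, here is a toy model in Python. It is only a sketch, not DRBD code: it assumes the counter is an unsigned 32-bit integer that is decremented once per stalled ping-interval, and the names count_down and intervals_until_zero are made up for this example.

```python
MASK_32 = 0xFFFFFFFF  # assumption: the counter is an unsigned 32-bit integer


def count_down(ko):
    """Decrement the counter once, wrapping around like unsigned 32-bit arithmetic."""
    return (ko - 1) & MASK_32


def intervals_until_zero(ko_count, limit=1000):
    """Count stalled ping-intervals until the counter hits zero.

    Returns None if zero is not reached within `limit` intervals.
    Hypothetical helper, for illustration only.
    """
    ko = ko_count
    for n in range(1, limit + 1):
        ko = count_down(ko)
        if ko == 0:
            return n
    return None


if __name__ == "__main__":
    # Default ko-count of 0: the first count down wraps to 2**32-1.
    print(count_down(0))
    # A small configured value reaches zero after that many stalled intervals.
    print(intervals_until_zero(4))
    # With the default of 0, zero is not reached in any reasonable time.
    print(intervals_until_zero(0))
```

In this model a configured value of 4 reaches zero after four stalled intervals, while the default of 0 wraps around first and so effectively never triggers the teardown.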