On Tue, Apr 10, 2007 at 10:16:25AM -0600, Dmitry S. Makovey wrote:
> On Friday 06 April 2007 05:10, Lars Ellenberg wrote:
> > > More background info:
> > >
> > > Systems:
> > > redhat-release-3AS-13.7.3
> > >
> > > DRBD:
> > > drbd-0.7.17-17

and you really want to upgrade to .22 / .23
see http://svn.drbd.org/drbd/branches/drbd-0.7/ChangeLog

> > > drbd-km-2.4.21_40.ELsmp-0.7.17-17
> > > drbd-km-2.4.21_40.EL-0.7.17-17
> >
> > does the problem persist with 0.7.23 ?
> > can you reproduce this with a 2.6 kernel?
>
> unfortunately those two are production machines and I'm limited somewhat in
> my testing. However we are planning on "re-playing" the same scenario on
> testing boxes.
>
> > any kernel messages?
> > please describe "some weird state" in more detail.
>
> those are logs from an "induced" "weird state" (i.e. I intentionally set
> the secondary machine's firewall up with rules that cause trouble in the
> first run). The message from 10:38 is the one that floods the logs. the rest
> appears once or twice here and there. While the machine is in that state it
> doesn't respond to *anything* until the secondary is "down".
>
> Apr 4 10:38:14 XboxXkernel: drbd0: [kjournald/3861] sock_sendmsg time
> expired, ko = 4294967294

right. all as expected and explainable.

I recommend you set something like "ko-count = 7" in drbd.conf.
it defaults to "0", thus off.

what this is about:
drbd tries to send a data packet. that times out.
it would eventually give up, but with default tcp timeouts
you'd have to wait for about a week or so.

whenever a drbd data packet times out,
we send a "drbd-ping" packet on the other socket.
when that does not get answered either,
we drop the connection, and try to reconnect.

if the ping gets answered, we conclude the peer is still alive,
and we retry to get the data over. though at the same time
the "knock out counter" ko-count is decremented.
once the ko-count reaches zero, we conclude that, while still being able
to answer short network packets, the other node seems to have a broken
io-subsystem, and would stall the Primary. so we kick it out,
going StandAlone on the current Primary.

in your case it is not a broken io subsystem, but broken firewall rules.

you use a 2.4 kernel, which has only one global disk run queue.
that one blocks on drbd. everything requiring io can only be queued,
but not serviced. including library access for spawning a new shell
or writing a wtmp/utmp record, or updating the atime of something...

kernel 2.6 has (more or less) one run queue per device, so one device
should no longer block others or the full system from running.

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
__
please use the "List-Reply" function of your email client.
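
for illustration only, the timeout / ping / ko-count logic described in this
message can be modelled as a small Python sketch. this is a toy simulation,
not DRBD code: the names Outcome, send_with_ko_count, data_sent,
ping_answered and max_timeouts are all made up for this example, and
max_timeouts exists only so that the ko-count = 0 case terminates here
instead of looping (which is roughly what the flooding log message shows).

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    DELIVERED = "data packet got through"
    RECONNECT = "ping unanswered: drop the connection and try to reconnect"
    STANDALONE = "ko-count used up: kick the peer out, Primary goes StandAlone"
    STILL_BLOCKED = "simulation limit hit: Primary is still waiting on the peer"


def send_with_ko_count(
    data_sent: Callable[[], bool],      # hypothetical: True if the data packet got through
    ping_answered: Callable[[], bool],  # hypothetical: True if the peer answered the ping
    ko_count: int,                      # 0 means the knock-out counter is off
    max_timeouts: int = 1000,           # simulation-only cap, not a DRBD setting
) -> Outcome:
    remaining = ko_count
    for _ in range(max_timeouts):
        if data_sent():
            return Outcome.DELIVERED
        # The data packet timed out, so probe the peer on the other socket.
        if not ping_answered():
            return Outcome.RECONNECT
        # The peer answers pings but does not take data: retry the send,
        # and count the timeout against the peer if ko-count is enabled.
        if ko_count > 0:
            remaining -= 1
            if remaining == 0:
                return Outcome.STANDALONE
    return Outcome.STILL_BLOCKED


if __name__ == "__main__":
    never = lambda: False
    always = lambda: True
    # Firewall case from this thread: data is dropped, pings still get answered.
    print(send_with_ko_count(never, always, ko_count=7).name)  # STANDALONE
    print(send_with_ko_count(never, always, ko_count=0).name)  # STILL_BLOCKED
    # Peer completely unreachable: the ping fails too.
    print(send_with_ko_count(never, never, ko_count=7).name)   # RECONNECT
```

with ko_count=7 the model gives up on the peer after seven data timeouts;
with ko_count=0 it keeps retrying for as long as pings are answered.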