Oops... Forgot to add that I was starting "drbdadm verify r0" during weekends (manually)... ;)

Yes, I know. I have to reproduce the error to be able to fix it. I can reproduce it, but not right now. I'll get a call very early tomorrow morning if I do ;) I have to wait for a service stop and/or install 2 more machines when I find the time to do that (by the way, DELL PowerEdge R200 quad core)...

This is not a "bug report", just an "experience report".

Online verify completes the task, and it took up to 24 hours (maybe more) before I saw any problems. The problem I saw was that the system was "degrading" in functionality:

1) "Working fine". Everything worked fine. Even verify was working.

2) "Hmm. Now, what? Timeout? Ahh, there it is!". The system suddenly stopped responding to ping (for example) for a while, or some services were just dead.

3) "Hmm... Refusing connections? But I'm logged in already on another SSH!". Ping was gone again. Completely gone. New SSH connections were refused. But I was logged in over an existing SSH connection, so the network was there. The mouse might work, but not the keyboard. X could have gone dead. In short, lots of strange things. When I saw this I never logged out of my SSH connection.

4) "Oops. That's not good! Reboot!". My SSH connection died. Keyboard lockouts. Blank screens (even when using X Windows). Sometimes a kernel panic.

This strange behaviour disappeared when I stopped using verify during weekends. That's the only difference.

I know it sounds lame and strange, and I know that it looks like I'm a complete newbie. But after replacing a lot of hardware, changing/reconfiguring/recompiling kernels and hours and hours of thinking, the only thing left to try was to stop using verify. It's been working for over a month now (before that, it crashed once a week, on Sunday or Monday).

I'll get back when I've installed two more machines or can reproduce the error...
/S

---------- Original Message -----------
From: Lars Ellenberg <lars.ellenberg at linbit.com>
To: drbd-user at linbit.com
Sent: Mon, 30 Jun 2008 15:38:45 +0200
Subject: Re: [DRBD-user] drbdadm verify all seems to produce false positives on ext3 and crash the server

> On Mon, Jun 30, 2008 at 03:27:50PM +0200, Stefan Löfgren wrote:
> > Hi again...
> > I feel the need to jump in here and share what I've found out so far...
> >
> > I have experienced the same kind of problem as you have. Crashes, keyboard
> > lockout, blank screens, kernel panics and stuff like that. For some strange
> > reason on Sundays (or Monday mornings)!
>
> just randomly pointing fingers here...
>
> but maybe you are running on top of some raid, and maybe that raid
> defaults to do a "check run" every so often, and when drbd tries to
> write the whole bitmap, quasi-synchronously, that just takes ages,
> because the raid array is so busy "resilvering" itself?
>
> > Well, I tried to explain in an earlier thread what was going on and failed
> > completely. Anyway, I said that I would leave the system as it is to prove
> > that it's the verify procedure that causes the system to go completely mad.
> >
> > Now, 4-5 weeks later (maybe more) there has been nothing fishy going on with
> > the system. Instead of having a crash every Monday morning (3 or 4 weeks in a
> > row) I have not seen anything since I stopped playing around with verify.
> >
> > So, my conclusion: Don't use verify. I want to, but it will crash the system.
> >
> > I was not able to get anything out of the logs and the kernel panic was
> > complete rubbish. Different panics, if there even was a panic. Mostly just a
> > freeze...
> >
> > I don't have the logs anymore, but the last message in the log for at least
> > one of the crashes was a message like:
> > "Re-transfer bitmap due to failed kmalloc" or something like that...
>
> that message would be "Writing the whole bitmap, due to failed kmalloc".
> and though the "failed kmalloc" may sound scary,
> it actually is not at all, and only bad wording.
> so this is no hint in any direction, whatsoever.
>
> > The message before that was telling me that the online verify was completed I
> > believe...
> >
> > So, my suggestion is stop looking for hardware errors. Instead stop using
> > "verify". That's what I did when I was unable to find a hardware error...
> >
> > The sad thing is that I can not reproduce this error except on my machines
> > that are in production environment right now...
>
> too bad.
> not reproducible is almost equivalent to not fixable.
> and it appears to work fine here.
>
> :(
>
> --
> : Lars Ellenberg                            http://www.linbit.com :
> : DRBD/HA support and consulting             sales at linbit.com :
> : LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
> : Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
> __
> please don't Cc me, but send to list -- I'm subscribed
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
------- End of Original Message -------