[DRBD-user] Re: drbd 0.7.21(kernel 2.6.17) hang problem

Wed Sep 20 18:36:38 CEST 2006

/ 2006-09-20 18:22:12 +0200
\ Maciej Bogucki:
> >in drbd:
> >you could play with "unplug-watermark" and "max-epoch-size" (and
> >possibly max-buffers).
> >when I say "play", I mean it.  it could get better if you increase,
> >it could get better when you decrease, it could get better if you
> >adjust in opposite directions (where possible), and it could happen to
> >have no noticeable effect at all, which is all very dependent on your
> >lower level io subsystem and on network timings and ...
> I know that I can play with them, but there is another strange thing. When I disconnect the secondary node (shut down
> heartbeat and drbd) I get lags as well. Also, I don't have much traffic on the database (1256 writes per minute, so it's
> 20KB per second, and only a few reads per minute), so playing with "net" parameters is not necessary in my case. I
> think that it is a drbd bug or some stupid thing :)

what about al-extents?
how often do the bm: and al: numbers increase in /proc/drbd?
(watch -n1 cat /proc/drbd, and look at that)
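
as a rough alternative to eyeballing the watch output, the two counters
could be pulled out with a small shell sketch like the one below. it
assumes that al: and bm: show up as whitespace-separated key:value tokens
in /proc/drbd, and it simply sums them over all devices. the function
names drbd_counters and watch_drbd_counters are made up for this example.

```shell
#!/bin/sh
# Sketch only: assumes /proc/drbd contains tokens such as "al:3" and "bm:7".

# Read /proc/drbd-style text on stdin and print the summed al: and bm: counters.
drbd_counters() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^al:[0-9]+$/) al += substr($i, 4)
            if ($i ~ /^bm:[0-9]+$/) bm += substr($i, 4)
        }
    } END { printf "al=%d bm=%d\n", al, bm }'
}

# Poll a status file and print a timestamped line whenever the sums change.
# $1: file to read (default /proc/drbd), $2: poll interval in seconds (default 1)
watch_drbd_counters() {
    file=${1:-/proc/drbd}
    interval=${2:-1}
    prev=
    while :; do
        cur=$(drbd_counters < "$file")
        if [ "$cur" != "$prev" ]; then
            echo "$(date '+%H:%M:%S') $cur"
        fi
        prev=$cur
        sleep "$interval"
    done
}
```

after sourcing this, "watch_drbd_counters /proc/drbd 1" would print a line
only when one of the two sums changes, which makes it easier to see how
often they increase.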

> >>resource datafs {
> >> protocol C;
> >> incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
> >I know the "halt -f" is in the example config, but you may want to
> >consider writing something like "sleep <verylargenumber>" or
> >"killall -9 heartbeat ccm ipfail" instead...
> But if I do as you suggest, there is a higher chance that I get split brain. When I do "halt -f" the chance is
> minimal.

no. this callback gets triggered when drbd _is_ already Inconsistent
(local data set is invalid) and has no access to good data (is not
connected to a node with good data), but someone (or something) tells it
to become primary anyway, which should not have happened.

this may have been operator error. you will notice, and appreciate, that
the box won't just halt itself because you have entered the
"drbdadm primary" in the wrong xterm.

this may be the cluster manager's lack of knowledge about, and ignorance
of, drbd's internal state. the "sleep <largenumber>" would just block that
operation (which will fail anyway, but...). the "killall" will kill the
cluster manager, if it happens to be heartbeat...

both leave the node itself accessible for the operator,
without having to power cycle it (remotely).
there are cases where the "halt -f" is useful there.  but there are more
cases where it is more useful to leave the node in a state where you can
log in and fix things.
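
to make the three options concrete, here is a minimal sketch of a handler
that incon-degr-cmd could point at. the function names, the mode names
(block, kill-cm, halt) and the sleep value 1000000000 are all assumptions
made up for this example, and the commands themselves are only the ones
already mentioned in this thread.

```shell
#!/bin/sh
# Sketch only: mode names and the sleep duration are illustrative assumptions.

# Print the command line that belongs to a given mode, without running it.
incon_degr_command() {
    case "$1" in
        block)   echo "sleep 1000000000" ;;                # block the promotion, node stays up
        kill-cm) echo "killall -9 heartbeat ccm ipfail" ;; # kill the cluster manager only
        halt)    echo "sleep 60 ; halt -f" ;;              # behaviour of the example config
        *)       return 1 ;;                               # unknown mode
    esac
}

# Warn logged-in users, then run the command for the chosen mode (default: block).
incon_degr_handler() {
    cmd=$(incon_degr_command "${1:-block}") || return 1
    echo '!DRBD! pri on incon-degr' | wall
    eval "$cmd"
}
```

a wrapper script that ends with incon_degr_handler "$1" could then be named
in the incon-degr-cmd line. keeping the mode-to-command mapping in its own
function means it can be checked without actually halting or blocking
anything.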

we invented that callback and put the "halt -f" in there when heartbeat
(note: heartbeat 1.2.x, x <= 3 and no vendor patches) was still ignoring
failures to start resources, and would still (try to) start (likely
dependent) resources. so without this you could end up with running web
servers but no database available, or a few/all resources not running
but the cluster ip assigned or some such.

killing just heartbeat (resp. the cluster manager) instead of the
complete node would have more or less the same effect (depending on the
start order of resources in the ha.conf).

doing nothing except failing the "drbddisk start" is the right thing
to do now that heartbeat recognizes the failure and gives up the
resource group again.

I should write this into the faq or manpage sometime.

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :
__
please use the "List-Reply" function of your email client.