/ 2006-09-20 18:22:12 +0200
\ Maciej Bogucki:
> >in drbd:
> >you could play with "unplug-watermark" and "max-epoch-size" (and
> >possibly max-buffers).
> >when I say "play", I mean it. it could get better if you increase,
> >it could get better when you decrease, it could get better if you
> >adjust in opposite directions (where possible), and it could happen to
> >have no noticeable effect at all, which is all very dependent on your
> >lower level io subsystem and on network timings and ...
> I know that I can play with them, but there is another strange thing.
> When I disconnect the secondary node (shutdown heartbeat, and drbd) I get
> lags also. Also I don't have much traffic on the database (1256 writes per
> minute - so it's 20KB per second, and only a few reads per minute), so
> playing with "net" parameters is not necessary in my case. I think that
> it is a drbd bug or some stupid thing :)

what about al-extents?
how often do the bm: and al: numbers increase in /proc/drbd?
(watch -n1 cat /proc/drbd, and look at that)

> >>resource datafs {
> >>  protocol C;
> >>  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
> >I know the "halt -f" is in the example config, but you may want to
> >consider to write something like "sleep <verylargenumber>" or
> >"killall -9 heartbeat ccm ipfail" instead...
> But when I do like You write, there is a higher chance that I get split
> brain. When I do "halt -f" the chance is minimal.

no.
this callback gets triggered when drbd _is_ already Inconsistent (local
data set is invalid) and has no access to good data (is not connected to
a node with good data), but someone (or something) tells it to become
primary anyways, which should not have happened.

this may have been operator error. you will notice, and appreciate, that
the box won't just halt itself because you have entered the
"drbdadm primary" in the wrong xterm.

this may be the cluster manager's lack of knowledge and ignorance about
drbd's internal state.
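
to make the two suggested alternatives concrete, here is a rough sketch of how they might look in the quoted resource section. it only recombines the strings already quoted above; the sleep duration is an arbitrary placeholder, the rest of the resource definition is left out, and the syntax is assumed to match the drbd.conf dialect of the quoted fragment, so check it against your own version before using it.

```
resource datafs {
  protocol C;

  # from the example config: wall a warning, then halt the whole node
  # incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  # alternative 1: just block the promotion attempt, node stays reachable
  # (1000000 is a placeholder for "very large number")
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 1000000";

  # alternative 2: kill the cluster manager instead of the node
  # (only meaningful if the cluster manager is heartbeat)
  # incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; killall -9 heartbeat ccm ipfail";

  # (remaining sections of the resource omitted)
}
```

only one incon-degr-cmd line should be active at a time; the others are left commented out for comparison.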
the "sleep <largenumber>" would just block that operation
(which will fail anyways, but...).
the "killall" will kill the cluster manager, if it happens to be
heartbeat...
both leave the node itself accessible for the operator,
without having to power cycle it (remotely).

there are cases where the "halt -f" is useful there.
but there are more cases where it is more useful to leave the node in a
state where you can log in and fix things.

we invented that callback and put the "halt -f" in there when heartbeat
(note: heartbeat 1.2.x, x <= 3 and no vendor patches) was still
ignoring failures to start resources, and would still (try to) start
(likely dependent) resources. so without this you could end up with
running web servers but no database available, or a few/all resources
not running but the cluster ip assigned, or some such.

killing just heartbeat (resp. the cluster manager) instead of the
complete node would have more or less the same effect (depending on the
start order of resources in the ha.conf).

not doing anything but just failing the "drbddisk start" is the right
thing to do now, with heartbeat recognizing the failure and giving up the
resource group again.

I should write this into the faq or manpage sometime.

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82  :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe  http://www.linbit.com :
__
please use the "List-Reply" function of your email client.
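
as a follow-up to the "how often do the bm: and al: numbers increase" question earlier in this message: instead of eyeballing the watch output, one could diff two snapshots of /proc/drbd. the sketch below is only an illustration. the sample status text and the function names are made up for this example, and the exact /proc/drbd layout is an assumption; only the al: and bm: counter names come from the message itself. it also assumes a single drbd device.

```python
import re

# Illustrative snapshots, loosely modeled on a /proc/drbd status block.
# The layout and all numbers are assumptions for demonstration only.
SAMPLE_BEFORE = (
    " 0: cs:Connected st:Primary/Secondary ld:Consistent\n"
    "    ns:1200 nr:0 dw:1200 dr:340 al:17 bm:42 lo:0 pe:0 ua:0 ap:0\n"
)
SAMPLE_AFTER = (
    " 0: cs:Connected st:Primary/Secondary ld:Consistent\n"
    "    ns:1260 nr:0 dw:1260 dr:340 al:19 bm:42 lo:0 pe:0 ua:0 ap:0\n"
)

# Matches two-letter counter names followed by a number, e.g. "al:17".
# Textual fields such as "cs:Connected" are skipped because the value
# must be numeric.
COUNTER_RE = re.compile(r"\b([a-z]{2}):(\d+)\b")


def parse_counters(text):
    """Return a dict of numeric counters found in the status text."""
    return {name: int(value) for name, value in COUNTER_RE.findall(text)}


def counter_deltas(before, after, names=("al", "bm")):
    """Return how much each named counter grew between two snapshots."""
    old = parse_counters(before)
    new = parse_counters(after)
    return {name: new[name] - old[name] for name in names}


def read_status(path="/proc/drbd"):
    """Read the live status file; only works on a node running drbd."""
    with open(path) as status_file:
        return status_file.read()


# Demo on the made-up samples above.
print(counter_deltas(SAMPLE_BEFORE, SAMPLE_AFTER))  # {'al': 2, 'bm': 0}
```

on a real node you would call read_status() twice, a second or so apart, and pass both results to counter_deltas(). a steadily growing al: delta under light write load would be one hint that al-extents is worth looking at.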