Note: "permalinks" may not be as permanent as we would like;
direct links to old messages may well be a few messages off.
/ 2004-05-10 22:56:37 +0300 \ Dmitry Golubev:
> > I'll just bounce it to the list for now.
> > And, I don't see where the *bug* is?
>
> OK, subscribing to the list...
>
> > > Today I've found a really big bug in datadisk which causes HA cluster
> > > locking. During the procedure it tries to unmount the DRBD partition:
> > > it first tries just to unmount, then tries to kill all users which
> > > access the partition (please enable this option by default, as in the
> > > patch I sent you recently - it really helps). Yes?
> >
> > You do want to "put this option" what? on? off? what option?
> > The mail you mention obviously did not reach me.
>
> Message resent at your request - please notify me if you have not
> received it yet.

ah, you are referring to that three-liner patch from early May.
it does nothing useful but adds some "echo > /var/log/drbd.log" just
before every call of logger, for those who don't have the standard
logger binary at hand to interface with syslog...

> The option I am referring to is dontkill, which I want to be removed
> from the default configuration (I am not sure what dontwait is for,
> maybe it also should be disabled?), so people do not face problems when
> a Keepalived HA 'cluster' starts working very mysteriously (well, it
> would take too long to retell the whole story of what I mean by
> mysteriously :)

if you don't give the option to the script, it is not active.
you have to explicitly request "datadisk --dontkill stop" for it to not
try to kill users of that device. so what?

"grep dontwait datadisk" shows exactly six matches: two of them
assignments, two messages, and two part of an if, only one of which is
interesting: it does not wait for the sync to be completed when trying
to deactivate a device while synchronisation is in progress.
the normal behaviour for "datadisk stop" is to wait for the sync to
complete before trying to deactivate the current Primary.
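[archive note: the "try to unmount, then kill users unless --dontkill was
given" behaviour discussed above could be sketched roughly as below. This
is a hypothetical illustration, not the actual datadisk script; the device
node, function name, and use of fuser are assumptions.]

```shell
# Hypothetical sketch of the datadisk "stop" umount logic (NOT the real
# script). $1 = device node (e.g. /dev/nb0), $2 = "yes" if --dontkill
# was requested, "no" otherwise.
stop_datadisk() {
    dev=$1
    dontkill=$2
    # first, just try to unmount
    if ! umount "$dev" 2>/dev/null; then
        # plain umount failed: unless --dontkill was given,
        # kill all processes using the mounted filesystem, then retry
        if [ "$dontkill" = no ]; then
            fuser -k -m "$dev"
            umount "$dev"
        fi
    fi
}
```

With "datadisk --dontkill stop" the second branch is skipped, so users of
the device are left alone and the umount simply fails.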
> > > But in case something is mounted there (or bind-mounted with
> > > mount --bind), the partition won't unmount even if you kill every
> > > single process in the system! I will send you a patch tomorrow which
> > > tries to unmount all internal mounts.
> >
> > Thanks.
> >
> > The "right" way to go about this seems to be: only have a drbddisk
> > script (as with 0.7) which *only* activates/deactivates (makes
> > Primary/Secondary) the device, and use the Filesystem resource script
> > provided by heartbeat to mount/umount.
>
> When will 0.7 be out? Anyway, I am not using heartbeat - I am using
> Keepalived, and I suppose in that case they also will have to provide
> that script? In my opinion, DRBD is the right place to do low-level
> things like mounting/unmounting filesystems.

not exactly, since DRBD is a block device. though not very common, there
are other ways to use a block device than having a file system on it...

> It is really good to separate that really huge script into a few nice
> pieces, but please leave them in DRBD.

we have a script to configure the devices, which is intended to be used
as an init script, and one to activate/deactivate a certain device,
which is intended to be used by the cluster manager, and provides
basically only a mapping from "drbd resource name" to "drbd device
node". I don't think that anything more should be added to the drbd
scripts.

> As for unmounting bind-mounted partitions, I feel the only possible way
> (at least I am not aware of any other...) is to scan through the
> current 'mount' table, and grep and awk the needed mount points from
> there.

if you have it mounted, someone or something did the mount. so that same
someone needs to do the umount before you deactivate the device.
that's not at all DRBD's business. that's why you have to start things
in a certain order, and usually stop them in reverse order. you
obviously need to properly pair starts and stops.
be aware that *all* operations may hang or block, for whatever reason.
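[archive note: the "scan the mount table, grep and awk the needed mount
points" idea mentioned above might look roughly like this. A hypothetical
sketch: the function name is invented, and it assumes the /proc/mounts
format of "device mountpoint fstype options ..." on stdin.]

```shell
# List every mount point at or below $1, deepest first, so they could be
# unmounted in a safe order. Mount table lines are read from stdin, e.g.:
#     submounts /data < /proc/mounts
submounts() {
    # $2 of each line is the mount point; match the directory itself or
    # anything under it, then reverse-sort so submounts come first.
    awk -v top="$1" '$2 == top || index($2, top "/") == 1 { print $2 }' \
        | sort -r
}
```

Note this only finds the mount points; whether anything should umount them
automatically is exactly the question Lars answers in the negative.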
so if you want to migrate services, and you need to stop them on the
active node before you can start them on some other node, and the
stopping hangs, you have a problem. *every* cluster manager needs to be
able to cope with hanging resource scripts. typically the only way out
of this is STONITH.

	Lars Ellenberg