[DRBD-user] Re: A big bug in datadisk

Dmitry Golubev dmitry at mikrotik.com
Tue May 11 13:14:18 CEST 2004

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello, 
> ah, you are referring to that three-liner patch from early May.
> this does nothing useful but adds some "echo > /var/log/drbd.log" just
> before every call of logger, for those who don't have the standard logger
> binary at hand to interface with syslog...

Well, at least that 'standard logger' is absent in Debian... Are you sure it 
is standard on all systems?
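
For what it's worth, the kind of fallback I had in mind is roughly this (just a
sketch from my side, not what the patch actually does; the helper name and log
path are only illustrative):

    # Sketch: log via logger when it exists, otherwise append to a plain file.
    log_msg() {
        if type logger >/dev/null 2>&1; then
            logger -t datadisk "$*"
        else
            echo "`date` datadisk: $*" >> /var/log/drbd.log
        fi
    }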

> if you don't give the option to the script, it is not active.
> you have to explicitly request "datadisk --dontkill stop" for it
> to not try to kill users of that device.
> so what?

OK, sorry, maybe I was patching the wrong place.

> "grep dontwait datadisk" shows exactly six matches, two of them
> assignments, two messages, and two part of an if, only one of them is
> interesting: it does not wait for the sync to be completed when trying
> to deactivate a device while synchronisation is in progress.
> the normal behaviour for "datadisk stop" would be to wait for the sync
> to complete before trying to deactivate the current Primary.

Thanks for the explanation.
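
So, if I understand correctly, the idea is roughly this (a sketch of my
understanding, not the actual datadisk code; the /proc/drbd state check is
only illustrative):

    # Sketch: without --dontwait, "datadisk stop" waits for resynchronisation
    # to finish before trying to deactivate the current Primary.
    if [ "$DONTWAIT" != "yes" ]; then
        while grep -q 'Syncing' /proc/drbd; do
            sleep 5
        done
    fi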

> > When will 0.7 be out? Anyway, I am not using heartbeat - I am using
> > Keepalived, and I suppose in that case they will also have to provide such
> > a script? In my opinion, DRBD is the right place to do low-level things
> > such as mounting/unmounting filesystems.
>
> not exactly, since DRBD is a block device.
> though not very common, there are other ways to use a block device
> than having a file system on it...

Ouch... Such as what?

And what about the _main_ uses of keepalived/heartbeat? Should they unmount 
filesystems? I think not.

> if you have it mounted, someone/something did the mount.
> so that same someone needs to do the umount before you deactivate the
> device. that's not at all DRBD's business.

Yes, they need to be unmounted. But then why do you need to kill processes? 
Shouldn't they be stopped by that same someone as well? Still, you rightly 
decided to add that option, and I think that to make the process complete, 
internal mounts should also be taken care of.

> that's why you have to start things in a certain order, and usually stop
> things in reverse order. you obviously need to properly pair starts and
> stops.

Well, then something else is broken, since my scripts are correct and they 
should unmount the filesystems. They have been checked and rechecked, and in 
my opinion, if they were incorrect, my system would not be able to start at 
all... Naive me :)))
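
Simplified, my scripts pair the starts and stops essentially like this (the
device path, mount point and service name are just examples, not my real
configuration):

    # start: bring up the DRBD device, then mount, then the service
    datadisk start
    mount /dev/nb0 /data
    /etc/init.d/some-service start

    # stop: exact reverse order
    /etc/init.d/some-service stop
    umount /data
    datadisk stop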

> be aware that *all* operations may hang or block, for whatever reason.
> so if you want to migrate services, and you need to stop them on the
> active node before you can start them on some other node, and the
> stopping hangs, you have a problem. *every* cluster manager needs to be
> able to cope with hanging resource scripts. typically the only way out
> of this is STONITH.

No, STONITH is way too brutal and entirely unacceptable for me, since I have 
two virtual servers running on two physical servers - normally each physical 
server is responsible for one of the logical servers, but if one of the 
physical ones dies, the second one takes over.

My opinion is that if we are trying to build a reliable cluster, we must deal 
with hanging processes in a way that the kernel and other subsystem developers 
would call incorrect: terminate those processes forcibly, even if they are in 
the 'D' uninterruptible state - but without powering the other node off.

Dmitry


