[DRBD-user] disk-timeout actually in deciseconds?

Wed Nov 26 16:15:09 CET 2014

On 26.11.2014 14:27, Lars Ellenberg wrote:
>
> DRBD "logging" is simply a printk.
> Whether or not that makes it to stable storage via some syslog channel
> or not is no longer in control of DRBD.
> Especially if the storage in fact *did* have problems, I think it is
> very unlikely that any logging would have made it to disk on that box...

I don't think the storage ACTUALLY had a problem, other than possibly
being under high load. At least I cannot see anything wrong in the raid
controller or kernel logs. Besides, as I said, the syslog is on a
separate disk subsystem, presented by a different controller, using a
different driver. So I assume that even if some raid controller or disk
subsystem has a problem, it should still be possible to log to syslog
as long as the system has not crashed.

> Also: the disk-timeout option is *dangerous* and *may lead to kernel
> panic*.  So don't use it (unless you are *very* certain that you know
> what you are doing, and have a very good reason to do it).

I read that before and my intent is the following:

If a disk subsystem on the master is neither responding nor throwing i/o
errors, the master role should be transferred to the peer no matter what.
So I would rather accept a kernel panic in such a situation than wait
forever for an unresponsive disk subsystem, which would be less
acceptable in my opinion.

The problem in this situation was that I prepared the drbd config on the
assumption that a cluster manager would be installed and properly
configured to do all that, but in fact I did not have enough time in the
last maintenance window to apply my cluster configuration, because of
other problems that occurred.

So in this situation the disk-timeout does not make sense, as I risk the
system crashing with no one taking over. So I have removed the
disk-timeout setting for now, but still intend to use it later when the
cluster manager is in place.

But I still have to monitor this behaviour again in my test setup to
make sure I never reach a disk-timeout situation under normal working
conditions. As far as I can tell from my munin logs and from watching
iostat under high load, a volume should never be unresponsive for more
than 30s, at least as long as it does not ACTUALLY have a serious
problem.
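
A minimal sketch of that check, assuming disk-timeout is given in units
of 0.1 seconds (as the drbd.conf man page states) and that 0 means no
timeout. The function names and the sample latencies are made up for
illustration; they are not part of DRBD or of my monitoring setup.

```python
# Sketch only: assumes disk-timeout is expressed in deciseconds (0.1 s)
# and that a value of 0 disables the timeout. Names are hypothetical.

def seconds_to_disk_timeout(seconds):
    """Convert a timeout in seconds to a disk-timeout value in deciseconds."""
    if seconds < 0:
        raise ValueError("timeout must not be negative")
    return round(seconds * 10)


def exceeds_timeout(latencies_s, disk_timeout_ds):
    """Return the sampled latencies (in seconds) at or above the timeout."""
    if disk_timeout_ds == 0:  # 0 is assumed to mean "wait forever"
        return []
    limit_s = disk_timeout_ds / 10
    return [lat for lat in latencies_s if lat >= limit_s]


timeout = seconds_to_disk_timeout(30)  # the 30s limit mentioned above
print(timeout)
# Made-up latency samples in seconds, e.g. collected while watching iostat
print(exceeds_timeout([0.2, 4.5, 31.0], timeout))
```

Any sample that comes back from the second call would be a case where
the backing device was slow enough to trip the timeout.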

> Of course there may be bugs in our code, so if you should be able to
> reproduce "misbehaviour", let us know.

I will do testing with this again in my lab to see under which 
conditions the disk-timeout might be reached. Thank you for commenting.

regards, Felix