Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 26.11.2014 14:27, Lars Ellenberg wrote:
>
> DRBD "logging" is simply a printk.
> Whether or not that makes it to stable storage via some syslog channel
> or not is no longer in control of DRBD.
> Especially if the storage in fact *did* have problems, I think it is
> very unlikely that any logging would have made it to disk on that box...

I don't think the storage ACTUALLY had a problem beyond possibly being
under high load; at least I cannot tell that anything was wrong from the
RAID controller or kernel logs. Besides that, as I said, the syslog lives
on a separate disk subsystem, presented by a different controller and
using a different driver, so I assume that even if some RAID controller
or disk subsystem has a problem, it should still always be possible to
log to syslog as long as the system has not crashed (see the P.S. below).

> Also: the disk-timeout option is *dangerous* and *may lead to kernel
> panic*. So don't use it (unless you are *very* certain that you know
> what you are doing, and have a very good reason to do it).

I had read that before, and my intent is the following: if a disk
subsystem on the master is neither responding nor throwing I/O errors,
the master role should be transferred to the peer no matter what. So I
would rather accept a kernel panic in such a situation than wait forever
for an unresponsive disk subsystem, which to me would be the less
acceptable outcome.

The problem in this case was that I had prepared the DRBD config for a
cluster manager that was installed and properly configured to handle all
of that, but I did not have enough time in the last maintenance window to
apply my cluster configuration because of other problems that came up. In
that situation the disk-timeout does not make sense, since I risk the
system crashing with no one taking over. I have therefore removed the
disk-timeout setting for now, but I still intend to use it later, once
the cluster manager is in place (a sketch of the intended disk section is
below my signature).

I still have to monitor this behaviour again in my test setup to make
sure I never hit a disk-timeout situation under normal working
conditions. But as far as I can tell from my Munin logs and from watching
iostat under high load, a volume should never be unresponsive for more
than 30 s, at least as long as it does not ACTUALLY have a serious
problem.

> Of course there may be bugs in our code, so if you should be able to
> reproduce "misbehaviour", let us know.

I will test this again in my lab to see under which conditions the
disk-timeout might be reached.

Thank you for commenting.

regards,
Felix
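
P.S. For reference, the syslog setup I mean is nothing more than a rule
along these lines; the drop-in file name and the log path are only an
example, not my exact configuration:

    # /etc/rsyslog.d/kern.conf (sketch): /var/log here sits on a mount
    # backed by the second controller, not by the DRBD backing storage
    kern.*    /var/log/kern.log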
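
The disk section I intend to re-enable once the cluster manager is in
place would look roughly like this; the values are only an example, not a
recommendation (if I read the manual correctly, disk-timeout is given in
tenths of a second, so 300 means 30 s):

    resource r0 {
      disk {
        disk-timeout 300;    # treat the backing device as dead after 30 s
        on-io-error  detach; # detach on real I/O errors
      }
      # volumes, net and connection settings omitted
    }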
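
And the responsiveness check under load is nothing more sophisticated
than watching the extended device statistics, e.g.:

    # extended per-device statistics every 5 seconds; I mainly watch the
    # await (average I/O wait) and %util columns of the backing device
    iostat -x 5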