On Wed, Nov 26, 2014 at 11:26:13AM +0100, Felix Zachlod wrote:
> I am currently investigating a problem with one DRBD peer.
>
> We are running 8.4.5 on this and had configured disk-timeout 300 in
> global-properties.
>
> On Monday I observed a failing primary but could not see any hints
> for the reason.
>
> Today I saw this in the log file:
>
> Nov 26 07:56:33 philippus-arabs kernel: [118541.912505] block drbd10: Local backing device failed to meet the disk-timeout
> Nov 26 07:56:33 philippus-arabs kernel: [118541.912516] block drbd10: disk( UpToDate -> Failed )
> Nov 26 07:56:33 philippus-arabs kernel: [118541.912524] block drbd10: Local IO failed in request_timer_fn. Detaching...
> Nov 26 07:56:33 philippus-arabs kernel: [118541.912597] block drbd10: local WRITE IO error sector 635915776+5 on sdb1
> Nov 26 07:56:33 philippus-arabs kernel: [118541.913113] block drbd10: bitmap WRITE of 0 pages took 0 jiffies
> Nov 26 07:56:33 philippus-arabs kernel: [118541.913121] block drbd10: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> Nov 26 07:56:33 philippus-arabs kernel: [118541.913131] block drbd10: disk( Failed -> Diskless )
> Nov 26 07:56:33 philippus-arabs kernel: [118541.913819] block drbd10: receiver updated UUIDs to effective data uuid: C43F4D29D7F6A132
> Nov 26 07:56:33 philippus-arabs kernel: [118541.915874] block drbd10: Should have called drbd_al_complete_io(, 635915776, 2560),
>
> I suspect this might have happened on Monday with a subsequent kernel
> panic; I did not see the screen, and the machine was reset after
> this.
>
> But I cannot find anything suspicious in the RAID controller
> log (no failing disk, no command timeout, no bus reset) and have
> never seen any I/O access time on this disk subsystem even near 30s,
> so I wondered whether I really have configured 30s of disk-timeout or
> rather 3s (which I could imagine occurring in normal operation in a
> high I/O situation, as this is a hard disk backed device which
> sometimes is under highly random I/O pressure).
>
> Unfortunately I have no other suspicious information from syslog or
> dmesg, especially not for the first crash; nothing logged from DRBD
> either.
>
> So my questions are:
>
> - is it conceivable that DRBD detaches a device and is not
> able to log this information because a kernel panic happens too fast
> afterwards, or would DRBD do this the other way round (log, detach,
> kernel panic)? (The syslog is on another backing device, attached to
> another disk controller.)

DRBD "logging" is simply a printk.
Whether that makes it to stable storage via some syslog channel
is no longer under DRBD's control.

Especially if the storage in fact *did* have problems,
I think it is very unlikely that any logging would have made it to disk
on that box...

Also: the disk-timeout option is *dangerous*
and *may lead to kernel panic*.

So don't use it (unless you are *very* certain that you know
what you are doing, and have a very good reason to do it).

More details in the man pages of drbd.conf and drbdsetup.

> - is it possible that the documentation is wrong about this param
> being specified in deciseconds?

No, the documentation is *right*, and documents correctly
that the unit of this parameter is 0.1 seconds,
1/10 of a second, or 100ms (all the same).
At least everywhere I looked.

Of course there may be bugs in our code, so if you are able to
reproduce "misbehaviour", let us know.

> After attaching the device, the sync went through without problems and
> the node is up again, serving normally.
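
To make the unit arithmetic above concrete, here is a minimal sketch in Python. The helper name disk_timeout_seconds is made up for illustration and is not part of DRBD or its tools; it only encodes the documented unit of 0.1 seconds per step.

```python
# Hypothetical helper, not part of DRBD: converts a disk-timeout
# setting (documented unit: 0.1 seconds) into seconds.
DECISECONDS_PER_SECOND = 10


def disk_timeout_seconds(value: int) -> float:
    """Return the disk-timeout setting expressed in seconds."""
    if value < 0:
        raise ValueError("disk-timeout must not be negative")
    return value / DECISECONDS_PER_SECOND


if __name__ == "__main__":
    # The value from the original report.
    print(disk_timeout_seconds(300))  # 30.0
    # The value that would be needed for a 3 second timeout.
    print(disk_timeout_seconds(30))   # 3.0
```

So a configured value of 300 corresponds to 30 seconds, and a 3 second timeout would have required a value of 30.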

-- 
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA and Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed