[DRBD-user] disk-timeout actually in deciseconds?

Wed Nov 26 11:26:13 CET 2014

I am currently investigating a problem with one drbd peer.

We are running 8.4.5 on this node and have disk-timeout 300 configured in 
global-properties.

On Monday I observed a failing primary but could not find any hint as to 
the reason.

Today I saw this in the log file:

Nov 26 07:56:33 philippus-arabs kernel: [118541.912505] block drbd10: 
Local backing device failed to meet the disk-timeout
Nov 26 07:56:33 philippus-arabs kernel: [118541.912516] block drbd10: 
disk( UpToDate -> Failed )
Nov 26 07:56:33 philippus-arabs kernel: [118541.912524] block drbd10: 
Local IO failed in request_timer_fn. Detaching...
Nov 26 07:56:33 philippus-arabs kernel: [118541.912597] block drbd10: 
local WRITE IO error sector 635915776+5 on sdb1
Nov 26 07:56:33 philippus-arabs kernel: [118541.913113] block drbd10: 
bitmap WRITE of 0 pages took 0 jiffies
Nov 26 07:56:33 philippus-arabs kernel: [118541.913121] block drbd10: 0 
KB (0 bits) marked out-of-sync by on disk bit-map.
Nov 26 07:56:33 philippus-arabs kernel: [118541.913131] block drbd10: 
disk( Failed -> Diskless )
Nov 26 07:56:33 philippus-arabs kernel: [118541.913819] block drbd10: 
receiver updated UUIDs to effective data uuid: C43F4D29D7F6A132
Nov 26 07:56:33 philippus-arabs kernel: [118541.915874] block drbd10: 
Should have called drbd_al_complete_io(, 635915776, 2560), but my Disk 
seems to have failed :(

I suspect the same thing happened on Monday, followed by a kernel 
panic. I did not see the screen, and the machine was reset afterwards.

However, I cannot find anything suspicious in the RAID controller log 
(no failing disk, no command timeout, no bus reset), and I have never seen 
I/O access times on this disk subsystem anywhere near 30 s. So I wonder 
whether I have really configured a disk-timeout of 30 s, or rather 3 s 
(which I could imagine occurring in normal operation under high I/O load, 
as this is a hard-disk-backed device that is sometimes under heavily random 
I/O pressure).
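
To make the two readings concrete, here is a minimal sketch of the unit 
arithmetic. It assumes the documented unit of 0.1 s; the helper name and 
the alternative unit of 0.01 s are purely illustrative and do not come 
from DRBD itself.

```python
# Unit check for disk-timeout. The function name is hypothetical and
# not part of any DRBD tooling; it only does the arithmetic.

def disk_timeout_seconds(value, units_per_second=10):
    """Convert a configured disk-timeout value to seconds.

    units_per_second=10 corresponds to the documented unit of 0.1 s
    (deciseconds).
    """
    return value / units_per_second

configured = 300  # the value from my configuration

# Reading 1: the documentation is right, the unit is 0.1 s.
print("documented unit:", disk_timeout_seconds(configured), "s")

# Reading 2 (assumption, unverified): the unit is effectively 0.01 s.
print("suspected unit: ", disk_timeout_seconds(configured, 100), "s")
```

Under the first reading the backing device would have had to stall for 
30 s, under the second only for 3 s.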

Unfortunately I have no other suspicious information from syslog or 
dmesg, especially not for the first crash; nothing was logged by drbd either.

So my questions are:

- is it conceivable that drbd detaches a device but is not able to log 
this because a kernel panic follows too quickly, or would drbd do this 
the other way round (log, detach, kernel panic)? (The syslog is on 
another backing device, attached to another disk controller.)

- is it possible that the documentation is wrong about this parameter 
being specified in deciseconds?

After re-attaching the device, the sync went through without problems and 
the node is up again, serving normally.

regards, Felix