I am currently investigating a problem with one drbd peer. We are running 8.4.5 on it and had configured disk-timeout 300 in global-properties. On Monday I observed a failing primary but could not see any hint as to the reason. Today I saw this in the log file:

Nov 26 07:56:33 philippus-arabs kernel: [118541.912505] block drbd10: Local backing device failed to meet the disk-timeout
Nov 26 07:56:33 philippus-arabs kernel: [118541.912516] block drbd10: disk( UpToDate -> Failed )
Nov 26 07:56:33 philippus-arabs kernel: [118541.912524] block drbd10: Local IO failed in request_timer_fn. Detaching...
Nov 26 07:56:33 philippus-arabs kernel: [118541.912597] block drbd10: local WRITE IO error sector 635915776+5 on sdb1
Nov 26 07:56:33 philippus-arabs kernel: [118541.913113] block drbd10: bitmap WRITE of 0 pages took 0 jiffies
Nov 26 07:56:33 philippus-arabs kernel: [118541.913121] block drbd10: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Nov 26 07:56:33 philippus-arabs kernel: [118541.913131] block drbd10: disk( Failed -> Diskless )
Nov 26 07:56:33 philippus-arabs kernel: [118541.913819] block drbd10: receiver updated UUIDs to effective data uuid: C43F4D29D7F6A132
Nov 26 07:56:33 philippus-arabs kernel: [118541.915874] block drbd10: Should have called drbd_al_complete_io(, 635915776, 2560), but my Disk seems to have failed :(

I suspect the same thing happened on Monday, followed by a kernel panic. I did not see the screen, and the machine was reset afterwards.

However, I cannot find anything suspicious in the RAID controller log (no failing disk, no command timeout, no bus reset), and I have never seen an I/O access time on this disk subsystem even close to 30s. So I wondered whether I have really configured a disk-timeout of 30s, or rather 3s. A 3s delay is something I could imagine occurring in normal operation under high I/O load, as this is a hard-disk-backed device which is sometimes under highly random I/O pressure.
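To make the unit question concrete, here is a minimal sketch of the setting. It assumes the option sits in the disk section of the common block; that placement is an assumption for illustration, not the actual configuration of this node. The conversion in the comment follows the drbd.conf man page, which gives the timeout in units of 0.1 seconds.

```
common {
    disk {
        # Units of 0.1 s according to drbd.conf(5):
        # 300 would mean 30 s, 30 would mean 3 s.
        # Placement in "common" is assumed for this sketch.
        disk-timeout 300;
    }
}
```

Running drbdadm dump should show the configuration as drbdadm parsed it, which would at least confirm the value that was actually read.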
Unfortunately I have no other suspicious information from syslog or dmesg, especially not for the first crash; nothing was logged by drbd either.

So my questions are:

- Is it conceivable that drbd detaches a device and cannot log this because a kernel panic follows too quickly, or would drbd do it the other way round (log, detach, kernel panic)? (The syslog is on another backing device, attached to another disk controller.)
- Is it possible that the documentation is wrong about this parameter being specified in deciseconds?

After attaching the device again, the sync went through without problems and the node is up again, serving normally.

regards,
Felix