[DRBD-user] [Q] What would cause fsck running on a drbd device to just stop?

Lars Ellenberg Lars.Ellenberg at linbit.com
Mon May 29 13:20:35 CEST 2006



/ 2006-05-28 18:39:21 -0400
\ Maurice Volaski:
> drbd-0.7.19 under kernel 2.6.17-rc4 is running on a primary node
> standalone. There are 8 resources in the same group. fsck.ext3 -fv
> is being run simultaneously on all of them. Each of the drbd devices
> are running on an lv, which all belong to a single pv. The actual
> "disk" is a hardware RAID connected via SCSI (i.e., the mpt driver).
> 
> Five of the fsck finished their tasks successfully and reported no
> problems. The remaining three got "stuck". There was no activity
> either on the physical RAID itself or listed in top. They were just
> listed as "D," uninterruptible sleep. Two of the fscks were at the
> end, giving the final summary information--no problems---for their
> respective filesystems and were stuck at that point. The last one
> was stuck in the first pass. Attempts to kill them failed; even kill
> -9 and attempting to shutdown were ignored. I rebooted manually and
> ran the three stuck ones again without a hitch.

you cannot "kill" "uninterruptible sleep".
that's just the point of being "uninterruptible".
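
A small sketch of how to see this for yourself, assuming a Linux /proc
that exposes the State, SigPnd and ShdPnd fields in /proc/<pid>/status.
The function name is made up for illustration. The signal is recorded as
pending but is not acted on while the process stays in state "D".

```shell
# Hypothetical helper: print the scheduler state and the pending-signal
# masks from a /proc/<pid>/status file given as the first argument.
show_state_and_pending() {
    # State:  e.g. "D (disk sleep)" for uninterruptible sleep
    # SigPnd: signals pending for this thread
    # ShdPnd: signals pending for the whole process
    grep -E '^(State|SigPnd|ShdPnd):' "$1"
}
```

Called as `show_state_and_pending /proc/<pid>/status` for a stuck fsck,
a set bit 0x100 in either mask would correspond to a pending SIGKILL
(signal 9).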

If you suspect a DRBD bug (you obviously do),

* _reproduce_,
  preferably with a non-rc-kernel
  if you cannot reproduce, we'll file it under "cosmic rays".
  if you can only reproduce with some rc-kernel, it is very likely
  that drbd is not the cause, but the trigger only:
  -> tell "them" about it.

once "easily" reproduced:
* is anything else "stuck"?
* what are the numbers in /proc/drbd?
* get the wchan of the stuck processes
  via top, or ps -eo pid,wchan:40,comm, or cat /proc/<pid>/wchan
    (not only the fsck, but everything that may be
    related, so fs-related kernel threads, vm-related
    kernel threads, drbd-related kernel-threads).
* get a backtrace of the stuck processes
  e.g. echo t > /proc/sysrq-trigger
  (again, everything that may be related)
  note that these are not necessarily reliable,
  but may be helpful to understand what's going on.
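
The checklist above could be gathered in one go roughly like this. It is
only a sketch: the function names and the default output directory
/tmp/drbd-hang are made up for illustration, it assumes a procps-style ps,
and it needs root with sysrq enabled for the backtrace part.

```shell
# Keep the ps header line plus every process whose STAT starts with "D"
# (uninterruptible sleep). Expects "ps -eo pid,stat,wchan:40,comm" on stdin.
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Hypothetical collector: writes the requested information into a directory.
collect_stuck_info() {
    outdir=${1:-/tmp/drbd-hang}   # assumed location, pick your own
    mkdir -p "$outdir" || return 1

    # the numbers in /proc/drbd
    cat /proc/drbd > "$outdir/proc-drbd.txt" 2>&1

    # wchan of everything in uninterruptible sleep, not only the fsck
    ps -eo pid,stat,wchan:40,comm | filter_dstate > "$outdir/dstate.txt"

    # task backtraces go to the kernel log; save the log afterwards
    echo t > /proc/sysrq-trigger
    dmesg > "$outdir/dmesg-sysrq-t.txt"
}
```

Running `collect_stuck_info` while the fscks are hung leaves three files
that can be attached to a report. If the kernel log buffer is small, the
sysrq output in dmesg may be truncated, so the syslog files are worth
checking as well.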

plus anything else that may be helpful to debug the problem.
you seem to have a habit of complaining on many lists
about many things; by now you should know that just saying
"it does not work" won't help in solving issues. And you have
probably been told many times already what is necessary for a
useful bug report.

as for the Question:
 [Q] What would cause fsck running on a drbd device to just stop?
 could be that you have just been too impatient.
 could be a process scheduler bug, could be an io-scheduler bug,
 could even be a hardware thing.
 could be an in-kernel memory allocation within the io-path without
 GFP_NOIO (such an allocation may itself wait on I/O to free memory,
 and so deadlock). this does not need to be in drbd, drbd just causes
 more memory pressure.
 could be... lots of things one can speculate about if all one has is
 "does not work".
 
cheers,

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :
__
please use the "List-Reply" function of your email client.


