[DRBD-user] DRBD I/O problems & corrupted data

Saso Slavicic saso.linux at astim.si
Mon Jan 26 16:11:26 CET 2015


Hello,

Thank you for taking the time to review my (rather lengthy) description.

> From your config below:
>>                 local-io-error   "/usr/lib/drbd/notify-io-error.sh; drbdadm detach $DRBD_RESOURCE";
>
> VERY bad idea.
> Synchronously calling drbdadm from inside a synchronous handler will block
> (until that drbdadm will eventually timeout, 121 seconds later).
> And it is absolutely useless: DRBD will detach after local io error all by
> itself.

This is not really obvious from the documentation:

on-io-error handler

handler is taken, if the lower level device reports io-errors to the upper
layers.
handler may be pass_on, call-local-io-error *or* detach.

If on-io-error is set to call-local-io-error, DRBD will also detach?
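For reference, what I understand the recommended setup to be (a sketch only, not a tested config; the notify script path is the one from my config above):

```
resource r0 {
    disk {
        # DRBD detaches from the failing backing device on its own;
        # no drbdadm call from any handler is needed.
        on-io-error detach;
    }
    handlers {
        # Notification only -- no DRBD state changes from handlers.
        local-io-error "/usr/lib/drbd/notify-io-error.sh";
    }
}
```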

On a similar note, I've also been abusing the out-of-sync handler pretty
much the same way: to issue a disconnect and reconnect on out of sync.
How would DRBD need to be configured to automatically do a
disconnect/reconnect after verify has found out-of-sync blocks?
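What I have now is effectively the following (a sketch of my own setup, adjusted per the advice above -- the subshell and trailing '&' are my guess at how to avoid blocking the worker, not something from the docs):

```
handlers {
    # If a disconnect/reconnect after verify is wanted at all,
    # at least run it detached so the drbd worker thread that
    # invokes the handler is not blocked by drbdadm.
    out-of-sync "(drbdadm disconnect $DRBD_RESOURCE; drbdadm connect $DRBD_RESOURCE) &";
}
```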

I don't really remember where I got the idea to use the handlers, but a
quick Google shows this:

"The reason of having out-of-sync handler is exactly to provide possibility
of automation."

Perhaps the documentation should state more strongly that DRBD state must not
be changed from within the handlers?

> I would expect that you have a few
>             "INFO: task drbd_w_... blocked for more than 120 seconds."
> and then call traces in the kernel log as well?

Not really, but I just found out that my hung_task_warnings is set to 0. No
idea why, as I can't remember setting it, and a search in /etc doesn't find
anything related to it. I'll need to investigate more.

> There.
> This is actually something we may need to fix.
> But that's in fact a multiple error scenario including misconfiguration we
did not think of yet.

There always is a multiple error scenario ;-)

> Because we did not anticipate that you would block the worker,
> and did call the handler before we notify the peer.
>
> *again* your blocking drbdadm detach from the local-io-error handler is
> the trigger.
> Don't do that. You also should even background the "notify", if you really
> insist on using it.

The local-io-error handler example in the documentation even has 'halt -f' in
the sequence. How does that run before notifying the peer?
None of the documentation examples show handlers being backgrounded, though.
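If I read the advice right, the backgrounded notify would look like this (my guess, since no example in the docs shows it):

```
handlers {
    # Trailing '&' so the notify script runs in the background and
    # the synchronous handler invocation returns immediately.
    local-io-error "/usr/lib/drbd/notify-io-error.sh &";
}
```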

> really? for 7 megabyte you want to pull ahead already?
> you basically won't have a consistent secondary, ever.

Another problem for me with DRBD: without the proxy, DRBD blocks when the
network buffer is full, even in protocol A. If the buffer is set to 10MB, the
pull-ahead has to happen before the buffer fills, or DRBD blocks. Our write
load is low enough not to trigger a resync too often, but when it does happen
I want DRBD to pull ahead, not slow down to WAN link speed. Somehow I don't
feel comfortable with the idea of keeping hundreds of MBs of network buffer
in the kernel. In this setup the secondary is more of a standby, meant to
hold the last possible data, so I don't mind it being a bit out of sync from
time to time.
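For completeness, the congestion setup I am describing (the 7MB threshold is the one questioned above; values illustrate my situation, not a recommendation):

```
net {
    protocol A;
    sndbuf-size 10M;
    # Go Ahead/Behind instead of blocking when the send buffer
    # congests; threshold deliberately kept below the buffer size.
    on-congestion pull-ahead;
    congestion-fill 7M;
}
```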

Regards,
Saso Slavicic



