[DRBD-user] handling I/O errors scenarios

Lars Ellenberg lars.ellenberg at linbit.com
Mon May 16 16:45:21 CEST 2011



On Tue, May 10, 2011 at 12:45:08PM +0200, motyllo8 wrote:
> Hi,
> 
> 
> I have cluster in Active/Passive configuration. Currently I am trying
> to support situation when I/O errors occur. I noticed that in
> drbd.conf default behaviour is halt node with failed disks.

Default behaviour?
Certainly not.
That is an example configuration, perhaps, for a certain use case,
to point out what it is possible to configure.
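For reference, the I/O error policy is configured per resource in
drbd.conf; a sketch, assuming a resource named "myresource" (the name
is a placeholder, the option values are the ones DRBD knows):

```
resource myresource {
  disk {
    # What to do when the backing device reports an I/O error:
    #   pass_on             - report the error to the upper layers (on Primary)
    #   call-local-io-error - invoke the local-io-error handler
    #   detach              - drop the backing device, continue Diskless
    on-io-error detach;
  }
}
```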

> This is a little bit brutal for me. What kind of other scenarios are
> here taken into account, if any? 

Once a node reaches a state where it cannot service IO requests anymore,
in certain deployments a fast node failure can improve overall service
availability.
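The "halt the node" behaviour you saw is typically wired up through the
local-io-error handler; a sketch of such a configuration (the handler
command is one common choice, not a default):

```
resource myresource {
  disk {
    on-io-error call-local-io-error;
  }
  handlers {
    # Deliberately drastic: take the node down fast so the peer
    # can take over, rather than limping along with a dead disk.
    local-io-error "echo o > /proc/sysrq-trigger";
  }
}
```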

> I was considering only disconnecting replication for resources where
> I/O errors occured and then promoting second node, but as I know this
> is not possible when resource is in Diskless state (btw. why?):
>
> drbdadm disconnect myresource
> 0: State change failed: (-2) Refusing to be Primary without at least one UpToDate disk
> Command 'drbdsetup 0 disconnect' terminated with exit code 17

A Primary that has become Diskless (because I/O errors caused it to
detach, or because of an explicit detach) will refuse to be
"gracefully" disconnected, because that would make its data
unavailable.

A Diskless Primary, while still connected, will service application
requests just fine via the other node.

So you can pick a convenient time to do a graceful switchover:
stop the services, demote DRBD on the Diskless node,
promote it on the other, and start services there.
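Sketched as a command sequence (resource and service names are
placeholders; run the first two commands on the Diskless node, the rest
on the peer):

```
# On the Diskless (current Primary) node:
service myservice stop          # stop whatever uses the DRBD device
drbdadm secondary myresource    # demote; succeeds once the device is unused

# On the peer node (with the healthy disk):
drbdadm primary myresource      # promote
service myservice start         # bring the services back up
```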

Or, of course, you could fix the broken disk,
and re-attach it to the Primary.
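The re-attach amounts to, once the backing device is healthy again (a
sketch, same placeholder resource name as above):

```
# On the Diskless Primary:
drbdadm attach myresource       # re-attach; DRBD resyncs from the peer
```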


If you (forcefully) disconnect a Diskless Primary, it can only fail all
IO requests (there is no data to service them from), or block indefinitely.

That's why it does not let you do that gracefully.
It just protects you from shooting yourself in the foot.


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed
