Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 2006-11-17 13:00:32 +0100, Lars Knudsen wrote:
> Hello,
>
> I'm currently using DRBD for test purposes on two Dell PE2950s with a
> megasas disk system.
> My system is a SuSE 10.1 installation - kernel 2.6.16.13-4-smp, drbd:
> drbd: initialised. Version: 0.7.17 (api:77/proto:74)
> drbd: SVN Revision: 2125 build by lmb at chip, 2006-03-27 17:40:22
>
> I have created a simple DRBD volume using the config below (no linux-ha,
> heartbeat etc. yet):
>
> --
> resource drbd0 {
>   protocol C;
>   incon-degr-cmd "halt -f";
From the example config file:

  disk {
    # if the lower level device reports an io-error, you have the choice of
    # "pass_on" -> report the io-error to the upper layers.
    #              on the Primary: report it to the mounted file system.
    #              on the Secondary: ignore it.
    # "panic"   -> the node leaves the cluster by doing a kernel panic.
    # "detach"  -> the node drops its backing storage device and
    #              continues in diskless mode.
    #
    on-io-error detach;
  }

Also have a look at the "ko-count" parameter.
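For illustration, this is roughly how the two settings could look together (just
a sketch; the ko-count value of 4 and its placement in the "net" section are
meant as an example only, please check the drbd.conf man page of your 0.7.17):

  disk {
    # drop the backing device on io-error and continue diskless,
    # instead of blocking or passing the error up
    on-io-error detach;
  }
  net {
    # if the peer fails to complete a single write request within
    # ko-count * timeout, it is kicked out of the cluster
    ko-count 4;
  }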
>   syncer {
>     rate 110M;
>     group 1;
>     al-extents 257;
>   }
>
>   on dkvm1 {
>     device    /dev/drbd1;
>     disk      /dev/sda6;
>     address   10.100.10.101:7789;
>     meta-disk internal;
>   }
>
>   on dkvm2 {
>     device    /dev/drbd1;
>     disk      /dev/sda6;
>     address   10.100.10.102:7789;
>     meta-disk internal;
>   }
> }
> --
> A manual test of unplugging my primary node was successful - the state change
> was detected, and fsck plus remount gave me fs access on my secondary node.
>
> Today I hit this bug on my secondary node:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0610.1/2599.html and got log
> entries like:
> --
> sd 0:2:0:0: megasas: RESET -2189309 cmd=2a
> megasas: [ 0]waiting for 16 commands to complete
> megasas: [ 5]waiting for 16 commands to complete
> megasas: [10]waiting for 16 commands to complete
> --
> I would expect my secondary node to fail and my primary to stay online.
> Instead /dev/drbd1 seemed to be in a "hanging" state, and my primary node
> hung as well.
The fact is that drbd 0.7 does not handle backing storage io errors as well as
it could in all cases. And doing a kernel panic is not the most polite thing
for a driver to do either, but it effectively works in a two-node drbd cluster
(as long as the other node is healthy).
Configuring "detach" or "panic" for on-io-error would have done the trick in
this case, I think.
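As a sketch of what "detach" means in practice (assuming the resource name
"drbd0" from your config, and that the failed backing device has been repaired
or replaced in the meantime):

  # the detached node keeps running in diskless mode; once the backing
  # storage is usable again, re-attach it and let drbd resync
  drbdadm attach drbd0
  cat /proc/drbd          # watch the connection/disk state and the resync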
We improved the behaviour of drbd after hardware errors considerably in drbd8,
and the good news is that we expect it to be released as 8.0 during December.
--
: Lars Ellenberg Tel +43-1-8178292-0 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe http://www.linbit.com :
__
please use the "List-Reply" function of your email client.