[DRBD-user] Secondary node io-error

Wed Oct 10 11:01:45 CEST 2012

On Wed, Oct 10, 2012 at 03:42:02AM +0000, Velayutham, Prakash wrote:
> 
> On Oct 8, 2012, at 9:19 AM, Velayutham, Prakash wrote:
> 
> > On Oct 8, 2012, at 4:55 AM, Lars Ellenberg wrote:
> > 
> >> On Sat, Oct 06, 2012 at 01:08:43PM +0000, Velayutham, Prakash wrote:
> >>> Hi,
> >>> 
> >>> I recently got a DRBD (8.4.2-2) cluster up (still testing). It seems to work nicely with Pacemaker CRM in several scenarios I have tested. Here is my config.
> >>> 
> >>> global {
> >>>       usage-count     yes;
> >>> }
> >>> 
> >>> common {
> >>>       handlers {
> >>>               outdate-peer    /usr/lib/drbd/crm-fence-peer.sh;
> >>>               fence-peer      /usr/lib/drbd/crm-fence-peer.sh;
> >>>               after-resync-target     /usr/lib/drbd/crm-unfence-peer.sh;
> >>>               local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
> >>>               split-brain "/usr/lib/drbd/notify-split-brain.sh root";
> >>>       }
> >>> 
> >>>       startup {
> >>>               degr-wfc-timeout        0;
> >>>       }
> >>> 
> >>>       net {
> >>>               shared-secret   1QP69G4kWDslx2TMiaEStI6bwaGH5y8d;
> >>>               after-sb-0pri discard-zero-changes;
> >>>               after-sb-1pri discard-secondary;
> >>>               after-sb-2pri disconnect;
> >>>       }
> >>> 
> >>>       disk {
> >>>               on-io-error     call-local-io-error;
> >>>               fencing resource-and-stonith;
> >>>       }
> >>> 
> >>> }
> >>> 
> >>> The io-error handler only gets called when the primary node has a disk
> >>> issue. I have not seen the secondary node call the "local-io-error"
> >>> handler when it had disk access issues. Is this by design?
> >> 
> >> No.
> >> 
> >> "Works for me", though.
> >> 
> >> Can you please double check?
> >> And if in fact you can reproduce, tell us how, including logs?

> > If I disable all the FC ports in the fiber switch just for the
> > primary node, the node fences, reboots and comes up, as I would
> > expect. With the exact same config, if I disable the FC ports just
> > for the secondary node, the node just sits there and it even shows
> > up as Secondary in /proc/drbd.

> > That sounds odd and sounds like the
> > config should be "diskless", but it is "call-local-io-error".

Huh? What has "config" to do with things,
and what exactly is "config diskless"?

> > Which logs are you wanting me to share?

Those that show DRBD detecting an IO error,
but not calling the io-error handler.

> > Thanks,
> > Prakash
> 
> Just wanted to add this. I repeated my test and got exactly the same
> results. Here is /proc/drbd on the primary (bmimysqlt3) and the
> secondary (bmimysqlt4) before the secondary's disk is cut off
> (by disabling the fiber switch port that the secondary is connected to):
> 
> [root@bmimysqlt3 ~]# cat /proc/drbd
> version: 8.4.2 (api:1/proto:86-101)
> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by root@bmimysqlt3.chmcres.cchmc.org, 2012-10-02 00:02:32
>  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>     ns:184 nr:0 dw:160 dr:14317 al:6 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 
> [root@bmimysqlt4 ~]# cat /proc/drbd
> version: 8.4.2 (api:1/proto:86-101)
> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by root@bmimysqlt3.chmcres.cchmc.org, 2012-10-02 00:02:32
>  0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
>     ns:0 nr:184 dw:184 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 
> Here is /proc/drbd of primary and secondary about 5 minutes after the disk is cut off.
> 
> [root@bmimysqlt3 ~]# cat /proc/drbd
> version: 8.4.2 (api:1/proto:86-101)
> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by root@bmimysqlt3.chmcres.cchmc.org, 2012-10-02 00:02:32
>  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>     ns:184 nr:0 dw:160 dr:14317 al:6 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

  No additional writes.

> [root@bmimysqlt4 ~]# cat /proc/drbd
> version: 8.4.2 (api:1/proto:86-101)
> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by root@bmimysqlt3.chmcres.cchmc.org, 2012-10-02 00:02:32
>  0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
>     ns:0 nr:184 dw:184 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

Nothing transferred, nothing written, nothing changed.
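
To make that comparison less error-prone than eyeballing two outputs, here is a minimal sketch that diffs the numeric counters of two /proc/drbd statistics lines. The helper names (parse_counters, changed_counters) are made up for illustration and are not part of any DRBD tooling; the sample strings are the secondary's statistics lines quoted above.

```python
import re


def parse_counters(stats_line):
    """Return the numeric key:value counters from a /proc/drbd statistics line.

    Non-numeric fields such as "wo:f" are skipped.
    """
    return {key: int(value) for key, value in re.findall(r"(\w+):(\d+)", stats_line)}


def changed_counters(before, after):
    """Map each counter that differs to its (before, after) pair."""
    b, a = parse_counters(before), parse_counters(after)
    return {k: (b[k], a[k]) for k in b if k in a and b[k] != a[k]}


# Statistics lines of the secondary (bmimysqlt4), before and about
# 5 minutes after the disk was cut off, copied from the quoted output.
before = "ns:0 nr:184 dw:184 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0"
after = "ns:0 nr:184 dw:184 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0"

print(changed_counters(before, after))  # prints {}
```

An empty result means nr (network receive) and dw (disk write) did not move, so no write reached the secondary's backing device during the test.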

> As you can see, there is absolutely nothing there to suggest that the
> secondary even noticed the io-error.
>
> I can't understand what is going on.

Do you realize that you need to do IO to get (and then be able to notice) IO errors?
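
A minimal sketch of how one might generate that IO, assuming the DRBD device is mounted on the Primary and the script is given a path on that filesystem (the path is a placeholder you supply, and write_synced is a made-up helper name). With protocol C, as shown in the /proc/drbd output, a write on the Primary is also shipped to the Secondary, which then has to write to its own backing device; that is the point where a cut-off disk should surface as an IO error.

```python
import os
import sys


def write_synced(path, payload=b"x" * 4096):
    """Write payload to path and fsync it so the write is pushed to the block layer."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = os.write(fd, payload)
        os.fsync(fd)  # do not let the data sit in the page cache
    finally:
        os.close(fd)
    return written


if __name__ == "__main__" and len(sys.argv) > 1:
    # sys.argv[1] is assumed to be a file on the filesystem that lives
    # on the DRBD device, given on the command line on the Primary.
    print(write_synced(sys.argv[1]), "bytes written and synced")
```

After running this on the Primary, the ns counter there and the nr/dw counters on the Secondary should increase if the write went through; if the Secondary's disk is really gone, its kernel log is where to look for the IO error and the handler invocation.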

Cheers,

	Lars

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed