[DRBD-user] Secondary node io-error

Thu Oct 11 04:01:43 CEST 2012

On Oct 10, 2012, at 5:01 AM, Lars Ellenberg wrote:

> On Wed, Oct 10, 2012 at 03:42:02AM +0000, Velayutham, Prakash wrote:
>> 
>> On Oct 8, 2012, at 9:19 AM, Velayutham, Prakash wrote:
>> 
>>> On Oct 8, 2012, at 4:55 AM, Lars Ellenberg wrote:
>>> 
>>>> On Sat, Oct 06, 2012 at 01:08:43PM +0000, Velayutham, Prakash wrote:
>>>>> Hi,
>>>>> 
>>>>> I recently got a DRBD (8.4.2-2) cluster up (still testing). It seems to work nicely with Pacemaker CRM in several scenarios I have tested. Here is my config.
>>>>> 
>>>>> global {
>>>>>              usage-count     yes;
>>>>> }
>>>>> 
>>>>> common {
>>>>>      handlers {
>>>>>              outdate-peer    /usr/lib/drbd/crm-fence-peer.sh;
>>>>>              fence-peer      /usr/lib/drbd/crm-fence-peer.sh;
>>>>>              after-resync-target     /usr/lib/drbd/crm-unfence-peer.sh;
>>>>>              local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>>>>>              split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>>>>>      }
>>>>> 
>>>>>      startup {
>>>>>              degr-wfc-timeout        0;
>>>>>      }
>>>>> 
>>>>>      net {
>>>>>              shared-secret   1QP69G4kWDslx2TMiaEStI6bwaGH5y8d;
>>>>>              after-sb-0pri discard-zero-changes;
>>>>>              after-sb-1pri discard-secondary;
>>>>>              after-sb-2pri disconnect;
>>>>>      }
>>>>> 
>>>>>      disk {
>>>>>              on-io-error     call-local-io-error;
>>>>>              fencing resource-and-stonith;
>>>>>      }
>>>>> 
>>>>> }
>>>>> 
>>>>> The io-error handler only gets called when the primary node has a disk
>>>>> issue. I have not seen the secondary node call the "local-io-error"
>>>>> handler when it had disk access issues. Is this by design?
>>>> 
>>>> No.
>>>> 
>>>> "Works for me", though.
>>>> 
>>>> Can you please double check?
>>>> And if in fact you can reproduce, tell us how, including logs?
> 
>>> If I disable all the FC ports in the fiber switch just for the
>>> primary node, the node fences, reboots and comes up, as I would
>>> expect. With the exact same config, if I disable the FC ports just
>>> for the secondary node, the node just sits there and it even shows
>>> up as Secondary in /proc/drbd.
> 
>>> That sounds odd and sounds like the
>>> config should be "diskless", but it is "call-local-io-error".
> 
> Huh? What has "config" to do with things,
> and what exactly is "config diskless"?
> 
> 
>>> Which logs are you wanting me to share?
> 
> Those that show DRBD detecting an IO error,
> but not calling the io-error handler.
> 
>>> Thanks,
>>> Prakash
>> 
>> Just wanted to add this. I repeated my test and got the exact
>> same results. Here is /proc/drbd of the primary (bmimysqlt3) and
>> secondary (bmimysqlt4) before the secondary's disk is cut off
>> (by disabling the fiber switch port that the secondary is connected to):
>> 
>> [root@bmimysqlt3 ~]# cat /proc/drbd
>> version: 8.4.2 (api:1/proto:86-101)
>> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by root@bmimysqlt3.chmcres.cchmc.org, 2012-10-02 00:02:32
>> 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>>    ns:184 nr:0 dw:160 dr:14317 al:6 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>> 
>> [root@bmimysqlt4 ~]# cat /proc/drbd
>> version: 8.4.2 (api:1/proto:86-101)
>> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by root@bmimysqlt3.chmcres.cchmc.org, 2012-10-02 00:02:32
>> 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
>>    ns:0 nr:184 dw:184 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>> 
>> Here is /proc/drbd of primary and secondary about 5 minutes after the disk is cut off.
>> 
>> [root@bmimysqlt3 ~]# cat /proc/drbd
>> version: 8.4.2 (api:1/proto:86-101)
>> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by root@bmimysqlt3.chmcres.cchmc.org, 2012-10-02 00:02:32
>> 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>>    ns:184 nr:0 dw:160 dr:14317 al:6 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 
>  No additional writes.
> 
>> [root@bmimysqlt4 ~]# cat /proc/drbd
>> version: 8.4.2 (api:1/proto:86-101)
>> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by root@bmimysqlt3.chmcres.cchmc.org, 2012-10-02 00:02:32
>> 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
>>    ns:0 nr:184 dw:184 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 
> Nothing transferred, nothing written, nothing changed.
> 
>> As you can see, there is absolutely nothing there to suggest that the
>> secondary even noticed the io-error.
>> 
>> I can't understand what is going on.
> 
> Do you realize that you need to do IO to get (and then be able to notice) IO errors?
> 
> Cheers,
> 
> 	Lars

Wow, feeling like an idiot now. Sorry for the false alarm. I just got confused because the primary node got fenced right away without any sort of manual write operation from me, but the secondary did not exhibit that same behavior.

Thanks,
Prakash
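
P.S. For anyone finding this thread later: the point Lars made is that none of the counters in /proc/drbd moved between the two snapshots, so the secondary never touched its backing disk and had no chance to hit an I/O error. Below is a minimal sketch of that comparison, assuming a single-resource /proc/drbd in the 8.4 format quoted above. The function names parse_counters and changed_counters are made up for illustration, and the sample text is the secondary's status as posted in this thread.

```python
import re

# Secondary's /proc/drbd as quoted in this thread (identical before and after).
SECONDARY_STATUS = """\
version: 8.4.2 (api:1/proto:86-101)
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:184 dw:184 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
"""

def parse_counters(status_text):
    """Return the numeric counters (ns, nr, dw, dr, ...) from /proc/drbd text.

    Assumes one resource only; with several resources the later ones
    would overwrite the earlier ones in this simple sketch.
    """
    counters = {}
    for line in status_text.splitlines():
        if line.strip().startswith("ns:"):
            # Non-numeric fields such as wo:f are skipped by the \d+ match.
            for key, value in re.findall(r"([a-z]+):(\d+)", line):
                counters[key] = int(value)
    return counters

def changed_counters(before_text, after_text):
    """Map each counter that differs to its (before, after) pair."""
    before = parse_counters(before_text)
    after = parse_counters(after_text)
    return {k: (before[k], after.get(k)) for k in before if after.get(k) != before[k]}

if __name__ == "__main__":
    diff = changed_counters(SECONDARY_STATUS, SECONDARY_STATUS)
    if diff:
        print("counters moved:", diff)
    else:
        print("no counters moved, so no disk I/O happened in between")
```

On a real system one would read /proc/drbd twice and issue some writes on the primary in between, since a Secondary mainly does disk I/O when it receives replicated writes. If nr (network receive) and dw (disk write) stay flat, there has been nothing for the on-io-error policy to react to.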