I have a DRBD resource and am seeing unexpected behavior during a failure that
I'm hoping someone can help with. For this particular resource I don't need
complete durability (i.e. it is OK for the secondary to catch up when it comes
back online), so I am using protocol A and have the DRBD device mounted and
exported over NFS. Shortly after performing a simple failure (drbdadm down
<resource> on the secondary), I am unable to write to the filesystem on the
primary because it is now a "Read-only file system". Initially the writes
continue on the primary as expected, but after a couple of seconds I see the
following error:
node01:~ # touch /shared0/tom
touch: cannot touch `/shared0/tom': Read-only file system
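For completeness, the failure was induced on the secondary like this (r0 being
the resource shown in the config dump below):

node02:~ # drbdadm down r0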
There is no change when bringing the secondary back on-line, even
though /proc/drbd does state that the resources are Consistent:
node01:~ # cat /proc/drbd
version: 0.7-pre8 (api:74/proto:72)
0: cs:Connected st:Secondary/Secondary ld:Consistent
ns:0 nr:0 dw:8 dr:145 al:1 bm:1 lo:0 pe:0 ua:0 ap:0
1: cs:Connected st:Secondary/Secondary ld:Consistent
ns:0 nr:369092 dw:369092 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0
2: cs:Connected st:Primary/Secondary ld:Consistent
ns:0 nr:0 dw:985192 dr:4552253 al:247 bm:554 lo:0 pe:0 ua:0 ap:0
node01:~ #
node02:~ # cat /proc/drbd
version: 0.7-pre8 (api:74/proto:72)
0: cs:Connected st:Secondary/Secondary ld:Consistent
ns:0 nr:0 dw:1840516 dr:5163374 al:251 bm:1049 lo:0 pe:0 ua:0 ap:0
1: cs:Connected st:Secondary/Secondary ld:Consistent
ns:0 nr:12 dw:12 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
2: cs:Connected st:Secondary/Primary ld:Consistent
ns:0 nr:5309384 dw:5309384 dr:0 al:0 bm:554 lo:0 pe:0 ua:0 ap:0
node02:~ #
Here is the current configuration:
node01:~ # drbdadm dump r0
resource r0 {
  protocol A;
  incon-degr-cmd "halt -f";
  on node01 {
    device    /dev/nb2;
    disk      /dev/hda7;
    address   9.42.114.96:7790;
    meta-disk internal;
  }
  on drbdHost_2 {
    device    /dev/nb2;
    disk      /dev/hda7;
    address   9.42.114.123:7790;
    meta-disk internal;
  }
  disk {
    on-io-error detach;
  }
  syncer {
    rate 100M;
    group 0;
    al-extents 257;
  }
  startup {
    degr-wfc-timeout 120;
  }
}
node02:~ # drbdadm dump r0
resource r0 {
  protocol A;
  incon-degr-cmd "halt -f";
  on node02 {
    device    /dev/nb2;
    disk      /dev/hda7;
    address   9.42.114.123:7790;
    meta-disk internal;
  }
  on drbdHost_1 {
    device    /dev/nb2;
    disk      /dev/hda7;
    address   9.42.114.96:7790;
    meta-disk internal;
  }
  disk {
    on-io-error detach;
  }
  syncer {
    rate 100M;
    group 0;
    al-extents 257;
  }
  startup {
    degr-wfc-timeout 120;
  }
}
For clarification: node02 is an alias for drbdHost_2 and node01 is an alias
for drbdHost_1 in both machines' /etc/hosts, and things work well in the
non-failed state.
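For illustration, the relevant /etc/hosts lines on both machines look roughly
like this (IPs taken from the config above):

9.42.114.96    drbdHost_1  node01
9.42.114.123   drbdHost_2  node02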
I've tried resolving the problem by invalidating the data on the secondary but
that didn't work. The only way I've discovered to get out of this state is to
make the resource secondary on both nodes followed by making one of the two
primary.
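Spelled out, the workaround looks like this for r0 (the umount/mount steps are
my assumption of what is needed, since DRBD refuses to demote a device that is
still in use, and the mount point and backing device are taken from the output
above):

node01:~ # umount /shared0
node01:~ # drbdadm secondary r0
node02:~ # drbdadm secondary r0
node01:~ # drbdadm primary r0
node01:~ # mount /dev/nb2 /shared0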
I'm using the SLES 9 drbd RPM, 0.7.0-59.22:
node02:~ # rpm -qa | grep drbd
drbd-0.7.0-59.22
Thanks in advance for any help you can offer.
As an aside, I have seen the same error with a resource using protocol C (not
over NFS) when performing a hard power down, i.e. pulling the power plug, on
Thanks,
Tom