I have a drbd resource and am seeing unexpected behavior during a failure that I'm hoping someone can help with. For this particular resource I don't need complete durability (i.e. it is OK for the secondary to catch up when it comes back online), so I am using protocol A and have the DRBD device mounted over NFS.

Shortly after performing a simple failure on the secondary, drbdadm down <resource>, I am unable to write to the filesystem on the primary because it is now a "Read-only file system". Initially the writes continue on the primary as expected, but after a couple of seconds I see the following error:

node01:~ # touch /shared0/tom
touch: cannot touch `/shared0/tom': Read-only file system

Nothing changes when bringing the secondary back online, even though /proc/drbd does state that the resources are Consistent:

node01:~ # cat /proc/drbd
version: 0.7-pre8 (api:74/proto:72)
 0: cs:Connected st:Secondary/Secondary ld:Consistent
    ns:0 nr:0 dw:8 dr:145 al:1 bm:1 lo:0 pe:0 ua:0 ap:0
 1: cs:Connected st:Secondary/Secondary ld:Consistent
    ns:0 nr:369092 dw:369092 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0
 2: cs:Connected st:Primary/Secondary ld:Consistent
    ns:0 nr:0 dw:985192 dr:4552253 al:247 bm:554 lo:0 pe:0 ua:0 ap:0
node01:~ #

node02:~ # cat /proc/drbd
version: 0.7-pre8 (api:74/proto:72)
 0: cs:Connected st:Secondary/Secondary ld:Consistent
    ns:0 nr:0 dw:1840516 dr:5163374 al:251 bm:1049 lo:0 pe:0 ua:0 ap:0
 1: cs:Connected st:Secondary/Secondary ld:Consistent
    ns:0 nr:12 dw:12 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
 2: cs:Connected st:Secondary/Primary ld:Consistent
    ns:0 nr:5309384 dw:5309384 dr:0 al:0 bm:554 lo:0 pe:0 ua:0 ap:0
node02:~ #

Here is the current configuration:

node01:~ # drbdadm dump r0
resource r0 {
    protocol A;
    incon-degr-cmd "halt -f";
    on node01 {
        device    /dev/nb2;
        disk      /dev/hda7;
        address   9.42.114.96:7790;
        meta-disk internal;
    }
    on drbdHost_2 {
        device    /dev/nb2;
        disk      /dev/hda7;
        address   9.42.114.123:7790;
        meta-disk internal;
    }
    disk {
        on-io-error detach;
    }
    syncer {
        rate       100M;
        group      0;
        al-extents 257;
    }
    startup {
        degr-wfc-timeout 120;
    }
}

node02:~ # drbdadm dump r0
resource r0 {
    protocol A;
    incon-degr-cmd "halt -f";
    on node02 {
        device    /dev/nb2;
        disk      /dev/hda7;
        address   9.42.114.123:7790;
        meta-disk internal;
    }
    on drbdHost_1 {
        device    /dev/nb2;
        disk      /dev/hda7;
        address   9.42.114.96:7790;
        meta-disk internal;
    }
    disk {
        on-io-error detach;
    }
    syncer {
        rate       100M;
        group      0;
        al-extents 257;
    }
    startup {
        degr-wfc-timeout 120;
    }
}

For clarification, node02 is an alias for drbdHost_2 in both machines' /etc/hosts, node01 is an alias for drbdHost_1, and things work well in a non-failed state.

I've tried resolving the problem by invalidating the data on the secondary, but that didn't work. The only way I've discovered to get out of this state is to make the resource secondary on both nodes and then make one of the two primary again.

I'm using the SLES 9 drbd rpm, 0.7.0-59.22:

node02:~ # rpm -qa | grep drbd
drbd-0.7.0-59.22

Thanks in advance for any help you can offer.

As an aside, I have seen the same error with a resource using protocol C (not over NFS) when performing a hard power down, pulling the power plug, on

Thanks,
Tom
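In case it helps anyone reproduce this, here is a small helper I use to pull the st: (node state) field out of /proc/drbd-style output when scripting checks. This is not from any DRBD tool, just a sketch assuming the 0.7-format lines shown above; the function name drbd_state is my own.

```shell
# drbd_state <minor>  -- reads /proc/drbd-style text on stdin and prints
# the st: field (e.g. Primary/Secondary) for the given device minor.
# Assumes the 0.7-era one-line-per-device state format shown above.
drbd_state() {
  awk -v want="$1:" '
    $1 == want {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^st:/) { sub(/^st:/, "", $i); print $i }
    }'
}

# example: drbd_state 2 < /proc/drbd
```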
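P.S. For reference, the demote-both-then-promote-one recovery sequence I described is sketched below, using the resource name r0 and mount point /shared0 from my setup. The run wrapper only echoes the commands so that pasting this does not actually touch a cluster; replace the echo with the real command to use it.

```shell
#!/bin/sh
# Sketch of the only recovery sequence that got me out of the read-only
# state. Dry-run by design: run() prints each command instead of executing.
run() { echo "+ $*"; }        # swap 'echo "+ $*"' for "$@" to execute for real

# Step 1: demote the resource on BOTH nodes
run drbdadm secondary r0      # run this on node01 and on node02

# Step 2: promote it again on exactly one node
run drbdadm primary r0        # run this on node01 only

# Step 3: the filesystem went read-only, so remount it read-write
run mount -o remount,rw /shared0
```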