I have a DRBD resource and am seeing unexpected behavior during a failure that
I'm hoping someone can help with. For this particular resource I don't need
complete durability (i.e. it is OK for the secondary to catch up when it comes
back online), so I am using protocol A and have the DRBD device mounted and
exported over NFS. Shortly after performing a simple failure (drbdadm down
<resource> on the secondary), I am unable to write to the filesystem on the
primary because it is now a "Read-only file system". Initially the writes
continue on the primary as expected, but after a couple of seconds I see the
following error:
node01:~ # touch /shared0/tom
touch: cannot touch `/shared0/tom': Read-only file system
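For completeness, the failure was induced on the secondary like this (r0 being
the resource shown in the config dump below):

node02:~ # drbdadm down r0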
There is no change when bringing the secondary back on-line, even
though /proc/drbd does state that the resources are Consistent:
node01:~ # cat /proc/drbd
version: 0.7-pre8 (api:74/proto:72)
0: cs:Connected st:Secondary/Secondary ld:Consistent
ns:0 nr:0 dw:8 dr:145 al:1 bm:1 lo:0 pe:0 ua:0 ap:0
1: cs:Connected st:Secondary/Secondary ld:Consistent
ns:0 nr:369092 dw:369092 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0
2: cs:Connected st:Primary/Secondary ld:Consistent
ns:0 nr:0 dw:985192 dr:4552253 al:247 bm:554 lo:0 pe:0 ua:0 ap:0
node01:~ #
node02:~ # cat /proc/drbd
version: 0.7-pre8 (api:74/proto:72)
0: cs:Connected st:Secondary/Secondary ld:Consistent
ns:0 nr:0 dw:1840516 dr:5163374 al:251 bm:1049 lo:0 pe:0 ua:0 ap:0
1: cs:Connected st:Secondary/Secondary ld:Consistent
ns:0 nr:12 dw:12 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
2: cs:Connected st:Secondary/Primary ld:Consistent
ns:0 nr:5309384 dw:5309384 dr:0 al:0 bm:554 lo:0 pe:0 ua:0 ap:0
node02:~ #
Here is the current configuration:
node01:~ # drbdadm dump r0
resource r0 {
  protocol A;
  incon-degr-cmd "halt -f";
  on node01 {
    device    /dev/nb2;
    disk      /dev/hda7;
    address   9.42.114.96:7790;
    meta-disk internal;
  }
  on drbdHost_2 {
    device    /dev/nb2;
    disk      /dev/hda7;
    address   9.42.114.123:7790;
    meta-disk internal;
  }
  disk {
    on-io-error detach;
  }
  syncer {
    rate 100M;
    group 0;
    al-extents 257;
  }
  startup {
    degr-wfc-timeout 120;
  }
}
node02:~ # drbdadm dump r0
resource r0 {
  protocol A;
  incon-degr-cmd "halt -f";
  on node02 {
    device    /dev/nb2;
    disk      /dev/hda7;
    address   9.42.114.123:7790;
    meta-disk internal;
  }
  on drbdHost_1 {
    device    /dev/nb2;
    disk      /dev/hda7;
    address   9.42.114.96:7790;
    meta-disk internal;
  }
  disk {
    on-io-error detach;
  }
  syncer {
    rate 100M;
    group 0;
    al-extents 257;
  }
  startup {
    degr-wfc-timeout 120;
  }
}
For clarification: node02 is an alias for drbdHost_2 and node01 is an alias
for drbdHost_1 in both machines' /etc/hosts, and things work well in the
non-failed state.
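For illustration, the relevant /etc/hosts lines on both machines look roughly
like this (IPs taken from the config above):

9.42.114.96    drbdHost_1  node01
9.42.114.123   drbdHost_2  node02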
I've tried resolving the problem by invalidating the data on the secondary but
that didn't work. The only way I've discovered to get out of this state is to
make the resource secondary on both nodes followed by making one of the two
primary.
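Spelled out, the workaround looks like this for r0 (the umount/mount steps are
my assumption of what is needed, since DRBD refuses to demote a device that is
still in use, and the mount point and backing device are taken from the output
above):

node01:~ # umount /shared0
node01:~ # drbdadm secondary r0
node02:~ # drbdadm secondary r0
node01:~ # drbdadm primary r0
node01:~ # mount /dev/nb2 /shared0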
I'm using the SLES 9 drbd RPM, 0.7.0-59.22:
node02:~ # rpm -qa | grep drbd
drbd-0.7.0-59.22
Thanks in advance for any help you can offer.
As an aside, I have seen the same error with a resource using protocol C (not
over NFS) when performing a hard power down, i.e. pulling the power plug, on
Thanks,
Tom