Ok - the test case is ...
Primary:
--------
$ cat /proc/drbd
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by buildsystem at linbit, 2011-01-28 12:28:22
0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:102060 nr:0 dw:7168 dr:119461 al:39 bm:201 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
$ drbdsetup 0 show | grep -v _is_default
disk {
    on-io-error detach;
    no-disk-barrier;
    no-disk-flushes;
    no-md-flushes;
}
net {
    max-epoch-size 20000;
    max-buffers 32000;
    unplug-watermark 16;
    ko-count 6;
    cram-hmac-alg "sha1";
    shared-secret "wurscht";
    after-sb-0pri discard-zero-changes;
    after-sb-1pri consensus;
    data-integrity-alg "sha1";
}
syncer {
    rate 81920k; # bytes/second
    al-extents 3389;
    csums-alg "sha1";
    verify-alg "sha1";
    cpu-mask "ff";
}
protocol C;
_this_host {
    device minor 0;
    disk "/dev/vgsys/lv_drbd_s10_disk";
    meta-disk internal;
    address ipv4 10.0.89.10:7789;
}
_remote_host {
    address ipv4 10.0.89.15:7789;
}
Secondary:
----------
$ cat /proc/drbd
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by buildsystem at linbit, 2011-01-28 12:28:22
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:110405 dw:110405 dr:98196 al:0 bm:199 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
$ drbdsetup 0 show | grep -v _is_default
disk {
    on-io-error detach;
    no-disk-barrier;
    no-disk-flushes;
    no-md-flushes;
}
net {
    max-epoch-size 20000;
    max-buffers 32000;
    unplug-watermark 16;
    ko-count 6;
    cram-hmac-alg "sha1";
    shared-secret "wurscht";
    after-sb-0pri discard-zero-changes;
    after-sb-1pri consensus;
    data-integrity-alg "sha1";
}
syncer {
    rate 81920k; # bytes/second
    al-extents 3389;
    csums-alg "sha1";
    verify-alg "sha1";
    cpu-mask "ff";
}
protocol C;
_this_host {
    device minor 0;
    disk "/dev/vgsys/lv_drbd_s15_disk";
    meta-disk internal;
    address ipv4 10.0.89.15:7789;
}
_remote_host {
    address ipv4 10.0.89.10:7789;
}
/dev/drbd0 is the PV for the LVM VG vg_drbd_s10_vols, and
I use one LV, test01, for write tests with dd (roughly
simulating the real-world kvm guests).
Freeze DRBD Backing Disk on Secondary:
--------------------------------------
$ dmsetup suspend vgsys-lv_drbd_s15_disk
$ dmsetup info vgsys-lv_drbd_s15_disk
Name: vgsys-lv_drbd_s15_disk
State: SUSPENDED
Read Ahead: 256
Tables present: LIVE
Open count: 2
Event number: 0
Major, minor: 253, 6
Number of targets: 1
UUID: LVM-NpXIxTWZFkG1HAhuPcLvxM056SVNR0Szawx7ng730TdUIeuUfCJT1pfccGXvMt7Q
Do some writes on Primary until it blocks:
------------------------------------------
$ dd if=/dev/urandom of=/dev/vg_drbd_s10_vols/test01 oflag=direct bs=512 count=1000 skip=$RANDOM &
[2] 13878
$ dd if=/dev/urandom of=/dev/vg_drbd_s10_vols/test01 oflag=direct bs=512 count=1000 skip=$RANDOM &
[3] 13879
PRIMARY$ cat /proc/drbd
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by buildsystem at linbit, 2011-01-28 12:28:22
0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:143669 nr:0 dw:48777 dr:553620 al:204 bm:201 lo:0 pe:3 ua:0 ap:3 ep:1 wo:d oos:0
SECONDARY$ cat /proc/drbd
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by buildsystem at linbit, 2011-01-28 12:28:22
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:143669 dw:143668 dr:98196 al:0 bm:199 lo:3 pe:0 ua:3 ap:0 ep:1 wo:d oos:0
Waited about 15 minutes now and the status is still the same.
No StandAlone is triggered by ko-count being decremented to zero after 6*6 sec.
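For reference, the cutoff I expected can be computed from the settings above
(the timeout value is an assumption: it does not appear in the dumps because
defaults are filtered out by 'grep -v _is_default', so the DRBD default of 60
tenths of a second, i.e. 6 sec, is assumed):

```shell
# Expected time until the primary should expel the frozen secondary:
ko_count=6      # from net { ko-count 6; }
timeout_ds=60   # DRBD 'timeout' in tenths of a second (assumed default)
echo "$(( ko_count * timeout_ds / 10 )) seconds"   # prints "36 seconds"
```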
No messages in syslog except:
...
Feb 15 17:09:12 bach-s10 kernel: block drbd0: dd[13878] Concurrent local write detected! [DISCARD L] new: 206868480s +512; pending: 206868480s +512
...
These are triggered by "dd if=/dev/urandom of=/dev/vg_drbd_s10_vols/test01 oflag=direct bs=512 count=1 skip=$RANDOM"
writing to the same locations as the blocked ones.
But that is "normal" for dd overwriting the same locations?
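A side note on why the writes overlap at all: dd's skip= operand offsets the
input (which is pointless for /dev/urandom), not the output, so every run
writes the same blocks 0..999 of test01 and parallel runs collide on
identical sectors. To actually scatter the writes, seek= is needed; a sketch
against a scratch file (path is hypothetical, and oflag=direct is dropped so
it also runs on filesystems without O_DIRECT support):

```shell
# seek= offsets the OUTPUT by $RANDOM 512-byte blocks; conv=notrunc
# keeps dd from truncating the target before writing.
dd if=/dev/urandom of=/tmp/test01.img bs=512 count=1000 seek=$RANDOM conv=notrunc
```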
*Mhh* - maybe someone can reproduce this?
Kind Regards,
Roland
On Tuesday 15 February 2011, Lars Ellenberg wrote:
> On Tue, Feb 15, 2011 at 12:23:10PM +0100, Roland Friedwagner wrote:
> > Hello,
> >
> > I stumbled over this one when the firmware of a storage controller
> > on a drbd secondary was upgraded and froze io for about 1
> > minute. Because drbd is the storage base of a kvm cluster (via
> > iscsi), the load of all guests goes up very high and all writing
> > processes in the guests freeze (that's pretty ok - I think ;-) until
> > io flows again on the upgraded controller.
> >
> > But what I expected to happen, with ko-count set to 6 and the
> > timeout at its default of 6 sec, is that the primary would go to
> > StandAlone mode after 36 seconds. But this does _not_ happen :-O
> >
> > drbd.conf man page states:
> > ko-count number
> > In case the secondary node fails to complete a single write
> > request for count times the timeout, it is expelled from the
> > cluster. (I.e. the primary node goes into StandAlone mode.)
> > The default value is 0, which disables this feature.
> >
> > I prepared a test case and reproduced the same behavior by
> > suspending io via dmsetup on a secondary with an lvm backed
> > backing device.
> >
> > So it looks like a bug?
> > (But maybe I missed something here ;-)
>
> Care to show logs + /proc/drbd,
> or the test case itself?
--
Roland.Friedwagner at wu.ac.at Phone: +43 1 31336 5377
IT Services - WU (Vienna University of Economics and Business)