Ok - the test case is ...

Primary:
--------
$ cat /proc/drbd
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by buildsystem at linbit, 2011-01-28 12:28:22
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:102060 nr:0 dw:7168 dr:119461 al:39 bm:201 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0

$ drbdsetup 0 show | grep -v _is_default
disk {
        on-io-error             detach;
        no-disk-barrier ;
        no-disk-flushes ;
        no-md-flushes   ;
}
net {
        max-epoch-size          20000;
        max-buffers             32000;
        unplug-watermark        16;
        ko-count                6;
        cram-hmac-alg           "sha1";
        shared-secret           "wurscht";
        after-sb-0pri           discard-zero-changes;
        after-sb-1pri           consensus;
        data-integrity-alg      "sha1";
}
syncer {
        rate                    81920k; # bytes/second
        al-extents              3389;
        csums-alg               "sha1";
        verify-alg              "sha1";
        cpu-mask                "ff";
}
protocol C;
_this_host {
        device                  minor 0;
        disk                    "/dev/vgsys/lv_drbd_s10_disk";
        meta-disk               internal;
        address                 ipv4 10.0.89.10:7789;
}
_remote_host {
        address                 ipv4 10.0.89.15:7789;
}

Secondary:
----------
$ cat /proc/drbd
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by buildsystem at linbit, 2011-01-28 12:28:22
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:110405 dw:110405 dr:98196 al:0 bm:199 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0

$ drbdsetup 0 show | grep -v _is_default
disk {
        on-io-error             detach;
        no-disk-barrier ;
        no-disk-flushes ;
        no-md-flushes   ;
}
net {
        max-epoch-size          20000;
        max-buffers             32000;
        unplug-watermark        16;
        ko-count                6;
        cram-hmac-alg           "sha1";
        shared-secret           "wurscht";
        after-sb-0pri           discard-zero-changes;
        after-sb-1pri           consensus;
        data-integrity-alg      "sha1";
}
syncer {
        rate                    81920k; # bytes/second
        al-extents              3389;
        csums-alg               "sha1";
        verify-alg              "sha1";
        cpu-mask                "ff";
}
protocol C;
_this_host {
        device                  minor 0;
        disk                    "/dev/vgsys/lv_drbd_s15_disk";
        meta-disk               internal;
        address                 ipv4 10.0.89.15:7789;
}
_remote_host {
        address                 ipv4 10.0.89.10:7789;
}

/dev/drbd0 is a PV for the LVM VG vg_drbd_s10_vols and I use one LV test01 for
write tests with dd (a rough simulation of the real-world kvm guests).

Freeze DRBD Backing Disk on Secondary:
--------------------------------------
$ dmsetup suspend vgsys-lv_drbd_s15_disk
$ dmsetup info vgsys-lv_drbd_s15_disk
Name:              vgsys-lv_drbd_s15_disk
State:             SUSPENDED
Read Ahead:        256
Tables present:    LIVE
Open count:        2
Event number:      0
Major, minor:      253, 6
Number of targets: 1
UUID: LVM-NpXIxTWZFkG1HAhuPcLvxM056SVNR0Szawx7ng730TdUIeuUfCJT1pfccGXvMt7Q

Do some writes on Primary until it blocks:
------------------------------------------
$ dd if=/dev/urandom of=/dev/vg_drbd_s10_vols/test01 oflag=direct bs=512 count=1000 skip=$RANDOM &
[2] 13878
$ dd if=/dev/urandom of=/dev/vg_drbd_s10_vols/test01 oflag=direct bs=512 count=1000 skip=$RANDOM &
[3] 13879

PRIMARY$ cat /proc/drbd
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by buildsystem at linbit, 2011-01-28 12:28:22
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:143669 nr:0 dw:48777 dr:553620 al:204 bm:201 lo:0 pe:3 ua:0 ap:3 ep:1 wo:d oos:0

SECONDARY$ cat /proc/drbd
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by buildsystem at linbit, 2011-01-28 12:28:22
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:143669 dw:143668 dr:98196 al:0 bm:199 lo:3 pe:0 ua:3 ap:0 ep:1 wo:d oos:0

After waiting about 15 minutes the status is still the same: no StandAlone
was triggered by ko-count being decremented to zero after 6*6 sec.

No messages appear in syslog except:
...
Feb 15 17:09:12 bach-s10 kernel: block drbd0: dd[13878] Concurrent local write detected!
        [DISCARD L] new: 206868480s +512; pending: 206868480s +512
...
These are triggered by
"dd if=/dev/urandom of=/dev/vg_drbd_s10_vols/test01 oflag=direct bs=512 count=1 skip=$RANDOM"
writing to the same locations as the blocked dd processes - but that is
"normal" for dd overwriting the same locations, I assume?

*Mhh* - maybe someone can reproduce this?
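Just to spell out where the 36 seconds I expected come from (a sketch; the
drbd "timeout" net option is given in tenths of a second and defaults to 60,
i.e. 6 seconds - I did not change it in this setup):

```shell
# Expected worst-case blocking time before the primary should expel the
# frozen secondary: ko-count * timeout.
# Values taken from the config above and the drbd default timeout.
ko_count=6       # net { ko-count 6; }
timeout_ds=60    # default "timeout", in tenths of a second (= 6 s)
echo $(( ko_count * timeout_ds / 10 ))   # -> 36 (seconds)
```

So with ko-count 6 and the default timeout the primary should give up on the
secondary after 36 seconds - which is exactly what does not happen above.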
Kind Regards,
Roland

On Tuesday 15 February 2011 Lars Ellenberg wrote:
> On Tue, Feb 15, 2011 at 12:23:10PM +0100, Roland Friedwagner wrote:
> > Hello,
> >
> > I stumbled over this one when the firmware of a storage controller
> > on a drbd secondary got upgraded and froze io for about 1 minute.
> > Because drbd is the storage base of a kvm cluster (via iscsi), the
> > load of all guests goes up very high and all writing processes in
> > the guests freeze (that's pretty ok - i think;-) until io flows
> > again on the upgraded controller.
> >
> > But what I expected to happen, with ko-count set to 6 and timeout
> > at the default of 6 sec, is that the primary would go to StandAlone
> > mode after 36 seconds. But this does _not_ happen :-O
> >
> > The drbd.conf man page states:
> >   ko-count number
> >     In case the secondary node fails to complete a single write
> >     request for count times the timeout, it is expelled from the
> >     cluster. (I.e. the primary node goes into StandAlone mode.)
> >     The default value is 0, which disables this feature.
> >
> > I prepared a test case and reproduced the same behavior by
> > suspending io via dmsetup on a secondary with an lvm backed
> > backing device.
> >
> > So it looks like a bug?
> > (But maybe I missed something here ;-)
>
> Care to show logs + /proc/drbd,
> or the test case itself?

--
Roland.Friedwagner at wu.ac.at           Phone: +43 1 31336 5377
IT Services - WU (Vienna University of Economics and Business)