Hi, Lars

Thank you for the response. The proposed solution does not help.

As far as I understand, DRBD syncs all data bit for bit between peers,
and with protocol C a write operation returns only once the underlying
block device has reported the write as successful. Is that true?

Which part of the hardware could be responsible? Is it the RAID
controller, or could it be the network? It is difficult to accept that
the RAID controller is faulty, because the main system uses the same
adapter and is absolutely consistent. I turned off all offloading using
ethtool and it did not help.

16.05.2012 16:37, Lars Ellenberg wrote:
> On Wed, May 16, 2012 at 11:58:36AM +0400, Vladimir Kuklin wrote:
>> Hi guys
>>
>> I am experiencing very strange behaviour of DRBD.
>>
>> My setup is:
>>
>> Two absolutely identical Supermicro nodes using LSI 9265-8i
>> controller with SAS disks.
>>
>> Scientific Linux 6.2 with latest OpenVZ stable kernel that uses drbd 8.3.10.
>>
>> the excerpt from my drbd.conf is following:
>>
>> ...
>> resource r1 {
>>     net
>>     {
>>         max-buffers 8000;
>>         max-epoch-size 8000;
>>         sndbuf-size 2M;
>>         allow-two-primaries;
>>         after-sb-0pri discard-zero-changes;
>>         after-sb-1pri discard-secondary;
>>         after-sb-2pri disconnect;
>>         # data-integrity-alg crc32c;
>>         ping-int 25;
>>     }
>>     startup {
>>         become-primary-on both;
>>     }
>>
>>     syncer {
>>         rate 100M;
>>         al-extents 3383;
>>         csums-alg crc32c;
>>         verify-alg crc32c;
>>     }
>>
>>     disk {
>>         fencing resource-only;
>>         # no-disk-barrier;
>>         # no-disk-flushes;
>>     }
>>     handlers
>>     {
>>         split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>>         out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>>         # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>>         # after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>>     }
>>     protocol C;
>>
>>     on srv10 {
>>         device /dev/drbd1;
>>         disk /dev/sdf1;
>>         address 192.168.27.11:7789;
>>         meta-disk internal;
>>     }
>>
>>     on srv11 {
>>         device /dev/drbd1;
>>         disk /dev/sdf1;
>>         address 192.168.27.12:7789;
>>         meta-disk internal;
>>     }
>> }
>> ...
>>
>> The problem is that after initial synchronization if I run "drbdadm
>> verify r1", I get a bunch of out-of-sync blocks. Then I do
>> disconnect and connect of this resource and run "drbdadm verify r1"
>> again and then I again do get a bunch of out-of-sync blocks. Some of
>> them are false-positives and some of them are really out of sync as
>> dd shows (both with iflags=direct and without). And what is
>> important: there is no write operations on this device. Nothing is
>> written to it, but I get a bunch of out-of-sync blocks any time I
>> run verify and resync DRBD.
>>
>> [root@srv10 vvk]# drbdadm verify r1
>> [root@srv10 vvk]# cat /proc/drbd
>> version: 8.3.10 (api:88/proto:86-96)
>> GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by
>> phil@fat-tyre, 2011-01-28 12:17:35
>>
>>  1: cs:VerifyS ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>>     ns:1568452 nr:0 dw:4 dr:39777412 al:1 bm:1300 lo:333 pe:0 ua:367
>>     ap:0 ep:1 wo:b oos:484868
>>     [>....................] verified: 2.3% (9328/9536)M
>>     finish: 0:01:29 speed: 106,784 (106,784) want: 102,400 K/sec
>> [root@srv10 vvk]# cat /proc/drbd
>> version: 8.3.10 (api:88/proto:86-96)
>> GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by
>> phil@fat-tyre, 2011-01-28 12:17:35
>>
>>  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>>     ns:1568452 nr:0 dw:4 dr:49330856 al:1 bm:1300 lo:0 pe:0 ua:0
>>     ap:0 ep:1 wo:b oos:724448
>> [root@srv10 vvk]# dmesg | tail -10
>>
>> [ 2267.737437] block drbd1: helper command: /sbin/drbdadm
>> before-resync-source minor-1 exit code 0 (0x0)
>> [ 2267.737445] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk(
>> Consistent -> Inconsistent )
>> [ 2267.737455] block drbd1: Began resync as SyncSource (will sync
>> 724448 KB [181112 bits set]).
>> [ 2267.737494] block drbd1: updated sync UUID
>> 5B5B2B2EE43F8757:7BE37645CD25FF6D:7BE27645CD25FF6D:0001000000000000
>> [ 2276.569106] block drbd1: Resync done (total 8 sec; paused 0 sec;
>> 90556 K/sec)
>> [ 2276.569113] block drbd1: 51 % had equal check sums, eliminated:
>> 371216K; transferred 353232K total 724448K
>> [ 2276.569120] block drbd1: updated UUIDs
>> 5B5B2B2EE43F8757:0000000000000000:7BE37645CD25FF6D:7BE27645CD25FF6D
>> [ 2276.569129] block drbd1: conn( SyncSource -> Connected ) pdsk(
>> Inconsistent -> UpToDate )
>> [ 2276.569727] block drbd1: bitmap WRITE of 52 pages took 0 jiffies
>> [ 2276.569750] block drbd1: 0 KB (0 bits) marked out-of-sync by on
>> disk bit-map.
>>
>> [root@srv10 vvk]# cat /proc/drbd
>> version: 8.3.10 (api:88/proto:86-96)
>> GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by
>> phil@fat-tyre, 2011-01-28 12:17:35
>>
>>  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>>     ns:353232 nr:0 dw:4 dr:50055636 al:1 bm:1410 lo:0 pe:0 ua:0 ap:0
>>     ep:1 wo:b oos:0
>> [root@srv10 vvk]# drbdadm verify r1
>>
>> [root@srv10 vvk]# cat /proc/drbd
>> version: 8.3.10 (api:88/proto:86-96)
>> GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by
>> phil@fat-tyre, 2011-01-28 12:17:35
>>
>>  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>>     ns:353232 nr:0 dw:4 dr:59822788 al:1 bm:1410 lo:0 pe:0 ua:0 ap:0
>>     ep:1 wo:b oos:470532
>>
>>
>> Is this normal DRBD behaviour?
> No.
> That is most likely broken hardware.
>
> Still: try a verify with md5/sha1,
> and a resync without csums-alg,
> then an additional verify,
> and no other interaction with the device in between.
>
> If that still shows out of sync blocks,
> I'd strongly suggest you start
> replacing hardware components.
>

-- 
With best regards
Vladimir Kuklin
Senior Linux Administrator
LLC "Mango Telecom",
tel.: +7 (495) 540-44-44 ext. 3576
mob.: +7 (925) 083-86-53
email: vl.kuklin at mangotele.com
www.mango.ru
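
Editorial sketch, not part of the original exchange: the quoted report says that dd confirmed some blocks really differ between the peers. A check of that kind, independent of DRBD's own verify, could look roughly like the Python below. It hashes a file or device in fixed-size blocks and lists the block indexes whose digests differ. The 4 KiB block size is an assumption taken from the quoted log (724448 KB for 181112 bits set); the function names and the image paths in the usage note are hypothetical, and SHA-1 is used only because the reply suggests md5/sha1 for the verify.

```python
import hashlib

# Assumption: 4 KiB per block, matching 724448 KB / 181112 bits in the quoted log.
BLOCK_SIZE = 4096


def block_digests(path, block_size=BLOCK_SIZE, algo="sha1"):
    """Yield (block_index, hex_digest) for each block read from path."""
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            yield index, hashlib.new(algo, chunk).hexdigest()
            index += 1


def differing_blocks(digests_a, digests_b):
    """Return sorted block indexes whose digests differ or exist on one side only."""
    a = dict(digests_a)
    b = dict(digests_b)
    return sorted(i for i in a.keys() | b.keys() if a.get(i) != b.get(i))
```

One way to use it: with the resource idle, produce a digest listing on each node (for example from an image taken with dd, such as the hypothetical /tmp/srv10.img and /tmp/srv11.img), copy one listing across, and pass both to differing_blocks(). With "meta-disk internal" the DRBD metadata lives at the end of the backing device, so differences in that area would be expected when comparing /dev/sdf1 directly rather than /dev/drbd1.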