Dear DRBD users and developers,

We have two clusters with several "simple" resources (single master, single resource per connection). Each resource contains either an LXC container or a VMware virtual machine (VMware Workstation v9.0.2).

We run "drbdadm verify all" every weekend, and we noticed that the resources hosting VMware machines often have out-of-sync blocks, while the ones hosting LXC containers never have any. We've seen this on both clusters (config quoted below).

What could be causing these out-of-sync blocks? Could VMware be modifying in-flight data? Is there a way I can make sure?

Thanks in advance if someone can shed some light on this issue.

Lionel Sausin.

---
Config on the oldest cluster:
Ubuntu 10.04, kernel 2.6.32 and DRBD 8.3.13.
Its resources use ext4 mounted with -o nobarrier.
The storage is a hardware RAID10 on SSDs.

global { usage-count yes; }
common {
    protocol C;

    # Actions to take in the face of special events
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        split-brain "/usr/lib/drbd/notify-split-brain.sh <<<cut>>>";
        out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh <<<cut>>>";
        before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 5 -- -c 16k";
        after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
    }

    startup { }

    net {
        # Restrict access to the resources with a shared secret
        cram-hmac-alg md5;
        shared-secret <<<cut>>>;

        # Congestion management lets writes flow without disconnecting
        # on-congestion pull-ahead;
        # congestion-fill 1M;

        # Go StandAlone if the peer is unreachable too long
        ko-count 10;

        # Allow 6 seconds for the other node's reply before we drop connections
        timeout 60;
    }

    syncer {
        # Compress the dirty-bitmaps
        use-rle;

        # Use checksumming to allow online verification
        # sha1 has fewer chances of hash collision but is CPU-hungry (Noa's CPU can only process up to 60MB/s)
        verify-alg md5;

        # Resync checksumming while verifying used to lead to a deadlock, fixed in v8.3.11
        csums-alg md5;

        # Adaptive syncer rate: let DRBD decide the best sync speed
        # initial sync rate
        rate 50M;
        # size of the rate adaptation window
        c-plan-ahead 20;
        # min/max rate
        # The network will allow only up to ~110MB/s, but verify and identical-block resyncs use very little network BW
        c-max-rate 800M;
        # quantity of sync data to maintain in the buffers (impacts the length of the wait queue)
        c-fill-target 100k;
        # Limit the bandwidth available for resync on the primary node when DRBD detects application I/O
        c-min-rate 8M;

        al-extents 1023;
    }
}

Typical resource config:

resource openerp {
    device /dev/drbd_openerp minor 0;
    meta-disk internal;
    on NodeA {
        address 10.100.1.2:7788;
        disk /dev/fast_vol/openerp;
    }
    on NodeB {
        address 10.100.1.3:7788;
        disk /dev/fast_vol/openerp;
    }
}

---
Config on the newest cluster:
Ubuntu 12.04, kernel 3.8 (raring stack) and DRBD 8.4.2.
Its resources use ext4 with default options, and were created with -b 4096 -E stride=64,stripe-width=192.
The storage is an SSD on the primary node, hardware RAID5 on the other side.

global_common.conf:

global { usage-count yes; }
common {
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        split-brain "/usr/lib/drbd/notify-split-brain.sh <<<cut>>>";
        out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh <<<cut>>>";
        before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
        after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
    }

    startup { }

    options { }

    disk { }

    net {
        # Peer authentication
        cram-hmac-alg sha1;
        shared-secret <<<cut>>>;

        # Skip sync when checksum matches
        csums-alg sha1;

        # Enable online verify
        verify-alg md5;
    }
}

Typical resource config:

resource web {
    device /dev/drbd_web minor 4;

    # Master
    on vmhost7 {
        address 10.100.0.14:7804;
        disk /dev/data/web;
        meta-disk internal;
    }

    # Slave
    on stockagec {
        address 10.100.0.13:7804;
        disk /dev/data1/web;
        meta-disk internal;
    }
}
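
For reference, the weekly "drbdadm verify all" results discussed above can be compared per resource by reading the "oos:" counter that DRBD 8.3/8.4 prints for each minor in /proc/drbd. The following is only a minimal sketch: the function name parse_oos and the SAMPLE text are made up for illustration and are not taken from either cluster.

```python
import re

def parse_oos(proc_drbd_text):
    """Return {minor: out-of-sync counter} from /proc/drbd-style text.

    Assumes the DRBD 8.3/8.4 layout, where each device starts with a line
    like ' 0: cs:Connected ...' followed by a statistics line ending in
    'oos:<number>'.
    """
    result = {}
    minor = None
    for line in proc_drbd_text.splitlines():
        # A device header line sets the minor the next counters belong to.
        header = re.match(r"\s*(\d+):\s+cs:", line)
        if header:
            minor = int(header.group(1))
        # The out-of-sync counter appears on the statistics line.
        oos = re.search(r"\boos:(\d+)", line)
        if oos and minor is not None:
            result[minor] = int(oos.group(1))
    return result

# Hypothetical sample, NOT real output from the clusters described above.
SAMPLE = """\
version: 8.4.2 (api:1/proto:86-101)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:664 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 4: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:10 nr:0 dw:10 dr:900 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:128
"""
```

On a DRBD node, the same function could be fed the live status with parse_oos(open("/proc/drbd").read()) right after the verify run finishes, so that the minors backing VMware images can be compared against those backing LXC containers.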