Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
We have a 2 node drbd (version 8.3.9) cluster. During failover last
night, secondary lost some files when promoted to primary and we had to
use our backups. I am trying to investigate it and noticed:
green ~ # cat /proc/drbd | head -n4
version: 8.3.9 (api:88/proto:86-95)
built-in
0: cs:Connected ro:Primary/Secondary ds:UpToDate/DUnknown C r-----
ns:505732 nr:0 dw:1054924 dr:787077381 al:38 bm:391 lo:0 pe:0 ua:0
ap:0 ep:1 wo:d oos:188492
red ~ # cat /proc/drbd | head -n4
version: 8.3.9 (api:88/proto:86-95)
built-in
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:358148 dw:505720 dr:786407892 al:0 bm:391 lo:0 pe:0 ua:0
ap:0 ep:1 wo:d oos:0
Notice that the states are different across the nodes (Unknown vs
UpToDate). In particular, please note the following log entry (full
logs below) which results in different states in nodes:
2011-05-05T15:11:56+00:00 green crm-fence-peer.sh[5873]: WARNING peer is
Secondary, did not place the constraint!
Googling lead to:
http://www.engardelinux.org/modules/index/list_archives.cgi?list=drbd-user&page=0156.html&month=2011-02
http://www.gossamer-threads.com/lists/drbd/users/20605#20605
but no solution on what to do. I suppose I can change fencing to
resource-and-stonith to see if it gives better results but freezing IO
to solve a problem in secondary does not look like a good option. Or
turn off digest integrity check but again this looks like a work around
and not a real solution. crm-fence-peer.sh seems to be the culprit
here.
Any suggestions? drbd.conf and logs are below. Thanks.
drbd.conf:
global {
usage-count yes;
}
common {
protocol C;
handlers {
pri-on-incon-degr "/usr/lib64/drbd/notify-pri-on-incon-degr.sh; /usr/lib64/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
pri-lost-after-sb "/usr/lib64/drbd/notify-pri-lost-after-sb.sh; /usr/lib64/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
local-io-error "/usr/lib64/drbd/notify-io-error.sh; /usr/lib64/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
fence-peer "/usr/lib64/drbd/crm-fence-peer.sh";
split-brain "/usr/lib64/drbd/notify-split-brain.sh root";
after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
syncer {
rate 80M;
verify-alg sha1;
al-extents 3389;
}
net {
data-integrity-alg sha1;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
max-buffers 8000;
max-epoch-size 8000;
}
disk {
on-io-error call-local-io-error;
fencing resource-only;
no-disk-barrier;
no-disk-flushes;
no-md-flushes;
}
}
resource home {
meta-disk internal;
device /dev/drbd0;
disk /dev/vg1/home;
on green {
address 10.250.0.120:7789;
}
on red {
address 10.250.0.121:7789;
}
}
Node 1:
2011-05-05T15:11:55.084132+00:00 green kernel: [25918.176005] block drbd0: sock was shut down by peer
2011-05-05T15:11:55.084139+00:00 green kernel: [25918.176011] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
2011-05-05T15:11:55.084141+00:00 green kernel: [25918.176016] block drbd0: short read expecting header on sock: r=0
2011-05-05T15:11:55.084143+00:00 green kernel: [25918.176071] block drbd0: Creating new current UUID
2011-05-05T15:11:55.084144+00:00 green kernel: [25918.176093] block drbd0: meta connection shut down by peer.
2011-05-05T15:11:55.084145+00:00 green kernel: [25918.176095] block drbd0: asender terminated
2011-05-05T15:11:55.084146+00:00 green kernel: [25918.176097] block drbd0: Terminating drbd0_asender
2011-05-05T15:11:55.084147+00:00 green kernel: [25918.176261] block drbd0: Connection closed
2011-05-05T15:11:55.084148+00:00 green kernel: [25918.176353] block drbd0: conn( BrokenPipe -> Unconnected )
2011-05-05T15:11:55.084149+00:00 green kernel: [25918.176355] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
2011-05-05T15:11:55.084150+00:00 green kernel: [25918.176360] block drbd0: receiver terminated
2011-05-05T15:11:55.084151+00:00 green kernel: [25918.176362] block drbd0: Restarting drbd0_receiver
2011-05-05T15:11:55.084152+00:00 green kernel: [25918.176363] block drbd0: receiver (re)started
2011-05-05T15:11:55.084153+00:00 green kernel: [25918.176366] block drbd0: conn( Unconnected -> WFConnection )
2011-05-05T15:11:55+00:00 green crm-fence-peer.sh[5873]: invoked for home
2011-05-05T15:11:55+00:00 green cibadmin: [5877]: info: Invoked: cibadmin -Ql
2011-05-05T15:11:55+00:00 green cibadmin: [5886]: info: Invoked: cibadmin -Q -t 1
2011-05-05T15:11:55.384117+00:00 green kernel: [25918.472322] block drbd0: Handshake successful: Agreed network protocol version 95
2011-05-05T15:11:55.384133+00:00 green kernel: [25918.472333] block drbd0: conn( WFConnection -> WFReportParams )
2011-05-05T15:11:55.384135+00:00 green kernel: [25918.472461] block drbd0: Starting asender thread (from drbd0_receiver [3494])
2011-05-05T15:11:55.384137+00:00 green kernel: [25918.472966] block drbd0: data-integrity-alg: sha1
2011-05-05T15:11:55.384140+00:00 green kernel: [25918.473032] block drbd0: max_segment_size ( = BIO size ) = 65536
2011-05-05T15:11:55.384141+00:00 green kernel: [25918.473105] block drbd0: drbd_sync_handshake:
2011-05-05T15:11:55.384144+00:00 green kernel: [25918.473111] block drbd0: self D1234DBED1F69961:A89D201798F5DE81:091B6F96533FDB03:DB024609978BDBC3 bits:38 flags:0
2011-05-05T15:11:55.384147+00:00 green kernel: [25918.473117] block drbd0: peer A89D201798F5DE80:0000000000000000:091B6F96533FDB02:DB024609978BDBC3 bits:0 flags:0
2011-05-05T15:11:55.384149+00:00 green kernel: [25918.473122] block drbd0: uuid_compare()=1 by rule 70
2011-05-05T15:11:55.384152+00:00 green kernel: [25918.473129] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
2011-05-05T15:11:55.494062+00:00 green kernel: [25918.584770] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
2011-05-05T15:11:55.494079+00:00 green kernel: [25918.584782] block drbd0: Began resync as SyncSource (will sync 152 KB [38 bits set]).
2011-05-05T15:11:55.514054+00:00 green kernel: [25918.603898] block drbd0: Resync done (total 1 sec; paused 0 sec; 152 K/sec)
2011-05-05T15:11:55.514065+00:00 green kernel: [25918.603907] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
2011-05-05T15:11:56+00:00 green cibadmin: [5899]: info: Invoked: cibadmin -Q -t 1
2011-05-05T15:11:56+00:00 green crm-fence-peer.sh[5873]: WARNING peer is Secondary, did not place the constraint!
2011-05-05T15:11:56.244254+00:00 green kernel: [25919.336391] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 1 (0x100)
2011-05-05T15:11:56.244264+00:00 green kernel: [25919.336394] block drbd0: fence-peer helper broken, returned 1
2011-05-05T15:11:56.244265+00:00 green kernel: [25919.336399] block drbd0: pdsk( UpToDate -> DUnknown )
2011-05-05T15:11:56.244266+00:00 green kernel: [25919.336465] block drbd0: Creating new current UUID
Node 2:
2011-05-05T15:11:55.092182+00:00 red kernel: [25931.638216] block drbd0: Digest integrity check FAILED.
2011-05-05T15:11:55.092192+00:00 red kernel: [25931.638228] block drbd0: error receiving Data, l: 4140!
2011-05-05T15:11:55.092196+00:00 red kernel: [25931.638236] block drbd0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
2011-05-05T15:11:55.092198+00:00 red kernel: [25931.638364] block drbd0: asender terminated
2011-05-05T15:11:55.092200+00:00 red kernel: [25931.638368] block drbd0: Terminating drbd0_asender
2011-05-05T15:11:55.092202+00:00 red kernel: [25931.638479] block drbd0: Connection closed
2011-05-05T15:11:55.092204+00:00 red kernel: [25931.638484] block drbd0: conn( ProtocolError -> Unconnected )
2011-05-05T15:11:55.092206+00:00 red kernel: [25931.638487] block drbd0: receiver terminated
2011-05-05T15:11:55.092208+00:00 red kernel: [25931.638489] block drbd0: Restarting drbd0_receiver
2011-05-05T15:11:55.092210+00:00 red kernel: [25931.638491] block drbd0: receiver (re)started
2011-05-05T15:11:55.092212+00:00 red kernel: [25931.638494] block drbd0: conn( Unconnected -> WFConnection )
2011-05-05T15:11:55.382048+00:00 red kernel: [25931.934973] block drbd0: Handshake successful: Agreed network protocol version 95
2011-05-05T15:11:55.382059+00:00 red kernel: [25931.934981] block drbd0: conn( WFConnection -> WFReportParams )
2011-05-05T15:11:55.382062+00:00 red kernel: [25931.935142] block drbd0: Starting asender thread (from drbd0_receiver [4934])
2011-05-05T15:11:55.392097+00:00 red kernel: [25931.935436] block drbd0: data-integrity-alg: sha1
2011-05-05T15:11:55.392107+00:00 red kernel: [25931.935448] block drbd0: max_segment_size ( = BIO size ) = 65536
2011-05-05T15:11:55.392110+00:00 red kernel: [25931.935545] block drbd0: drbd_sync_handshake:
2011-05-05T15:11:55.392113+00:00 red kernel: [25931.935598] block drbd0: self A89D201798F5DE80:0000000000000000:091B6F96533FDB02:DB024609978BDBC3 bits:0 flags:0
2011-05-05T15:11:55.392116+00:00 red kernel: [25931.935604] block drbd0: peer D1234DBED1F69961:A89D201798F5DE81:091B6F96533FDB03:DB024609978BDBC3 bits:38 flags:0
2011-05-05T15:11:55.392118+00:00 red kernel: [25931.935609] block drbd0: uuid_compare()=-1 by rule 50
2011-05-05T15:11:55.392121+00:00 red kernel: [25931.935616] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
2011-05-05T15:11:55.492063+00:00 red kernel: [25932.042144] block drbd0: conn( WFBitMapT -> WFSyncUUID )
2011-05-05T15:11:55.502164+00:00 red kernel: [25932.047547] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
2011-05-05T15:11:55.502173+00:00 red kernel: [25932.049414] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
2011-05-05T15:11:55.502177+00:00 red kernel: [25932.049422] block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
2011-05-05T15:11:55.502180+00:00 red kernel: [25932.049430] block drbd0: Began resync as SyncTarget (will sync 152 KB [38 bits set]).
2011-05-05T15:11:55+00:00 red crm-unfence-peer.sh[26923]: invoked for home
2011-05-05T15:11:55.522131+00:00 red kernel: [25932.066275] block drbd0: Resync done (total 1 sec; paused 0 sec; 152 K/sec)
2011-05-05T15:11:55.522138+00:00 red kernel: [25932.066282] block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
2011-05-05T15:11:55.522140+00:00 red kernel: [25932.066359] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
2011-05-05T15:11:55+00:00 red cibadmin: [26927]: info: Invoked: cibadmin -Ql
2011-05-05T15:11:55.582244+00:00 red kernel: [25932.127429] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
--
Eray Aslan