Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
We have a 2 node drbd (version 8.3.9) cluster. During failover last night, secondary lost some files when promoted to primary and we had to use our backups. I am trying to investigate it and noticed: green ~ # cat /proc/drbd | head -n4 version: 8.3.9 (api:88/proto:86-95) built-in 0: cs:Connected ro:Primary/Secondary ds:UpToDate/DUnknown C r----- ns:505732 nr:0 dw:1054924 dr:787077381 al:38 bm:391 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:188492 red ~ # cat /proc/drbd | head -n4 version: 8.3.9 (api:88/proto:86-95) built-in 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----- ns:0 nr:358148 dw:505720 dr:786407892 al:0 bm:391 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0 Notice that the states are different across the nodes (Unknown vs UpToDate). In particular, please note the following log entry (full logs below) which results in different states in nodes: 2011-05-05T15:11:56+00:00 green crm-fence-peer.sh[5873]: WARNING peer is Secondary, did not place the constraint! Googling lead to: http://www.engardelinux.org/modules/index/list_archives.cgi?list=drbd-user&page=0156.html&month=2011-02 http://www.gossamer-threads.com/lists/drbd/users/20605#20605 but no solution on what to do. I suppose I can change fencing to resource-and-stonith to see if it gives better results but freezing IO to solve a problem in secondary does not look like a good option. Or turn off digest integrity check but again this looks like a work around and not a real solution. crm-fence-peer.sh seems to be the culprit here. Any suggestions? drbd.conf and logs are below. Thanks. drbd.conf: global { usage-count yes; } common { protocol C; handlers { pri-on-incon-degr "/usr/lib64/drbd/notify-pri-on-incon-degr.sh; /usr/lib64/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f"; pri-lost-after-sb "/usr/lib64/drbd/notify-pri-lost-after-sb.sh; /usr/lib64/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f"; local-io-error "/usr/lib64/drbd/notify-io-error.sh; /usr/lib64/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f"; fence-peer "/usr/lib64/drbd/crm-fence-peer.sh"; split-brain "/usr/lib64/drbd/notify-split-brain.sh root"; after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh"; } syncer { rate 80M; verify-alg sha1; al-extents 3389; } net { data-integrity-alg sha1; after-sb-0pri discard-zero-changes; after-sb-1pri discard-secondary; max-buffers 8000; max-epoch-size 8000; } disk { on-io-error call-local-io-error; fencing resource-only; no-disk-barrier; no-disk-flushes; no-md-flushes; } } resource home { meta-disk internal; device /dev/drbd0; disk /dev/vg1/home; on green { address 10.250.0.120:7789; } on red { address 10.250.0.121:7789; } } Node 1: 2011-05-05T15:11:55.084132+00:00 green kernel: [25918.176005] block drbd0: sock was shut down by peer 2011-05-05T15:11:55.084139+00:00 green kernel: [25918.176011] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 2011-05-05T15:11:55.084141+00:00 green kernel: [25918.176016] block drbd0: short read expecting header on sock: r=0 2011-05-05T15:11:55.084143+00:00 green kernel: [25918.176071] block drbd0: Creating new current UUID 2011-05-05T15:11:55.084144+00:00 green kernel: [25918.176093] block drbd0: meta connection shut down by peer. 2011-05-05T15:11:55.084145+00:00 green kernel: [25918.176095] block drbd0: asender terminated 2011-05-05T15:11:55.084146+00:00 green kernel: [25918.176097] block drbd0: Terminating drbd0_asender 2011-05-05T15:11:55.084147+00:00 green kernel: [25918.176261] block drbd0: Connection closed 2011-05-05T15:11:55.084148+00:00 green kernel: [25918.176353] block drbd0: conn( BrokenPipe -> Unconnected ) 2011-05-05T15:11:55.084149+00:00 green kernel: [25918.176355] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 2011-05-05T15:11:55.084150+00:00 green kernel: [25918.176360] block drbd0: receiver terminated 2011-05-05T15:11:55.084151+00:00 green kernel: [25918.176362] block drbd0: Restarting drbd0_receiver 2011-05-05T15:11:55.084152+00:00 green kernel: [25918.176363] block drbd0: receiver (re)started 2011-05-05T15:11:55.084153+00:00 green kernel: [25918.176366] block drbd0: conn( Unconnected -> WFConnection ) 2011-05-05T15:11:55+00:00 green crm-fence-peer.sh[5873]: invoked for home 2011-05-05T15:11:55+00:00 green cibadmin: [5877]: info: Invoked: cibadmin -Ql 2011-05-05T15:11:55+00:00 green cibadmin: [5886]: info: Invoked: cibadmin -Q -t 1 2011-05-05T15:11:55.384117+00:00 green kernel: [25918.472322] block drbd0: Handshake successful: Agreed network protocol version 95 2011-05-05T15:11:55.384133+00:00 green kernel: [25918.472333] block drbd0: conn( WFConnection -> WFReportParams ) 2011-05-05T15:11:55.384135+00:00 green kernel: [25918.472461] block drbd0: Starting asender thread (from drbd0_receiver [3494]) 2011-05-05T15:11:55.384137+00:00 green kernel: [25918.472966] block drbd0: data-integrity-alg: sha1 2011-05-05T15:11:55.384140+00:00 green kernel: [25918.473032] block drbd0: max_segment_size ( = BIO size ) = 65536 2011-05-05T15:11:55.384141+00:00 green kernel: [25918.473105] block drbd0: drbd_sync_handshake: 2011-05-05T15:11:55.384144+00:00 green kernel: [25918.473111] block drbd0: self D1234DBED1F69961:A89D201798F5DE81:091B6F96533FDB03:DB024609978BDBC3 bits:38 flags:0 2011-05-05T15:11:55.384147+00:00 green kernel: [25918.473117] block drbd0: peer A89D201798F5DE80:0000000000000000:091B6F96533FDB02:DB024609978BDBC3 bits:0 flags:0 2011-05-05T15:11:55.384149+00:00 green kernel: [25918.473122] block drbd0: uuid_compare()=1 by rule 70 2011-05-05T15:11:55.384152+00:00 green kernel: [25918.473129] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate ) 2011-05-05T15:11:55.494062+00:00 green kernel: [25918.584770] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent ) 2011-05-05T15:11:55.494079+00:00 green kernel: [25918.584782] block drbd0: Began resync as SyncSource (will sync 152 KB [38 bits set]). 2011-05-05T15:11:55.514054+00:00 green kernel: [25918.603898] block drbd0: Resync done (total 1 sec; paused 0 sec; 152 K/sec) 2011-05-05T15:11:55.514065+00:00 green kernel: [25918.603907] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) 2011-05-05T15:11:56+00:00 green cibadmin: [5899]: info: Invoked: cibadmin -Q -t 1 2011-05-05T15:11:56+00:00 green crm-fence-peer.sh[5873]: WARNING peer is Secondary, did not place the constraint! 2011-05-05T15:11:56.244254+00:00 green kernel: [25919.336391] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 1 (0x100) 2011-05-05T15:11:56.244264+00:00 green kernel: [25919.336394] block drbd0: fence-peer helper broken, returned 1 2011-05-05T15:11:56.244265+00:00 green kernel: [25919.336399] block drbd0: pdsk( UpToDate -> DUnknown ) 2011-05-05T15:11:56.244266+00:00 green kernel: [25919.336465] block drbd0: Creating new current UUID Node 2: 2011-05-05T15:11:55.092182+00:00 red kernel: [25931.638216] block drbd0: Digest integrity check FAILED. 2011-05-05T15:11:55.092192+00:00 red kernel: [25931.638228] block drbd0: error receiving Data, l: 4140! 2011-05-05T15:11:55.092196+00:00 red kernel: [25931.638236] block drbd0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) 2011-05-05T15:11:55.092198+00:00 red kernel: [25931.638364] block drbd0: asender terminated 2011-05-05T15:11:55.092200+00:00 red kernel: [25931.638368] block drbd0: Terminating drbd0_asender 2011-05-05T15:11:55.092202+00:00 red kernel: [25931.638479] block drbd0: Connection closed 2011-05-05T15:11:55.092204+00:00 red kernel: [25931.638484] block drbd0: conn( ProtocolError -> Unconnected ) 2011-05-05T15:11:55.092206+00:00 red kernel: [25931.638487] block drbd0: receiver terminated 2011-05-05T15:11:55.092208+00:00 red kernel: [25931.638489] block drbd0: Restarting drbd0_receiver 2011-05-05T15:11:55.092210+00:00 red kernel: [25931.638491] block drbd0: receiver (re)started 2011-05-05T15:11:55.092212+00:00 red kernel: [25931.638494] block drbd0: conn( Unconnected -> WFConnection ) 2011-05-05T15:11:55.382048+00:00 red kernel: [25931.934973] block drbd0: Handshake successful: Agreed network protocol version 95 2011-05-05T15:11:55.382059+00:00 red kernel: [25931.934981] block drbd0: conn( WFConnection -> WFReportParams ) 2011-05-05T15:11:55.382062+00:00 red kernel: [25931.935142] block drbd0: Starting asender thread (from drbd0_receiver [4934]) 2011-05-05T15:11:55.392097+00:00 red kernel: [25931.935436] block drbd0: data-integrity-alg: sha1 2011-05-05T15:11:55.392107+00:00 red kernel: [25931.935448] block drbd0: max_segment_size ( = BIO size ) = 65536 2011-05-05T15:11:55.392110+00:00 red kernel: [25931.935545] block drbd0: drbd_sync_handshake: 2011-05-05T15:11:55.392113+00:00 red kernel: [25931.935598] block drbd0: self A89D201798F5DE80:0000000000000000:091B6F96533FDB02:DB024609978BDBC3 bits:0 flags:0 2011-05-05T15:11:55.392116+00:00 red kernel: [25931.935604] block drbd0: peer D1234DBED1F69961:A89D201798F5DE81:091B6F96533FDB03:DB024609978BDBC3 bits:38 flags:0 2011-05-05T15:11:55.392118+00:00 red kernel: [25931.935609] block drbd0: uuid_compare()=-1 by rule 50 2011-05-05T15:11:55.392121+00:00 red kernel: [25931.935616] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 2011-05-05T15:11:55.492063+00:00 red kernel: [25932.042144] block drbd0: conn( WFBitMapT -> WFSyncUUID ) 2011-05-05T15:11:55.502164+00:00 red kernel: [25932.047547] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 2011-05-05T15:11:55.502173+00:00 red kernel: [25932.049414] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0) 2011-05-05T15:11:55.502177+00:00 red kernel: [25932.049422] block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent ) 2011-05-05T15:11:55.502180+00:00 red kernel: [25932.049430] block drbd0: Began resync as SyncTarget (will sync 152 KB [38 bits set]). 2011-05-05T15:11:55+00:00 red crm-unfence-peer.sh[26923]: invoked for home 2011-05-05T15:11:55.522131+00:00 red kernel: [25932.066275] block drbd0: Resync done (total 1 sec; paused 0 sec; 152 K/sec) 2011-05-05T15:11:55.522138+00:00 red kernel: [25932.066282] block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) 2011-05-05T15:11:55.522140+00:00 red kernel: [25932.066359] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 2011-05-05T15:11:55+00:00 red cibadmin: [26927]: info: Invoked: cibadmin -Ql 2011-05-05T15:11:55.582244+00:00 red kernel: [25932.127429] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0) -- Eray Aslan