[DRBD-user] problem with failover and inconsistent states

Eray Aslan eray.aslan at caf.com.tr
Fri May 6 08:52:24 CEST 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


We have a 2 node drbd (version 8.3.9) cluster.  During failover last
night, secondary lost some files when promoted to primary and we had to
use our backups.  I am trying to investigate it and noticed:

green ~ # cat /proc/drbd | head -n4
version: 8.3.9 (api:88/proto:86-95)
built-in
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/DUnknown C r-----
     ns:505732 nr:0 dw:1054924 dr:787077381 al:38 bm:391 lo:0 pe:0 ua:0
     ap:0 ep:1 wo:d oos:188492
red ~ # cat /proc/drbd | head -n4
version: 8.3.9 (api:88/proto:86-95)
built-in
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
     ns:0 nr:358148 dw:505720 dr:786407892 al:0 bm:391 lo:0 pe:0 ua:0
     ap:0 ep:1 wo:d oos:0

Notice that the states are different across the nodes (Unknown vs
UpToDate).  In particular, please note the following log entry (full
logs below) which results in different states in nodes:

2011-05-05T15:11:56+00:00 green crm-fence-peer.sh[5873]: WARNING peer is
Secondary, did not place the constraint!

Googling lead to:
http://www.engardelinux.org/modules/index/list_archives.cgi?list=drbd-user&page=0156.html&month=2011-02
http://www.gossamer-threads.com/lists/drbd/users/20605#20605

but no solution on what to do.  I suppose I can change fencing to
resource-and-stonith to see if it gives better results but freezing IO
to solve a problem in secondary does not look like a good option.  Or
turn off digest integrity check but again this looks like a work around
and not a real solution.  crm-fence-peer.sh seems to be the culprit
here.

Any suggestions?  drbd.conf and logs are below.  Thanks.


drbd.conf:
global { 
	usage-count yes;
}
common {
	protocol C;

	handlers {
		pri-on-incon-degr "/usr/lib64/drbd/notify-pri-on-incon-degr.sh; /usr/lib64/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
		pri-lost-after-sb "/usr/lib64/drbd/notify-pri-lost-after-sb.sh; /usr/lib64/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
		local-io-error "/usr/lib64/drbd/notify-io-error.sh; /usr/lib64/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
		fence-peer "/usr/lib64/drbd/crm-fence-peer.sh";
		split-brain "/usr/lib64/drbd/notify-split-brain.sh root";
		after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
	}
	syncer {
		rate 80M;
		verify-alg sha1;
		al-extents 3389;
	}
	net {
		data-integrity-alg sha1;
		after-sb-0pri discard-zero-changes;
		after-sb-1pri discard-secondary;
		max-buffers 8000;
		max-epoch-size 8000;
	}
	disk {
		on-io-error call-local-io-error;
		fencing resource-only;
		no-disk-barrier;
		no-disk-flushes;
		no-md-flushes;
	}
}

resource home {
	meta-disk internal;
	device    /dev/drbd0;
	disk /dev/vg1/home;
	on green {
		address 10.250.0.120:7789;
	}
	on red {
		address 10.250.0.121:7789;
	}
}


Node 1:
2011-05-05T15:11:55.084132+00:00 green kernel: [25918.176005] block drbd0: sock was shut down by peer
2011-05-05T15:11:55.084139+00:00 green kernel: [25918.176011] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
2011-05-05T15:11:55.084141+00:00 green kernel: [25918.176016] block drbd0: short read expecting header on sock: r=0
2011-05-05T15:11:55.084143+00:00 green kernel: [25918.176071] block drbd0: Creating new current UUID
2011-05-05T15:11:55.084144+00:00 green kernel: [25918.176093] block drbd0: meta connection shut down by peer.
2011-05-05T15:11:55.084145+00:00 green kernel: [25918.176095] block drbd0: asender terminated
2011-05-05T15:11:55.084146+00:00 green kernel: [25918.176097] block drbd0: Terminating drbd0_asender
2011-05-05T15:11:55.084147+00:00 green kernel: [25918.176261] block drbd0: Connection closed
2011-05-05T15:11:55.084148+00:00 green kernel: [25918.176353] block drbd0: conn( BrokenPipe -> Unconnected ) 
2011-05-05T15:11:55.084149+00:00 green kernel: [25918.176355] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
2011-05-05T15:11:55.084150+00:00 green kernel: [25918.176360] block drbd0: receiver terminated
2011-05-05T15:11:55.084151+00:00 green kernel: [25918.176362] block drbd0: Restarting drbd0_receiver
2011-05-05T15:11:55.084152+00:00 green kernel: [25918.176363] block drbd0: receiver (re)started
2011-05-05T15:11:55.084153+00:00 green kernel: [25918.176366] block drbd0: conn( Unconnected -> WFConnection ) 
2011-05-05T15:11:55+00:00 green crm-fence-peer.sh[5873]: invoked for home
2011-05-05T15:11:55+00:00 green cibadmin: [5877]: info: Invoked: cibadmin -Ql 
2011-05-05T15:11:55+00:00 green cibadmin: [5886]: info: Invoked: cibadmin -Q -t 1 
2011-05-05T15:11:55.384117+00:00 green kernel: [25918.472322] block drbd0: Handshake successful: Agreed network protocol version 95
2011-05-05T15:11:55.384133+00:00 green kernel: [25918.472333] block drbd0: conn( WFConnection -> WFReportParams ) 
2011-05-05T15:11:55.384135+00:00 green kernel: [25918.472461] block drbd0: Starting asender thread (from drbd0_receiver [3494])
2011-05-05T15:11:55.384137+00:00 green kernel: [25918.472966] block drbd0: data-integrity-alg: sha1
2011-05-05T15:11:55.384140+00:00 green kernel: [25918.473032] block drbd0: max_segment_size ( = BIO size ) = 65536
2011-05-05T15:11:55.384141+00:00 green kernel: [25918.473105] block drbd0: drbd_sync_handshake:
2011-05-05T15:11:55.384144+00:00 green kernel: [25918.473111] block drbd0: self D1234DBED1F69961:A89D201798F5DE81:091B6F96533FDB03:DB024609978BDBC3 bits:38 flags:0
2011-05-05T15:11:55.384147+00:00 green kernel: [25918.473117] block drbd0: peer A89D201798F5DE80:0000000000000000:091B6F96533FDB02:DB024609978BDBC3 bits:0 flags:0
2011-05-05T15:11:55.384149+00:00 green kernel: [25918.473122] block drbd0: uuid_compare()=1 by rule 70
2011-05-05T15:11:55.384152+00:00 green kernel: [25918.473129] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate ) 
2011-05-05T15:11:55.494062+00:00 green kernel: [25918.584770] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent ) 
2011-05-05T15:11:55.494079+00:00 green kernel: [25918.584782] block drbd0: Began resync as SyncSource (will sync 152 KB [38 bits set]).
2011-05-05T15:11:55.514054+00:00 green kernel: [25918.603898] block drbd0: Resync done (total 1 sec; paused 0 sec; 152 K/sec)
2011-05-05T15:11:55.514065+00:00 green kernel: [25918.603907] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) 
2011-05-05T15:11:56+00:00 green cibadmin: [5899]: info: Invoked: cibadmin -Q -t 1 
2011-05-05T15:11:56+00:00 green crm-fence-peer.sh[5873]: WARNING peer is Secondary, did not place the constraint!
2011-05-05T15:11:56.244254+00:00 green kernel: [25919.336391] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 1 (0x100)
2011-05-05T15:11:56.244264+00:00 green kernel: [25919.336394] block drbd0: fence-peer helper broken, returned 1
2011-05-05T15:11:56.244265+00:00 green kernel: [25919.336399] block drbd0: pdsk( UpToDate -> DUnknown ) 
2011-05-05T15:11:56.244266+00:00 green kernel: [25919.336465] block drbd0: Creating new current UUID


Node 2:
2011-05-05T15:11:55.092182+00:00 red kernel: [25931.638216] block drbd0: Digest integrity check FAILED.
2011-05-05T15:11:55.092192+00:00 red kernel: [25931.638228] block drbd0: error receiving Data, l: 4140!
2011-05-05T15:11:55.092196+00:00 red kernel: [25931.638236] block drbd0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) 
2011-05-05T15:11:55.092198+00:00 red kernel: [25931.638364] block drbd0: asender terminated
2011-05-05T15:11:55.092200+00:00 red kernel: [25931.638368] block drbd0: Terminating drbd0_asender
2011-05-05T15:11:55.092202+00:00 red kernel: [25931.638479] block drbd0: Connection closed
2011-05-05T15:11:55.092204+00:00 red kernel: [25931.638484] block drbd0: conn( ProtocolError -> Unconnected ) 
2011-05-05T15:11:55.092206+00:00 red kernel: [25931.638487] block drbd0: receiver terminated
2011-05-05T15:11:55.092208+00:00 red kernel: [25931.638489] block drbd0: Restarting drbd0_receiver
2011-05-05T15:11:55.092210+00:00 red kernel: [25931.638491] block drbd0: receiver (re)started
2011-05-05T15:11:55.092212+00:00 red kernel: [25931.638494] block drbd0: conn( Unconnected -> WFConnection ) 
2011-05-05T15:11:55.382048+00:00 red kernel: [25931.934973] block drbd0: Handshake successful: Agreed network protocol version 95
2011-05-05T15:11:55.382059+00:00 red kernel: [25931.934981] block drbd0: conn( WFConnection -> WFReportParams ) 
2011-05-05T15:11:55.382062+00:00 red kernel: [25931.935142] block drbd0: Starting asender thread (from drbd0_receiver [4934])
2011-05-05T15:11:55.392097+00:00 red kernel: [25931.935436] block drbd0: data-integrity-alg: sha1
2011-05-05T15:11:55.392107+00:00 red kernel: [25931.935448] block drbd0: max_segment_size ( = BIO size ) = 65536
2011-05-05T15:11:55.392110+00:00 red kernel: [25931.935545] block drbd0: drbd_sync_handshake:
2011-05-05T15:11:55.392113+00:00 red kernel: [25931.935598] block drbd0: self A89D201798F5DE80:0000000000000000:091B6F96533FDB02:DB024609978BDBC3 bits:0 flags:0
2011-05-05T15:11:55.392116+00:00 red kernel: [25931.935604] block drbd0: peer D1234DBED1F69961:A89D201798F5DE81:091B6F96533FDB03:DB024609978BDBC3 bits:38 flags:0
2011-05-05T15:11:55.392118+00:00 red kernel: [25931.935609] block drbd0: uuid_compare()=-1 by rule 50
2011-05-05T15:11:55.392121+00:00 red kernel: [25931.935616] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
2011-05-05T15:11:55.492063+00:00 red kernel: [25932.042144] block drbd0: conn( WFBitMapT -> WFSyncUUID ) 
2011-05-05T15:11:55.502164+00:00 red kernel: [25932.047547] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
2011-05-05T15:11:55.502173+00:00 red kernel: [25932.049414] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
2011-05-05T15:11:55.502177+00:00 red kernel: [25932.049422] block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent ) 
2011-05-05T15:11:55.502180+00:00 red kernel: [25932.049430] block drbd0: Began resync as SyncTarget (will sync 152 KB [38 bits set]).
2011-05-05T15:11:55+00:00 red crm-unfence-peer.sh[26923]: invoked for home
2011-05-05T15:11:55.522131+00:00 red kernel: [25932.066275] block drbd0: Resync done (total 1 sec; paused 0 sec; 152 K/sec)
2011-05-05T15:11:55.522138+00:00 red kernel: [25932.066282] block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) 
2011-05-05T15:11:55.522140+00:00 red kernel: [25932.066359] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
2011-05-05T15:11:55+00:00 red cibadmin: [26927]: info: Invoked: cibadmin -Ql 
2011-05-05T15:11:55.582244+00:00 red kernel: [25932.127429] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
-- 
Eray Aslan



More information about the drbd-user mailing list