Lars,

I think we have to separate the issues (or maybe this is where my mistake is). As I understand it, all the configuration issues you pointed out relate only to what happens after a split brain is detected or has occurred. A split brain can occur for many reasons. Yes, since the servers are collocated and share a separate network connection, it should be very unlikely, but not impossible. So it should not influence the synchronization mechanism.

I reproduce the issue here (some unimportant or confidential data trimmed/changed). If you need, I can send you ssh details so you can look into the boxes.

In short: I changed a value in a file on /data, where drbd is mounted, then copied the file to a normal disk on the same node and also to a normal disk on the peer. (My comments are between the ############# lines.)

1_mail ~ # nano /data/chroot/dns/etc/bind/pri/adomain-static.eu.zone
#############
Here I edited this file, changing the serial number from 2010030801 to 2010030802
#############
1_mail ~ # cp /data/chroot/dns/etc/bind/pri/adomain-static.eu.zone /home/adomain-static.eu.zone
1_mail ~ # diff /data/chroot/dns/etc/bind/pri/adomain-static.eu.zone /home/adomain-static.eu.zone
1_mail ~ # rsync /data/chroot/dns/etc/bind/pri/adomain-static.eu.zone 2_mail:/home
1_mail ~ # cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at 1_mail, 2010-03-04 01:20:09
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:17389628 nr:36296 dw:17425988 dr:135950 al:188 bm:118 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
1_mail ~ # ssh 2_mail cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at 2_mail, 2010-03-03 21:46:58
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
    ns:26104 nr:17391104 dw:17417208 dr:11941 al:57 bm:59 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
1_mail ~ # shutdown -r now
#############################
I verified that drbd says it is in sync and performed a restart on the primary server.
Immediately all resources moved and the second server became primary.
#############################
2_mail ~ # diff /data/chroot/dns/etc/bind/pri/adomain-static.eu.zone /home/adomain-static.eu.zone
6c6
< 2010012204
---
> 2010030802
###################################
It has a VERY old version!!!!!
Actually, it was a version that was edited when this server was primary and was synced by manually declaring the other one out of sync.
Meanwhile I edited this file several times, every time while 1_mail was primary.
###################################
2_mail ~ # cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at 2_mail, 2010-03-03 21:46:58
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r----
    ns:26104 nr:17398616 dw:17425392 dr:13246 al:59 bm:59 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:668
############
The other one is not up yet, so there is no strange sync between them.
############
2_mail ~ # crm_mon -1f
============
Last updated: Tue Mar 9 16:46:01 2010
Stack: Heartbeat
Current DC: 2_mail (5102bd2d-aef8-4b93-b01f-c5396c7dec41) - partition with quorum
Version: 1.0.7-2eed906f43e90ee1e0f7d411f814fc585b30f869
2 Nodes configured, 2 expected votes
6 Resources configured.
============
Online: [ 2_mail ]
OFFLINE: [ 1_mail ]

Master/Slave Set: ms_drbd
Masters: [ 2_mail ]
Stopped: [ drbd:0 ]
Resource Group: HA
fs (ocf::heartbeat:Filesystem): Started 2_mail
ip (ocf::heartbeat:IPaddr2): Started 2_mail
ip_flash (ocf::heartbeat:IPaddr2): Started 2_mail
nginx_flash_ld (ocf::heartbeat:ldirectord): Started 2_mail
pgsql (ocf::heartbeat:pgsql): Started 2_mail
named (lsb:named): Stopped

Migration summary:
* Node 2_mail:
named: migration-threshold=2 fail-count=2

Failed actions:
named_start_0 (node=2_mail, call=386, rc=1, status=complete): unknown error

Mar 9 16:44:43 2_mail lrmd: [5879]: info: rsc:named:384: start
Mar 9 16:44:43 2_mail crmd: [5882]: info: do_lrm_rsc_op: Performing key=57:0:0:0d15bbaf-4cda-4219-8e04-6667546ff9f1 op=named_start_0 )
Mar 9 16:44:43 2_mail lrmd: [5879]: info: RA output: (named:start:stdout) * Starting chrooted named ...
Mar 9 16:44:43 2_mail lrmd: [5879]: info: RA output: (named:start:stdout) * Mounting chroot dirs
Mar 9 16:44:43 2_mail lrmd: [5879]: info: RA output: (named:start:stdout) * mounting /etc/bind to /data/chroot/dns/etc/bind
Mar 9 16:44:43 2_mail lrmd: [5879]: info: RA output: (named:start:stdout) * mounting /var/bind to /data/chroot/dns/var/bind
Mar 9 16:44:43 2_mail lrmd: [5879]: info: RA output: (named:start:stdout) * mounting /var/log/named to /data/chroot/dns/var/log/named
Mar 9 16:44:43 2_mail named[4048]: starting BIND 9.6.1-P1 -u named -n 1 -t /data/chroot/dns
Mar 9 16:44:43 2_mail named[4048]: built with '--prefix=/usr' '--build=x86_64-pc-linux-gnu' '--host=x86_64-pc-linux-gnu' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--datadir=/usr/share' '--sysconfdir=/etc' '--localstatedir=/var/lib' '--libdir=/usr/lib64' '--sysconfdir=/etc/bind' '--localstatedir=/var' '--with-libtool' '--with-openssl' '--without-idn' '--disable-ipv6' '--without-libxml2' '--enable-linux-caps' '--enable-threads' '--with-randomdev=/dev/random' 'build_alias=x86_64-pc-linux-gnu' 'host_alias=x86_64-pc-linux-gnu' 'CFLAGS=-O2 -march=nocona -pipe' 'LDFLAGS=-Wl,-O1' 'CXXFLAGS=-O2 -march=nocona -pipe'
Mar 9 16:44:43 2_mail named[4048]: adjusted limit on open files from 1024 to 1048576
Mar 9 16:44:43 2_mail named[4048]: found 2 CPUs, using 1 worker thread
Mar 9 16:44:43 2_mail named[4048]: using up to 4096 sockets
Mar 9 16:44:43 2_mail named[4048]: loading configuration from '/etc/bind/named.conf'
Mar 9 16:44:43 2_mail named[4048]: /etc/bind/named.conf:26: expected IPv4 address or '*' near '{'
Mar 9 16:44:43 2_mail named[4048]: loading configuration: unexpected token
Mar 9 16:44:43 2_mail named[4048]: exiting (due to fatal error)
Mar 9 16:44:43 2_mail lrmd: [5879]: info: RA output: (named:start:stdout) [ !! ]
Mar 9 16:44:43 2_mail lrmd: [5879]: WARN: Managed named:start process 3974 exited with return code 1.
########
named failed to start because the configuration file residing on the drbd partition is an old one with errors in it!
Interestingly, this config file was modified into the wrong version 2 days ago, so long after the zone file above (with the different serial numbers) was edited.
This is what I tried to explain earlier: there is no time-based rule relating when the files were modified to when they were synced.
########
1_mail ~ # cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at 1_mail, 2010-03-04 01:20:09
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:32144 nr:940 dw:33084 dr:12505 al:56 bm:13 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1_mail ~ # crm_mon -1f
Last updated: Tue Mar 9 16:56:08 2010
Stack: Heartbeat
Current DC: 2_mail (5102bd2d-aef8-4b93-b01f-c5396c7dec41) - partition with quorum
Version: 1.0.7-2eed906f43e90ee1e0f7d411f814fc585b30f869
2 Nodes configured, 2 expected votes
6 Resources configured.
============
Online: [ 1_mail 2_mail ]

Master/Slave Set: ms_drbd
Masters: [ 1_mail ]
Slaves: [ 2_mail ]
Resource Group: HA
fs (ocf::heartbeat:Filesystem): Started 1_mail
ip (ocf::heartbeat:IPaddr2): Started 1_mail
ip_flash (ocf::heartbeat:IPaddr2): Started 1_mail
nginx_flash_ld (ocf::heartbeat:ldirectord): Started 1_mail
pgsql (ocf::heartbeat:pgsql): Started 1_mail
named (lsb:named): Started 1_mail

Migration summary:
* Node 2_mail:
named: migration-threshold=2 fail-count=2
* Node 1_mail:

Failed actions:
named_start_0 (node=2_mail, call=386, rc=1, status=complete): unknown error
#######
When 1_mail came back up, the resources were migrated to it, with the correct named config file and named zone file in place. diff is ok, named started.
#######

--
View this message in context: http://old.nabble.com/drbd-master-to-slave-synchronisation-under-heartbeat-tp27824570p27837328.html
Sent from the DRBD - User mailing list archive at Nabble.com.
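For anyone who wants to repeat the "check /proc/drbd before failover" step above without reading the status line by eye, here is a rough sketch. The function names `drbd_field` and `drbd_in_sync` are made up for this illustration and are not part of DRBD. The script only parses the `cs:`, `ds:` and `oos:` fields as they appear in the 8.3.7 output quoted above, and it only looks at the first resource. The sample text is the WFConnection status from 2_mail, copied from the transcript.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, field layout assumed from the
# /proc/drbd output of DRBD 8.3.7 quoted in this message.

# Print the value of one "name:value" field (e.g. cs, ro, ds, oos).
# $1 = field name, stdin = text in /proc/drbd format. First match only.
drbd_field() {
    tr ' ' '\n' | sed -n "s/^$1://p" | head -n 1
}

# Succeed only if DRBD itself reports a connected, fully up-to-date pair
# with zero out-of-sync blocks. stdin = text in /proc/drbd format.
drbd_in_sync() {
    status=$(cat)
    cs=$(printf '%s\n' "$status" | drbd_field cs)
    ds=$(printf '%s\n' "$status" | drbd_field ds)
    oos=$(printf '%s\n' "$status" | drbd_field oos)
    [ "$cs" = "Connected" ] && [ "$ds" = "UpToDate/UpToDate" ] && [ "$oos" = "0" ]
}

# Sample taken from the 2_mail output above (peer still rebooting).
sample=' 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r----
    ns:26104 nr:17398616 dw:17425392 dr:13246 al:59 bm:59 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:668'

if printf '%s\n' "$sample" | drbd_in_sync; then
    echo "DRBD reports in sync"
else
    echo "DRBD does not report in sync"
fi
```

On a real node the same check would be `drbd_in_sync < /proc/drbd`. Note that this only automates reading what DRBD reports; as the transcript shows, both nodes reported Connected, UpToDate/UpToDate and oos:0 while the file contents still differed after failover, so a passing check here says nothing about whether the data actually matches.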