[DRBD-user] drbd with heartbeat doesn't sync both ways

Lars Ellenberg Lars.Ellenberg at linbit.com
Wed Sep 20 12:59:46 CEST 2006

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


/ 2006-09-18 03:36:30 +0200
\ Christophe Zwecker:
> Tim Jackson wrote:
> >Christophe Zwecker wrote:
> >>node1 is primary with mounted fs
> >>node2 is secondary
> >>
> >>node1 goes down (only a network failure),
> >"only" network failure? Which network? In many cases, a network failure alone is worse than one box completely 
> >failing, because it can cause "split brain" if you're not careful.
> 
> I pulled the network cable from node1, which leaves the crossover cable between node1 and node2.
> 
> >What connections do you have for Heartbeat to use? (A serial heartbeat is always a good idea if you can have it). 
> >As many redundant paths as possible is good. (typical might be 3: replication (crossover) network between the DRBD 
> >machines, "normal" network and serial heartbeat)
> 
> I use a crossover cable for testing; I'll add a serial link for production.
> 
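the redundant paths Tim suggests above take only a few lines in
/etc/ha.d/ha.cf. a rough, untested sketch -- the interface names and the
serial port here are assumptions, the node names are taken from your
logs:

# /etc/ha.d/ha.cf -- communication paths only, other settings omitted
# serial heartbeat, independent of any switch or NIC
serial /dev/ttyS0
baud 19200
# "normal" network (eth0) plus the crossover/replication link (eth1)
bcast eth0
bcast eth1
keepalive 2
deadtime 30
auto_failback off
node mw-test-n1.i-dis.net
node mw-test-n2.i-dis.net
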
> >>heartbeat unmounts the drbd fs on node1. node 2 takes over and mounts the drbd volume. 
> >And what happens to node1 here? Are you sure that Heartbeat stops the DRBD services? My guess is that you have a 
> >single network connection for both DRBD and Heartbeat, in which case DRBD will still be primary on node1.
> 
> yes, heartbeat makes drbd secondary on node1 and primary on node2:
> 
> heartbeat[17239]: 2006/09/15_15:08:42 WARN: node 192.168.1.254: is dead
> heartbeat[17239]: 2006/09/15_15:08:42 info: Link 192.168.1.254:192.168.1.254 dead.
> harc[18084]:    2006/09/15_15:08:42 info: Running /etc/ha.d/rc.d/status status
> heartbeat[17239]: 2006/09/15_15:08:54 info: mw-test-n1.i-dis.net wants to go standby [all]
> heartbeat[17239]: 2006/09/15_15:08:55 info: standby: mw-test-n2.i-dis.net can take our all resources
> heartbeat[18103]: 2006/09/15_15:08:55 info: give up all HA resources (standby).
> ResourceManager[18113]: 2006/09/15_15:08:55 info: Releasing resource group: mw-test-n1.i-dis.net drbddisk::ha 
> Filesystem::/dev/drbd0::/ha::ext3 192.168.1.123 httpd mysql
> ResourceManager[18113]: 2006/09/15_15:08:55 info: Running /etc/init.d/mysql  stop
> ResourceManager[18113]: 2006/09/15_15:08:59 info: Running /etc/init.d/httpd  stop
> ResourceManager[18113]: 2006/09/15_15:08:59 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.123 stop
> IPaddr[18295]:  2006/09/15_15:08:59 INFO: /sbin/route -n del -host 192.168.1.123
> IPaddr[18295]:  2006/09/15_15:08:59 INFO: /sbin/ifconfig eth0:0 192.168.1.123 down
> IPaddr[18295]:  2006/09/15_15:08:59 INFO: IP Address 192.168.1.123 released
> IPaddr[18225]:  2006/09/15_15:08:59 INFO: IPaddr Success
> ResourceManager[18113]: 2006/09/15_15:08:59 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha ext3 stop
> Filesystem[18415]:      2006/09/15_15:09:00 INFO: Running stop for /dev/drbd0 on /ha
> Filesystem[18415]:      2006/09/15_15:09:00 INFO: unmounted /ha successfully
> Filesystem[18351]:      2006/09/15_15:09:00 INFO: Filesystem Success
> ResourceManager[18113]: 2006/09/15_15:09:00 info: Running /etc/ha.d/resource.d/drbddisk ha stop
> heartbeat[18103]: 2006/09/15_15:09:00 info: all HA resource release completed (standby).
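
the "Releasing resource group" line above corresponds to an
/etc/ha.d/haresources entry along these lines, with node1 as the
preferred node:

# /etc/ha.d/haresources
mw-test-n1.i-dis.net drbddisk::ha Filesystem::/dev/drbd0::/ha::ext3 192.168.1.123 httpd mysql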
> 
> 
> so on node1:
> [root at mw-test-n1 ~]# cat /proc/drbd
> version: 0.7.21 (api:79/proto:74)
> SVN Revision: 2326 build by root at mw-test-n1.i-dis.net, 2006-09-11 16:41:09
>  0: cs:WFConnection st:Secondary/Unknown ld:Consistent
>     ns:402708 nr:444 dw:403368 dr:14442 al:104 bm:381 lo:0 pe:0 ua:0 ap:0
> 
> 
> on node2:
> [root at mw-test-n2 ~]# cat /proc/drbd
> version: 0.7.21 (api:79/proto:74)
> SVN Revision: 2326 build by root at mw-test-n1.i-dis.net, 2006-09-11 16:41:09
>  0: cs:WFConnection st:Primary/Unknown ld:Consistent
>     ns:444 nr:402708 dw:403800 dr:12215 al:15 bm:18 lo:0 pe:0 ua:0 ap:0
> 
> >>node1 comes back up, mounts the drbd volume, and the change isn't there because:
> >>Sep 15 13:47:03 mw-test-n2 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data 
> >>corruption.
> >DRBD is doing the right thing here. Either your nodes weren't really synchronised before the failure, or you had a 
> >split brain where DRBD was primary on both machines.
> 
> the data was synced for sure. could it be that the problem is that when node1 comes back
> up, drbd on node1 is switched to primary before it has been synced?

you got a _split brain_.
ok, not a heartbeat split brain, but even worse: a resource internal
split brain, causing diverging data sets.

while drbd was not connected (not able to communicate), both nodes have
been primary, able to change the data set independently. ok, apparently
they have not been primary at the same time in your case, but there is
no way for drbd to tell that apart, and even then you still end up with
diverging data sets.

there are not many options for drbd to resolve that automagically.
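
if you know which node has the data you want to keep (here presumably
node2, since it took over and received the newer writes), you can
resolve it by hand: throw away the changes on the other node and let it
do a full resync. roughly like this with drbd 0.7 and the resource name
"ha" from your logs -- untested, check the drbdadm man page, and make
sure nothing is mounted on the victim and it is secondary:

# on the node whose data you want to DISCARD (node1 in this scenario)
drbdadm secondary ha
drbdadm disconnect ha
drbdadm invalidate ha
drbdadm connect ha

# on the node whose data you want to KEEP, re-establish the connection
# if it is sitting in StandAlone
drbdadm connect ha

# then watch the full resync run and finish
cat /proc/drbd

after that both nodes are consistent again, but whatever was written on
the discarded side while the nodes were disconnected is gone, of course.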

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :
__
please use the "List-Reply" function of your email client.


