[DRBD-user] drbd with heartbeat doesn't sync both ways

Christophe Zwecker doc at zwecker.de
Mon Sep 18 03:36:30 CEST 2006



Tim Jackson wrote:
> Christophe Zwecker wrote:
> 
>> node1 is primary with mounted fs
>> node2 is secondary
>>
>> node1 goes down (only a network failure),
> 
> "only" network failure? Which network? In many cases, a network failure 
> alone is worse than one box completely failing, because it can cause 
> "split brain" if you're not careful.
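A toy Python model of why a link failure can be worse than a dead box. It is not Heartbeat's actual logic; the `Node` class and `on_peer_lost` rule are hypothetical. Under a naive failover rule, each node promotes itself as soon as it stops hearing its peer. When the primary box really dies, only the survivor reacts. When only the link dies, both boxes are alive, both react, and both end up primary.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    primary: bool = False


def on_peer_lost(node: Node) -> None:
    # Naive failover rule (hypothetical): silence from the peer is taken to mean
    # "peer is dead", so this node promotes itself to primary.
    node.primary = True


def primaries(*nodes: Node) -> list[str]:
    return [n.name for n in nodes if n.primary]


def box_failure() -> list[str]:
    # node1 (primary) really dies: only node2 is left to notice and react.
    node2 = Node("node2")
    on_peer_lost(node2)
    return primaries(node2)


def link_failure() -> list[str]:
    # Both boxes stay up but can no longer hear each other, so BOTH react.
    node1, node2 = Node("node1", primary=True), Node("node2")
    on_peer_lost(node1)
    on_peer_lost(node2)
    return primaries(node1, node2)


print(box_failure())   # ['node2']
print(link_failure())  # ['node1', 'node2']  -> split brain
```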

I unplugged the network cable from node1, leaving the crossover cable
between node1 and node2 connected.

> What connections do you have for Heartbeat to use? (A serial heartbeat 
> is always a good idea if you can have it). As many redundant paths as 
> possible is good. (typical might be 3: replication (crossover) network 
> between the DRBD machines, "normal" network and serial heartbeat)
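The benefit of redundant paths can be sketched as a rule where the peer is declared dead only when *every* path has been silent for longer than the dead time. This is a simplified model, not Heartbeat's implementation. `deadtime` echoes the `ha.cf` setting of that name, and the path names and timestamps below are made up.

```python
def peer_is_dead(last_heard: dict[str, float], now: float, deadtime: float) -> bool:
    """Return True only if no heartbeat arrived on ANY path within `deadtime` seconds.

    last_heard maps a path name (e.g. "eth0", "crossover", "serial") to the time
    the last heartbeat from the peer was received on that path.
    """
    return all(now - t > deadtime for t in last_heard.values())


# Hypothetical timestamps: eth0 has been unplugged for 60 s, but the crossover
# cable and the serial link are still delivering heartbeats.
last_heard = {"eth0": 940.0, "crossover": 999.5, "serial": 999.0}
print(peer_is_dead(last_heard, now=1000.0, deadtime=30.0))  # False

# Only when every path goes quiet is the peer declared dead.
silent = {path: 900.0 for path in last_heard}
print(peer_is_dead(silent, now=1000.0, deadtime=30.0))      # True
```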

I use a crossover cable for testing; I'll add a serial link for production.

>> Heartbeat unmounts the DRBD filesystem on node1. node2 takes over and
>> mounts the DRBD volume.
> 
> And what happens to node1 here? Are you sure that Heartbeat stops the 
> DRBD services? My guess is that you have a single network connection for 
> both DRBD and Heartbeat, in which case DRBD will still be primary on node1.

Yes, Heartbeat stops DRBD on node1 and starts it on node2:

heartbeat[17239]: 2006/09/15_15:08:42 WARN: node 192.168.1.254: is dead
heartbeat[17239]: 2006/09/15_15:08:42 info: Link 192.168.1.254:192.168.1.254 dead.
harc[18084]:    2006/09/15_15:08:42 info: Running /etc/ha.d/rc.d/status status
heartbeat[17239]: 2006/09/15_15:08:54 info: mw-test-n1.i-dis.net wants to go standby [all]
heartbeat[17239]: 2006/09/15_15:08:55 info: standby: mw-test-n2.i-dis.net can take our all resources
heartbeat[18103]: 2006/09/15_15:08:55 info: give up all HA resources (standby).
ResourceManager[18113]: 2006/09/15_15:08:55 info: Releasing resource group: mw-test-n1.i-dis.net drbddisk::ha Filesystem::/dev/drbd0::/ha::ext3 192.168.1.123 httpd mysql
ResourceManager[18113]: 2006/09/15_15:08:55 info: Running /etc/init.d/mysql  stop
ResourceManager[18113]: 2006/09/15_15:08:59 info: Running /etc/init.d/httpd  stop
ResourceManager[18113]: 2006/09/15_15:08:59 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.123 stop
IPaddr[18295]:  2006/09/15_15:08:59 INFO: /sbin/route -n del -host 192.168.1.123
IPaddr[18295]:  2006/09/15_15:08:59 INFO: /sbin/ifconfig eth0:0 192.168.1.123 down
IPaddr[18295]:  2006/09/15_15:08:59 INFO: IP Address 192.168.1.123 released
IPaddr[18225]:  2006/09/15_15:08:59 INFO: IPaddr Success
ResourceManager[18113]: 2006/09/15_15:08:59 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha ext3 stop
Filesystem[18415]:      2006/09/15_15:09:00 INFO: Running stop for /dev/drbd0 on /ha
Filesystem[18415]:      2006/09/15_15:09:00 INFO: unmounted /ha successfully
Filesystem[18351]:      2006/09/15_15:09:00 INFO: Filesystem Success
ResourceManager[18113]: 2006/09/15_15:09:00 info: Running /etc/ha.d/resource.d/drbddisk ha stop
heartbeat[18103]: 2006/09/15_15:09:00 info: all HA resource release completed (standby).
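The release order in this log is the resource group line read in reverse. The group is listed as `drbddisk::ha Filesystem::/dev/drbd0::/ha::ext3 192.168.1.123 httpd mysql`, and Heartbeat stops `mysql` first and `drbddisk` last. The Python sketch below reproduces that order from the group line. It assumes the usual `haresources` conventions: `::` separates a script name from its arguments, and a bare IP address stands for `IPaddr`. The function names are hypothetical.

```python
import re

# A bare dotted-quad in haresources is shorthand for IPaddr::<address>.
IP_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")


def parse_resource(spec: str) -> tuple[str, list[str]]:
    """Split an entry like 'Filesystem::/dev/drbd0::/ha::ext3' into (script, args)."""
    if IP_RE.match(spec):
        return "IPaddr", [spec]
    script, *args = spec.split("::")
    return script, args


def stop_sequence(group_line: str) -> list[tuple[str, list[str]]]:
    """Resources are started left to right, so they are stopped right to left."""
    _node, *specs = group_line.split()
    return [parse_resource(spec) for spec in reversed(specs)]


group = ("mw-test-n1.i-dis.net drbddisk::ha "
         "Filesystem::/dev/drbd0::/ha::ext3 192.168.1.123 httpd mysql")
for script, args in stop_sequence(group):
    print(" ".join([script, *args, "stop"]))
```

This prints `mysql stop`, `httpd stop`, `IPaddr 192.168.1.123 stop`, `Filesystem /dev/drbd0 /ha ext3 stop` and `drbddisk ha stop`, which is the same sequence as the log. Demoting DRBD comes last because the filesystem on `/dev/drbd0` must be unmounted first.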


So on node1:
[root@mw-test-n1 ~]# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@mw-test-n1.i-dis.net, 2006-09-11 16:41:09
  0: cs:WFConnection st:Secondary/Unknown ld:Consistent
     ns:402708 nr:444 dw:403368 dr:14442 al:104 bm:381 lo:0 pe:0 ua:0 ap:0


On node2:
[root@mw-test-n2 ~]# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@mw-test-n1.i-dis.net, 2006-09-11 16:41:09
  0: cs:WFConnection st:Primary/Unknown ld:Consistent
     ns:444 nr:402708 dw:403800 dr:12215 al:15 bm:18 lo:0 pe:0 ua:0 ap:0
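Both outputs show the device in `cs:WFConnection` (waiting for the peer), with node1 `Secondary` and node2 `Primary`. At the moment these snapshots were taken, only one side was primary. The Python helper below extracts those fields and flags the dangerous case where both nodes report themselves `Primary`. It is a sketch that assumes the DRBD 0.7 status line format shown above, and the function names are hypothetical.

```python
import re

# DRBD 0.7 device status line, e.g. "  0: cs:WFConnection st:Primary/Unknown ld:Consistent"
STATUS_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)\s+st:(\w+)/(\w+)\s+ld:(\w+)")


def parse_proc_drbd(text: str) -> dict[int, dict[str, str]]:
    """Map device minor -> connection state, local role, peer role, local disk state."""
    devices = {}
    for line in text.splitlines():
        m = STATUS_RE.match(line)
        if m:
            minor, cs, local, peer, ld = m.groups()
            devices[int(minor)] = {"cs": cs, "local": local, "peer": peer, "ld": ld}
    return devices


def both_primary(node_a: str, node_b: str, minor: int = 0) -> bool:
    """True if both nodes' own role for this device is Primary (split brain)."""
    return (parse_proc_drbd(node_a)[minor]["local"] == "Primary"
            and parse_proc_drbd(node_b)[minor]["local"] == "Primary")


node1 = """version: 0.7.21 (api:79/proto:74)
  0: cs:WFConnection st:Secondary/Unknown ld:Consistent
     ns:402708 nr:444 dw:403368 dr:14442 al:104 bm:381 lo:0 pe:0 ua:0 ap:0
"""
node2 = """version: 0.7.21 (api:79/proto:74)
  0: cs:WFConnection st:Primary/Unknown ld:Consistent
     ns:444 nr:402708 dw:403800 dr:12215 al:15 bm:18 lo:0 pe:0 ua:0 ap:0
"""
print(parse_proc_drbd(node2)[0])
print(both_primary(node1, node2))  # False
```

This check only covers the instant the two snapshots were taken. To catch a split brain, both nodes would need to be sampled throughout the outage.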

>> node1 comes back up, mounts the DRBD volume, and the change isn't
>> there because:
>> Sep 15 13:47:03 mw-test-n2 kernel: drbd0: Current Primary shall become 
>> sync TARGET! Aborting to prevent data corruption.
> 
> DRBD is doing the right thing here. Either your nodes weren't really 
> synchronised before the failure, or you had a split brain where DRBD was 
> primary on both machines.
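The kernel message quoted above describes a safety rule. When the nodes reconnect, one side is chosen as sync source and the other as sync target, and the target's differing blocks are overwritten from the source. If the side chosen as target is the one currently running as Primary, DRBD aborts instead. The toy Python model below illustrates only that rule. The `generation` field is a hypothetical stand-in for DRBD's on-disk generation counts, whose real comparison is more involved.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    primary: bool     # current role when the connection comes back
    generation: int   # toy stand-in for DRBD's meta-data generation counts


def on_reconnect(a: Peer, b: Peer) -> str:
    if a.generation == b.generation:
        return "no resync needed"
    # The side judged to have newer data becomes sync source; the other is overwritten.
    source, target = (a, b) if a.generation > b.generation else (b, a)
    if target.primary:
        # Overwriting the disk under a running Primary would corrupt live data,
        # so refuse instead (cf. "Current Primary shall become sync TARGET!").
        return f"abort: current Primary {target.name} would become sync target"
    return f"resync {source.name} -> {target.name}"


# node2 took over and wrote data, node1 returns as Secondary: normal resync.
print(on_reconnect(Peer("node1", False, 1), Peer("node2", True, 2)))
# node1 is judged to have newer data while node2 is Primary: abort.
print(on_reconnect(Peer("node1", False, 3), Peer("node2", True, 2)))
```

In this toy model, the abort only happens when the returning node looks newer than the current Primary. The split-brain case described above, where node1 also stayed primary and kept writing, is one way that could come about.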

The data was definitely synced. Could the problem be that when node1
comes back up, DRBD on node1 is switched to primary before it has been
synced?


-- 
Christophe Zwecker
:Sysctl
Koppel 96
20099 Hamburg
phon: +49 40 41263790
  fax: +49 40 41263799
mail: czwecker at sysctl.de


