Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
/ 2005-04-29 10:19:23 +0200
\ Anquijix Schiptara:
> Hi there!
>
> I'm running heartbeat with drbd mirroring.
> The heartbeats are being sent over eth1 and ttyS0. The drbd devices
> are mirrored over eth1, too.
> If eth1 breaks, no more data is being mirrored, but the heartbeats
> still get through over ttyS0. Is that a problem?
> Imagine the server crashes after eth1 broke. The slave takes over
> the services, but the data isn't up-to-date, because mirroring
> doesn't work anymore.
> So I've lost some important data.
>
> Is there a possibility to prevent this problem?

Difficult.

We have some concepts for the next generation of drbd and heartbeat
(because obviously drbd and the cluster manager must cooperate on
this). It involves "resource fencing".

Basically, whenever drbd loses its peer, it freezes in mid air,
waiting for someone (the cluster manager) to confirm that the peer
has been "fenced".

If I am Primary and I lose my peer, I will suspend IO.
Why? Because the peer might be about to shoot me, because it thinks
I am dead -- to actually _know_, it has to use STONITH before it
takes over services. In the time between losing the connection and
being shot, I could confirm transactions to applications, which would
then be lost after the takeover. Therefore I suspend IO and wait for
the cluster manager to tell me "hey, I made sure that the other node
won't STONITH you, and will not try to take over. please resume
operation". Then I resume.

If I am Secondary and I lose my peer, I will refuse to do anything
until the cluster manager confirms that I am still up-to-date, that
the former Primary has been "fenced", and that I must now go to
Primary state myself.

To "fence" the peer, the cluster manager needs to still have some
communication path between the nodes. It has to recognize the loss
of the DRBD connection (or, more generally speaking, the degradation
of a replicated and/or distributed resource), and then decide

 * to tell the current Secondary it is outdated, then resume
   operation on the Primary, OR

 * to tell the current Primary to go Secondary and flag itself
   outdated, then resume the former Secondary and promote it to
   Primary. Typically, services/users running on top of the current
   Primary need to be killed hard, and possibly suspended in-flight
   IO needs to be canceled. Still, it might be the better choice:
   that node might have lost communication with its clients, and
   just killing services might be cheaper than a full STONITH and
   reboot operation.

If the cluster manager itself has no communication anymore, it needs
to shoot the other node. Whoever is successful can resume locally.
And there are some more complications...

But with current DRBD and Heartbeat, there is no such thing. You
could, however, script around it somehow, heavily relying on
timeouts, with lots of unlikely race conditions which will bite you
nevertheless, because if it can happen, it will...
(a rough sketch of such a script follows below)

> I thought about automatically changing the mirror device to eth0
> when eth1 fails.

Use a bonding device. For a start, see
/usr/src/linux/Documentation/networking/bonding.txt
(a minimal example also follows below)

	Lars Ellenberg

--
please use the "List-Reply" function of your email client.
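
As a rough illustration of the "script around it" idea above -- a
minimal sketch only, carrying exactly the timeout and race problems
the mail warns about. It assumes the DRBD 0.7-style /proc/drbd
format (a per-device line with a "cs:" connection-state field); the
marker file path and the idea of a takeover wrapper that checks it
are hypothetical, not anything shipped with drbd or heartbeat:

  #!/bin/sh
  # Sketch: watch the DRBD connection state and remember, across a
  # later crash of the peer, that replication was already broken.
  # Assumes a 0.7-style /proc/drbd line like
  #   " 0: cs:Connected st:Primary/Secondary ld:Consistent"
  MARKER=/var/run/drbd0-peer-disconnected   # hypothetical path
  MINOR=0                                   # DRBD minor to watch

  while sleep 5; do
      if grep -q "^ *$MINOR: cs:Connected" /proc/drbd; then
          # replication link is up (again): clear the flag
          rm -f "$MARKER"
      else
          # we lost the peer; record it so a takeover wrapper can
          # refuse to promote possibly stale data
          touch "$MARKER"
      fi
  done

A takeover wrapper would then refuse (or at least delay) "drbdadm
primary" while the marker file exists, until an operator or a
successful STONITH confirms the old Primary is really gone.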
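
And for the bonding pointer, a minimal active-backup setup along the
lines of bonding.txt might look like this; the address is a
placeholder, and the mode/miimon/primary values are just the usual
starting points from that document, not taken from this thread:

  # /etc/modules.conf (2.4 kernels) or /etc/modprobe.conf (2.6):
  alias bond0 bonding
  options bond0 mode=active-backup miimon=100 primary=eth1

  # then bring the bond up and enslave both NICs
  # (10.0.0.1 is a placeholder address):
  modprobe bond0
  ifconfig bond0 10.0.0.1 netmask 255.255.255.0 up
  ifenslave bond0 eth1 eth0

Each node's "address" statement in drbd.conf would then point at its
bond0 IP, so mirroring keeps flowing if either physical link dies.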