Hi *,

I would like to get more information about the real-life risk of data loss with DRBD when using no-disk-barrier and no-disk-flushes on RAID controllers without a BBU.

If I understand things correctly, DRBD inserts barriers into the data stream from primary to secondary (at least) on each flush of the underlying primary device. Without barrier support, it flushes the secondary on each flush of the primary. This is done to make sure that, in case of a failover, subsequent operations that rely on the data being committed to disk find the same state on the secondary.

If I use a RAID controller with a BBU, the BBU takes care that all data that has reached the controller cache survives (some) crashes or power failures. But what are the scenarios where I really suffer data loss without a BBU? And is my risk of data loss higher with DRBD than it would be without?

The primary use case for DRBD, as I see it, is the failure of one node in the cluster, which leads to a failover to the secondary. In this case we have one survivor, and this survivor has plenty of time to flush all data from its cache to disk before the failover proceeds. Meanwhile, reads would return the cached data. The benefit I get from the BBU in this situation is this flush time. After that time, the data on disk is exactly the same, so there is no additional protection against data corruption that might arise from faulty data sent by the primary during the crash. As soon as this data is in the secondary's cache, it will be written to disk sooner or later.

If this is correct so far, the remaining risk is a simultaneous (power) failure of both nodes. If this happens, there are several causes of trouble:

- I suffer real service downtime although I have spent so much money on high availability.
- I might get asked why I did not spend the little extra money on independent UPSes for both nodes.
- Data on the secondary might have been written out of order, leading to an inconsistent state.
- On the primary, without a BBU, a queued flush might or might not have succeeded, but the write order is correct.

I will likely suffer data loss in this scenario, but there is no additional risk from using DRBD. On boot after (power) recovery, the primary needs a file system check to clean up possible damage, but this is exactly the same risk as in the standalone case. Even with a BBU (on the primary), in this scenario I would trust the primary's data more than the secondary's. So the only case where I would really get extra reliability from barriers and in-order flushes on the secondary would be if only my secondary had a BBU and the primary did not.

What is your opinion, and possibly your experience, with using no-disk-barrier and no-disk-flushes without a BBU RAID?

The reason I am asking is the huge latency I suffer when using flushes in my setup, where I run several virtual KVM instances in DRBD containers without a BBU RAID. These virtual systems flush their disks frequently, and these operations occasionally queue up to a substantial epoch of 100 or even higher.

Best regards,
Sebastian
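
For readers following along, the two options asked about above are set in the disk section of a DRBD resource. The fragment below is only a minimal sketch, assuming DRBD 8.3 configuration syntax. The resource name r0 is hypothetical, and the rest of the resource definition (devices, hosts, addresses) is deliberately left out rather than guessed.

```
resource r0 {            # "r0" is a hypothetical resource name
  disk {
    no-disk-barrier;     # do not use write barriers on the backing device
    no-disk-flushes;     # do not issue disk flushes to the backing device
  }
  # ... remaining resource definition omitted ...
}
```

In DRBD 8.4 the same settings are written as boolean options, `disk-barrier no;` and `disk-flushes no;`, again inside the disk section.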