Hello list,

I described this problem before on the Pacemaker mailing list.

I am currently experimenting in my test lab with a DRBD dual-primary setup. I am running DRBD 8.4.5 on Debian Wheezy with Pacemaker 1.1.7. The problem is that in some situations Pacemaker promotes my DRBD resources to Primary so quickly that they are not yet connected. Once they connect, both are Primary, detect a split brain, and disconnect immediately.

I did some research and concluded that this is related to the master score the DRBD RA reports. For example, I stop a resource via Pacemaker (the resource is initially in Primary/Primary state). When I then start the resource again, I get a split brain. I can avoid this by setting a location constraint for the Master role with a rather low score of <-1000. This works because the DRBD RA on both nodes reports a score of 1000 as soon as the resource is started on both sides.

So here is my question. The RA meta data description states:

  adjust_master_score (string, [5 10 1000 10000]): master score adjustments
  Space separated list of four master score adjustments for different scenarios:
  - only access to 'consistent' data
  - only remote access to 'uptodate' data
  - currently Secondary, local access to 'uptodate' data, but remote is unknown

So 1000 is reported by default if the resource has uptodate data. Is it intended that a resource that was disconnected gracefully reports uptodate data after a restart, even if the other node was still there when it disconnected? I would expect at least one node to assume that the other most probably has newer data.

This also happens if I, for example, put one node into standby, reboot it, and let it rejoin the cluster. Again Pacemaker promotes the resource to Primary almost instantly, while DRBD is still in WFConnection state, and afterwards a split brain is detected.
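As an illustration of the location constraint workaround mentioned above, a minimal crm shell sketch could look like the following. The resource name `ms_drbd_r0` and the constraint ID are hypothetical, and -1001 is just one example of a score below -1000; this is an assumption about how such a constraint might be written, not the exact configuration used here.

```
# Hypothetical names: ms_drbd_r0 (master/slave DRBD resource), loc_drbd_master_penalty (constraint ID)
# Applies a score of -1001 to the Master role on every node, so that a
# master score of 1000 reported by the RA is not enough to trigger promotion.
location loc_drbd_master_penalty ms_drbd_r0 \
        rule $role="Master" -1001: defined #uname
```

With a rule like this, a master score of 10000 would still be high enough to allow promotion, whereas 1000 would not.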
In the standby case I again get a master score of 1000 from the RA. This seems odd to me, because I cannot make practical use of the "only remote access to 'uptodate' data" state: its score sits between two options that both break my cluster.

I also tried the stop_outdates_secondary="true" option. I assumed it would outdate the data on the secondary on any stop action, so that afterwards the RA would report a master score of 5 according to the documentation, but this does not seem to have any effect for me either. I know it is called outdates SECONDARY, but for a short moment while stopping, the resource should be Secondary too, if I understand this correctly.

I can provide log files if needed.

Thank you in advance for any hints,

regards,
Felix