Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, I think I found what is arguably a small bug (cosmetic, not functional) in DRBD 0.7.19 (haven't checked in newer versions yet so it may already be fixed although the changelog doesn't seem to mention it). It's minor but makes analysing logs rather confusing. Let's look at the resource statuses (not connection statuses) that are logged during a typical failover scenario. We have node1 and node2. At T=0, the nodes are Secondary/Secondary and have consistent data. At T=1, node2 goes primary. All fine. At T=2, node2 fails (cstate=Unconnected) At T=3, node1 is made primary At T=4, node2 recovers Here are example log entries that DRBD 0.7.19 produces on node1 for the above scenario. Note that I am just showing the resource status transitions. ------------------------------------------------------------- T0. [ nodes are consistent ] T1. node1 kernel: drbdX: Secondary/Secondary --> Secondary/Primary T2. [ node2 fails here, cstate=Unconnected ] T3. node1 kernel: drbdX: Secondary/Unknown --> Primary/Unknown T4. node1 kernel: drbdX: Primary/Unknown --> Primary/Secondary ------------------------------------------------------------- However, note that the status at the end of the log line for T=1 is not the same as the status at the start of the T=3 log line. To my mind, the log is missing a line at T=2. It's true that at T=2, there will be lots of logging of cstate (cstate will rapidly go from Connected -> NetworkFailure -> BrokenPipe -> Unconnected -> WFConnection) but I think there should also be a log of the resource state. With this, the log would read as follows: ------------------------------------------------------------- T0. [ nodes are consistent ] T1. node1 kernel: drbdX: Secondary/Secondary --> Secondary/Primary T2. node1 kernel: drbdX: Secondary/Primary --> Secondary/Unknown T3. node1 kernel: drbdX: Secondary/Unknown --> Primary/Unknown T4. node1 kernel: drbdX: Primary/Unknown --> Primary/Secondary ------------------------------------------------------------- That way, it makes more sense and doesn't look like there is something missing. It makes it easier to verify the exact sequence of events, especially if there is quite a long time (e.g. months) between T1 and T3 (which there often is, because a box may be stable for months and then have a failover. In a worst case, the logs for T=1 may even have been rotated away.) It would be nice to easily verify that the status really was Secondary/Primary just before T=2. Tim