[DRBD-user] Logging bug in DRBD 0.7.19

Fri Sep 15 15:27:50 CEST 2006

Hi,

I think I found what is arguably a small bug (cosmetic, not functional) 
in DRBD 0.7.19 (haven't checked in newer versions yet so it may already 
be fixed although the changelog doesn't seem to mention it). It's minor 
but makes analysing logs rather confusing.

Let's look at the resource statuses (not connection statuses) that are 
logged during a typical failover scenario. We have node1 and node2.

At T=0, the nodes are Secondary/Secondary and have consistent data.

At T=1, node2 goes primary. All fine.

At T=2, node2 fails (cstate=Unconnected)

At T=3, node1 is made primary

At T=4, node2 recovers

Here are example log entries that DRBD 0.7.19 produces on node1 for the 
above scenario. Note that I am just showing the resource status transitions.

-------------------------------------------------------------
T0. [ nodes are consistent ]
T1. node1 kernel: drbdX: Secondary/Secondary --> Secondary/Primary
T2. [ node2 fails here, cstate=Unconnected ]
T3. node1 kernel: drbdX: Secondary/Unknown --> Primary/Unknown
T4. node1 kernel: drbdX: Primary/Unknown --> Primary/Secondary
-------------------------------------------------------------

However, note that the status at the end of the log line for T=1 is not 
the same as the status at the start of the T=3 log line. To my mind, the 
log is missing a line at T=2. It's true that at T=2, there will be lots 
of logging of cstate (cstate will rapidly go from Connected -> 
NetworkFailure -> BrokenPipe -> Unconnected -> WFConnection) but I think 
there should also be a log of the resource state.  With this, the log 
would read as follows:

-------------------------------------------------------------
T0. [ nodes are consistent ]
T1. node1 kernel: drbdX: Secondary/Secondary --> Secondary/Primary
T2. node1 kernel: drbdX: Secondary/Primary --> Secondary/Unknown
T3. node1 kernel: drbdX: Secondary/Unknown --> Primary/Unknown
T4. node1 kernel: drbdX: Primary/Unknown --> Primary/Secondary
-------------------------------------------------------------

That way, it makes more sense and doesn't look like there is something 
missing. It makes it easier to verify the exact sequence of events, 
especially if there is quite a long time (e.g. months) between T1 and T3 
(which there often is, because a box may be stable for months and then 
have a failover. In a worst case, the logs for T=1 may even have been 
rotated away.) It would be nice to easily verify that the status really 
was Secondary/Primary just before T=2.

Tim