Since you did not really ask a question, I'll just comment on your "event log" ...

> Today the card reported an error at 12:54:20, and I had some
> data trapped. seems after the secondary latched, the primary
> thought ti snuck in another 37 blocks...

> Also seems a little weird that the Primary pe: does not match
> the Secondary ua:, am I just reading something into the
> relationship there?

well, no, the relationship you see is correct, only that accessing
a moving variable of atomic_t is difficult on ONE node already,
so doing this asynchronously on TWO nodes ... ok, you got the point :)

arbitrarily normalized that garbage:

          primary
              ,-Connected/standAlone
          `* /    ns nr dw dr pe ua
> 12:52:55   C 3272 -500
> 12:53:05   C 3376 -396
> 12:53:08 * C 69744 0 68985 2
> 12:53:15   C 4352 580
> 12:53:18 * C 70324 580 69025

probably here the secondary already got stuck: nr and dw do not move
anymore, ua stays "high". ns on primary does still move by 144 blocks,
which probably just went into the "wire", i.e. filled up tcp buffers.

> 12:53:25   C 6948 3176 96
> 12:53:28 * C 72940 3196 69045 101
> 12:53:35   C 6948 3176 96

now finally primary can no longer submit any data, and starts to
ko-count the peer, probably at 53:40 or so.
(what is the correct English term for this in boxing sports?)
with 6s ping-timeout and ko-count starting at 10, this makes ...

> 12:53:58 * C 73088 3348 69045 138
  54:20 IO ERROR but you did not tell it to panic yet
> 12:54:38 * C 73088 3348 69045 138

... 60 seconds; right. somewhere around here ko-count reached zero,
and Primary decides: ok, peer no longer cooperates, going to do
my job *alone*. and immediately the dw increases again, your load
went high for some seconds... then all pending IO suddenly works
again, since it no longer blocks waiting for the peer.

> 12:54:48 * A 73088 4556 69045
> 12:54:58 * A 73088 5444 72281

secondary still did not recognize socket shutdown yet,
maybe it even does not yet know about the io-error?
> 12:56:05   C 6948 3176 96
> 12:56:15   C 6948 3176 96

at least, all works as expected. ... "surprisingly" ...
for a change, this one has "even" been tested before we did the release ... :P

Lars Ellenberg
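
For anyone who wants to redo the ko-count arithmetic from the message above, here is a minimal sketch. It assumes the peer loses one ko-count per expired ping-timeout and that counting started at about 12:53:40, which is only the estimate given in the message. The function name `standalone_time` is made up for this illustration and is not part of DRBD.

```python
from datetime import datetime, timedelta

PING_TIMEOUT_S = 6   # from the message: 6s ping-timeout
KO_COUNT = 10        # from the message: ko-count starting at 10


def standalone_time(start: str,
                    ping_timeout_s: int = PING_TIMEOUT_S,
                    ko_count: int = KO_COUNT) -> str:
    """Return the HH:MM:SS at which ko-count would reach zero.

    Assumption: exactly one decrement per expired ping-timeout,
    beginning at `start`.
    """
    t0 = datetime.strptime(start, "%H:%M:%S")
    t1 = t0 + timedelta(seconds=ping_timeout_s * ko_count)
    return t1.strftime("%H:%M:%S")


# Total time the primary keeps waiting for the peer.
budget_s = PING_TIMEOUT_S * KO_COUNT
print(budget_s)                     # 60

# Estimated moment the primary gives up on the peer.
print(standalone_time("12:53:40"))  # 12:54:40
```

Under these assumptions the estimate lands between the 12:54:38 sample (still C) and the 12:54:48 sample (already A), which matches the log.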