Details
- Improvement
- Status: Resolved
- Major
- Resolution: Won't Fix
- None
- None
Description
The network is inherently unreliable (see IGNITE-18605), and RAFT snapshot streaming can take tens of minutes. The current implementation fails on the first lost or improperly handled message, so if a glitch occurs in the middle of a long snapshot installation, a lot of resources are wasted.
The idea is to:
- Make requests idempotent, so that a lost request can be repeated. SnapshotMetaRequest is already idempotent. SnapshotMvDataRequest and SnapshotTxDataRequest can carry a sequenceNo, which the receiving side increments for each new request but leaves unchanged when retrying a previous request. As there is only one receiver, and it always works sequentially, the sending side only needs to additionally remember the previous response and its sequenceNo to support idempotent retries that do not advance the cursor a second time.
- On the receiving side, specify a sane per-request timeout (a few seconds, for example) and retry requests that fail or time out, using the correct sequenceNo.
- If an error happens while processing a request on the sending side, return an error indication to the receiver instead of just dropping the message. This way the receiver learns sooner that a retry is needed, and it can also see whether it should stop retrying because the failure is fatal.
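The three points above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Ignite code: SnapshotSender, SnapshotReceiver, Transport and FatalSnapshotException are hypothetical names, rows are plain strings, an empty batch is assumed to mark the end of the stream, and the transport is assumed to throw when a request fails or times out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical marker for failures that the receiver must not retry. */
final class FatalSnapshotException extends RuntimeException {
    FatalSnapshotException(String message) {
        super(message);
    }
}

/** Sending side: remembers only the previous batch and its sequenceNo. */
final class SnapshotSender {
    private final Iterator<String> cursor;
    private final int batchSize;
    private long lastSeqNo = -1;
    private List<String> lastBatch = new ArrayList<>();

    SnapshotSender(Iterator<String> cursor, int batchSize) {
        this.cursor = cursor;
        this.batchSize = batchSize;
    }

    synchronized List<String> handle(long sequenceNo) {
        if (sequenceNo == lastSeqNo) {
            // Retry of the previous request: replay the cached batch, do not advance the cursor.
            return lastBatch;
        }
        if (sequenceNo != lastSeqNo + 1) {
            // Neither a retry nor the next request: report a fatal error instead of staying silent.
            throw new FatalSnapshotException("Unexpected sequenceNo " + sequenceNo);
        }
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && cursor.hasNext()) {
            batch.add(cursor.next());
        }
        lastSeqNo = sequenceNo;
        lastBatch = batch;
        return batch;
    }
}

/** Receiving side: works sequentially and retries with the same sequenceNo. */
final class SnapshotReceiver {
    /** Hypothetical transport; expected to throw on a lost message, an error, or a timeout. */
    interface Transport {
        List<String> request(long sequenceNo) throws Exception;
    }

    static List<String> download(Transport transport, int maxAttempts) {
        List<String> result = new ArrayList<>();
        long sequenceNo = 0;
        while (true) {
            List<String> batch = requestWithRetry(transport, sequenceNo, maxAttempts);
            if (batch.isEmpty()) {
                return result; // assumed end-of-stream marker
            }
            result.addAll(batch);
            sequenceNo++; // incremented only after a successful response
        }
    }

    private static List<String> requestWithRetry(Transport transport, long sequenceNo, int maxAttempts) {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return transport.request(sequenceNo);
            } catch (FatalSnapshotException e) {
                throw e; // fatal failure: stop retrying immediately
            } catch (Exception e) {
                last = e; // retriable failure: repeat with the same sequenceNo
            }
        }
        throw new IllegalStateException("Gave up on sequenceNo " + sequenceNo, last);
    }
}
```

In this sketch a response that is lost after the sender has already processed the request does not skip any rows, because the retry carries the same sequenceNo and the sender replays the cached batch. A real implementation would also need an actual per-request timeout in the transport and a bound on how long retries continue.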
Issue Links
- is related to
- IGNITE-18605 Account for inherent unreliability of messaging (Open)
- IGNITE-18630 Try to deliver a message until receiver drops out from logical topology (Resolved)