Ignite / IGNITE-18612

Make RAFT snapshot streaming resistant to network glitches


Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Won't Fix
    • None
    • 3.0
    • None

    Description

      Network is inherently unreliable (see IGNITE-18605). RAFT snapshot streaming might take dozens of minutes. The current implementation breaks on the first lost or improperly handled message, so if we get a glitch in the middle of a long snapshot installation, we'll waste a lot of resources.

      The idea is to:

      1. Make requests idempotent (so that we can repeat a lost request). SnapshotMetaRequest is already idempotent. SnapshotMvDataRequest and SnapshotTxDataRequest can be given a sequenceNo, which the receiving side increases for each new request but does not increase for a retry of a previous request. As there is only one receiver, and it always works sequentially, the sending side only has to additionally remember the previous response and its sequenceNo to support idempotent retries that do not advance the cursor a second time.
      2. On the receiving side, specify a sane timeout (like a few seconds) per request and retry requests that fail or time out (using the correct sequenceNo).
      3. If an error happens while processing a request on the sending side, return an indication of the error to the receiver instead of just dropping the message (so that the receiver learns sooner that a retry is needed, AND the receiver can see whether it should stop retrying because the failure is fatal).
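
      The three steps above could be sketched roughly as follows. This is only an illustration under assumptions, not the Ignite implementation: the names SnapshotStreamingSketch, Sender, Response, Transport and receive are hypothetical, rows are plain strings, an empty batch is assumed to mark the end of the stream, and the transport is assumed to enforce the per-request timeout itself and report it as a TimeoutException.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch; none of these types exist in Ignite. */
final class SnapshotStreamingSketch {
    /** Response to a data request. 'fatal' tells the receiver to stop retrying. */
    static final class Response {
        final long sequenceNo;
        final List<String> rows;
        final boolean error;
        final boolean fatal;

        Response(long sequenceNo, List<String> rows, boolean error, boolean fatal) {
            this.sequenceNo = sequenceNo;
            this.rows = rows;
            this.error = error;
            this.fatal = fatal;
        }
    }

    /** Assumed transport: delivers a request, or throws if no response arrives in time. */
    interface Transport {
        Response send(long sequenceNo) throws TimeoutException;
    }

    /** Sending side: remembers the previous response so that a retry does not advance the cursor. */
    static final class Sender {
        private final Iterator<String> cursor;
        private final int batchSize;
        private Response last; // response to the most recently served sequenceNo

        Sender(Iterator<String> cursor, int batchSize) {
            this.cursor = cursor;
            this.batchSize = batchSize;
        }

        synchronized Response handle(long sequenceNo) {
            if (last != null && last.sequenceNo == sequenceNo) {
                return last; // retry of the previous request: replay the cached response
            }
            long expected = last == null ? 0 : last.sequenceNo + 1;
            if (sequenceNo != expected) {
                // Out-of-order request: report a fatal error instead of silently dropping it.
                return new Response(sequenceNo, List.of(), true, true);
            }
            List<String> rows = new ArrayList<>();
            while (rows.size() < batchSize && cursor.hasNext()) {
                rows.add(cursor.next());
            }
            last = new Response(sequenceNo, List.copyOf(rows), false, false);
            return last;
        }
    }

    /** Receiving side: retries each request with the same sequenceNo until it succeeds. */
    static List<String> receive(Transport transport, int maxAttempts) {
        List<String> received = new ArrayList<>();
        long sequenceNo = 0;
        while (true) {
            Response response = null;
            for (int attempt = 1; attempt <= maxAttempts && response == null; attempt++) {
                try {
                    Response r = transport.send(sequenceNo);
                    if (r.fatal) {
                        throw new IllegalStateException("Fatal error reported by sender");
                    }
                    if (!r.error) {
                        response = r; // a non-fatal error falls through to the next attempt
                    }
                } catch (TimeoutException e) {
                    // Lost request or lost response: retry with the same sequenceNo.
                }
            }
            if (response == null) {
                throw new IllegalStateException("No response after " + maxAttempts + " attempts");
            }
            if (response.rows.isEmpty()) {
                return received; // assumed end-of-stream marker
            }
            received.addAll(response.rows);
            sequenceNo++; // only a new request gets a new sequenceNo
        }
    }
}
```

      In this sketch, a response that is lost after the sender has already advanced its cursor is harmless: the receiver repeats the request with the same sequenceNo and gets the cached batch back, so no rows are skipped or duplicated.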

            People

              rpuch Roman Puchkovskiy
              rpuch Roman Puchkovskiy
              Votes: 0
              Watchers: 1

              Dates

                Created:
                Updated:
                Resolved: