[IGNITE-18612] Make RAFT snapshot streaming resistant to network glitches - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: 3.0
Component/s: None
Labels:
- ignite-3

Epic Link:
Storage support for Rebalancing

Description

The network is inherently unreliable (see IGNITE-18605). RAFT snapshot streaming might take dozens of minutes. The current implementation breaks on the first lost or improperly handled message, so if we get a glitch in the middle of a long snapshot installation, we'll waste a lot of resources.

The idea is to:

- Make requests idempotent (so that we can repeat a lost request). SnapshotMetaRequest is already idempotent. SnapshotMvDataRequest and SnapshotTxDataRequest can be given a sequenceNo (incremented on the receiving side for each new request, but not for a retry of a previous request). As there is only one receiver, and it always works sequentially, the sending side only has to additionally remember the previous response and its sequenceNo to support idempotent retries that do not advance the cursor excessively.
- On the receiving side, specify a sane timeout (for example, a few seconds) per request and retry requests that fail or time out (using the correct sequenceNo).
- If an error happens while processing a request on the sending side, return an error indication to the receiver instead of just dropping the message (so that the receiver learns sooner that a retry is needed, and can tell whether it should stop retrying because the failure is fatal).
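
The three points above can be sketched roughly as follows. This is a minimal illustration, not the actual Ignite code: the class names (`SnapshotRetrySketch`, `Sender`, `Response`, `FatalSnapshotException`), the `receiveAll` helper, the batch-of-strings payload, and the treatment of an unexpected sequenceNo as fatal are all assumptions made for the example. A timeout or a transient error is modeled simply as a `RuntimeException` thrown by the transport function.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.LongFunction;

public class SnapshotRetrySketch {
    /** Hypothetical response: one batch of rows tagged with the request's sequenceNo. */
    public static final class Response {
        public final long sequenceNo;
        public final List<String> rows;
        public final boolean last;

        public Response(long sequenceNo, List<String> rows, boolean last) {
            this.sequenceNo = sequenceNo;
            this.rows = rows;
            this.last = last;
        }
    }

    /** Hypothetical marker for failures after which retrying makes no sense. */
    public static class FatalSnapshotException extends RuntimeException {
        public FatalSnapshotException(String message) {
            super(message);
        }
    }

    /** Sending side: advances a cursor and remembers only the previous response. */
    public static final class Sender {
        private final Iterator<String> cursor;
        private final int batchSize;
        private Response previous;

        public Sender(Iterator<String> cursor, int batchSize) {
            this.cursor = cursor;
            this.batchSize = batchSize;
        }

        public synchronized Response handle(long sequenceNo) {
            // A retry of the previous request: replay the answer, do not touch the cursor.
            if (previous != null && previous.sequenceNo == sequenceNo) {
                return previous;
            }
            long expected = previous == null ? 0 : previous.sequenceNo + 1;
            if (sequenceNo != expected) {
                // Assumption: a gap in sequence numbers is reported as a fatal error.
                throw new FatalSnapshotException("Unexpected sequenceNo " + sequenceNo);
            }
            List<String> rows = new ArrayList<>();
            while (rows.size() < batchSize && cursor.hasNext()) {
                rows.add(cursor.next());
            }
            previous = new Response(sequenceNo, rows, !cursor.hasNext());
            return previous;
        }
    }

    /** Receiving side: sequential requests, each retried with the same sequenceNo. */
    public static List<String> receiveAll(LongFunction<Response> transport, int maxAttempts) {
        List<String> result = new ArrayList<>();
        long sequenceNo = 0;
        while (true) {
            Response response = null;
            for (int attempt = 1; response == null; attempt++) {
                try {
                    response = transport.apply(sequenceNo);
                } catch (FatalSnapshotException e) {
                    throw e; // the sender said the failure is fatal: stop retrying
                } catch (RuntimeException e) {
                    if (attempt >= maxAttempts) {
                        throw e; // transient error or timeout, but out of attempts
                    }
                }
            }
            result.addAll(response.rows);
            if (response.last) {
                return result;
            }
            sequenceNo++; // incremented only after a successful response
        }
    }
}
```

Because the receiver increments sequenceNo only after it has actually received a response, a response lost after the sender already processed the request is replayed from `previous` rather than causing the cursor to skip a batch.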

Issue Links

is related to

- IGNITE-18605 Account for inherent unreliability of messaging (Open)
- IGNITE-18630 Try to deliver a message until receiver drops out from logical topology (Resolved)

People

Assignee: Roman Puchkovskiy

Reporter: Roman Puchkovskiy

Dates

Created: 24/Jan/23 07:52

Updated: 04/Sep/24 16:19

Resolved: 01/Jul/24 13:43