Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
We use ScaleCube for discovery. It uses the SWIM protocol, which relies on timeouts. This means that a node might fail to send a ping/pong message in time (for example, if it encounters a long JVM pause), which results in the other nodes thinking that it has disappeared.
First, we should carefully choose the default values for the timeouts: if they are too low, the probability of a node being dropped from the cluster is high; if they are too high, some tests may take much longer.
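To make the trade-off concrete, a rough worst-case detection latency can be estimated from the ping settings. The formula and all the numbers below are an illustrative sketch of how SWIM-style detection time grows with the timeouts, not ScaleCube's actual defaults or its exact detection model:

```java
// Rough sketch: estimating worst-case SWIM failure-detection latency.
// In SWIM, a member is pinged directly and then, on timeout, indirectly
// through 'pingReqMembers' peers before being suspected and removed.
// All parameter values here are illustrative assumptions.
public class SwimLatencyEstimate {
    static long worstCaseDetectionMillis(
            long pingIntervalMs, long pingTimeoutMs, int pingReqMembers, long suspicionMs) {
        // one protocol period to select the member, a direct ping timeout,
        // an indirect ping round, then the suspicion period before removal
        return pingIntervalMs + pingTimeoutMs * (1 + pingReqMembers) + suspicionMs;
    }

    public static void main(String[] args) {
        // Aggressive settings detect real failures fast, but a long JVM pause
        // easily exceeds them and the node gets dropped spuriously...
        long aggressive = worstCaseDetectionMillis(500, 200, 3, 1_000);
        // ...while conservative settings tolerate pauses at the cost of
        // slower failure detection (and slower tests).
        long conservative = worstCaseDetectionMillis(1_000, 1_000, 3, 5_000);
        System.out.println(aggressive);   // 2300
        System.out.println(conservative); // 10000
    }
}
```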
Also, we should account for the possibility of a node being dropped from the cluster. This means that:
- When we get a node from the physical topology, we must always check that it was actually returned and handle 'no such node in the topology' gracefully
- Long message exchanges (for instance, streaming a RAFT snapshot) must be robust enough to survive short disappearances of nodes (it would be bad to waste a snapshot installation that already took 30 minutes just because of a transient failure)
- We need to avoid hangs when someone waits for a message that never gets delivered due to such a transient failure (could this be the cause of IGNITE-18506, where a message sometimes never arrives at an internal cursor?)
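The first point above can be sketched as follows. `PhysicalTopology` and `resolveNode` are hypothetical names invented for this example, not Ignite's actual topology API; the point is only that a lookup may legitimately return nothing because SWIM dropped the node, so callers must handle absence instead of assuming presence:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a defensive topology lookup (hypothetical API, not Ignite's).
public class PhysicalTopology {
    private final Map<String, String> nodesById = new ConcurrentHashMap<>();

    void addNode(String id, String address) {
        nodesById.put(id, address);
    }

    void removeNode(String id) {
        nodesById.remove(id);
    }

    // Return Optional instead of a bare reference: a node that was dropped
    // from the cluster is a normal outcome, not an exceptional one.
    Optional<String> resolveNode(String id) {
        return Optional.ofNullable(nodesById.get(id));
    }

    public static void main(String[] args) {
        PhysicalTopology topology = new PhysicalTopology();
        topology.addNode("node-1", "10.0.0.1:3344");

        // Normal case: the node is present.
        System.out.println(topology.resolveNode("node-1").orElse("<gone>"));

        // Transient failure: SWIM dropped the node. The caller should retry
        // or fail with a clear error instead of dereferencing null and
        // producing an NPE (cf. IGNITE-18139).
        topology.removeNode("node-1");
        System.out.println(topology.resolveNode("node-1").orElse("<gone>"));
    }
}
```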
Issue Links
- relates to:
  - IGNITE-18139 Fix NPE produced by a call from InternalTableImpl#enlistWithRetry() (Open)
  - IGNITE-18292 Ignite 3 cluster sometimes hangs making KeyValueViewPocoTests.TestContains fail (Open)
  - IGNITE-18611 Change Scalecube-related timeouts to Scalecube defaults (Resolved)
  - IGNITE-18612 Make RAFT snapshot streaming resistant to network glitches (Resolved)