Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
We use ScaleCube for discovery. It uses the SWIM protocol, which relies on timeouts. This means that a node might fail to send a ping/pong message in time (for example, if it encounters a long JVM pause), which results in the other nodes thinking that it has disappeared.
First, we should carefully choose the default values for the timeouts: if they are too low, the probability of a node being dropped from the cluster is high; if they are too high, some tests may take much longer.
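To make the trade-off concrete, a rough worst-case detection latency can be estimated from the ping settings. The formula and all the numbers below are an illustrative sketch of how SWIM-style detection time grows with the timeouts, not ScaleCube's actual defaults or its exact detection model:

```java
// Rough sketch: estimating worst-case SWIM failure-detection latency.
// In SWIM, a member is pinged directly and then, on timeout, indirectly
// through 'pingReqMembers' peers before being suspected and removed.
// All parameter values here are illustrative assumptions.
public class SwimLatencyEstimate {
    static long worstCaseDetectionMillis(
            long pingIntervalMs, long pingTimeoutMs, int pingReqMembers, long suspicionMs) {
        // one protocol period to select the member, a direct ping timeout,
        // an indirect ping round, then the suspicion period before removal
        return pingIntervalMs + pingTimeoutMs * (1 + pingReqMembers) + suspicionMs;
    }

    public static void main(String[] args) {
        // Aggressive settings detect real failures fast, but a long JVM pause
        // easily exceeds them and the node gets dropped spuriously...
        long aggressive = worstCaseDetectionMillis(500, 200, 3, 1_000);
        // ...while conservative settings tolerate pauses at the cost of
        // slower failure detection (and slower tests).
        long conservative = worstCaseDetectionMillis(1_000, 1_000, 3, 5_000);
        System.out.println(aggressive);   // 2300
        System.out.println(conservative); // 10000
    }
}
```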
Also, we should account for the possibility of a node being dropped from the cluster. This means that:
- When we get a node from the physical topology, we must always check that it was actually returned and handle 'no such node in the topology' gracefully
- Long message exchanges (for instance, streaming a RAFT snapshot) must be robust enough to survive short disappearances of nodes (it would be bad to waste a snapshot installation that already took 30 minutes just because of a transient failure)
- We need to avoid hangs when someone waits for a message that never gets delivered due to such a transient failure (could this be the cause of IGNITE-18506, where a message sometimes never arrives at an internal cursor?)
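The first point above can be sketched as follows. `PhysicalTopology` and `resolveNode` are hypothetical names invented for this example, not Ignite's actual topology API; the point is only that a lookup may legitimately return nothing because SWIM dropped the node, so callers must handle absence instead of assuming presence:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a defensive topology lookup (hypothetical API, not Ignite's).
public class PhysicalTopology {
    private final Map<String, String> nodesById = new ConcurrentHashMap<>();

    void addNode(String id, String address) {
        nodesById.put(id, address);
    }

    void removeNode(String id) {
        nodesById.remove(id);
    }

    // Return Optional instead of a bare reference: a node that was dropped
    // from the cluster is a normal outcome, not an exceptional one.
    Optional<String> resolveNode(String id) {
        return Optional.ofNullable(nodesById.get(id));
    }

    public static void main(String[] args) {
        PhysicalTopology topology = new PhysicalTopology();
        topology.addNode("node-1", "10.0.0.1:3344");

        // Normal case: the node is present.
        System.out.println(topology.resolveNode("node-1").orElse("<gone>"));

        // Transient failure: SWIM dropped the node. The caller should retry
        // or fail with a clear error instead of dereferencing null and
        // producing an NPE (cf. IGNITE-18139).
        topology.removeNode("node-1");
        System.out.println(topology.resolveNode("node-1").orElse("<gone>"));
    }
}
```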
Issue Links
- relates to:
  - IGNITE-18139 Fix NPE produced by a call from InternalTableImpl#enlistWithRetry() (Open)
  - IGNITE-18292 Ignite 3 cluster sometimes hangs making KeyValueViewPocoTests.TestContains fail (Open)
  - IGNITE-18611 Change Scalecube-related timeouts to Scalecube defaults (Resolved)
  - IGNITE-18612 Make RAFT snapshot streaming resistant to network glitches (Resolved)