  1. Ignite
  2. IGNITE-18605

Account for inherent unreliability of messaging


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • 3.0
    • networking

    Description

      We use ScaleCube for discovery. It uses the SWIM protocol, which relies on timeouts. This means that a node might fail to send a ping/pong message in time (for example, because of a long JVM pause), which makes other nodes think that it has disappeared.

      First, we should carefully choose the default values for timeouts: if they are too low, the probability of dropping a healthy node from the cluster is very high; if they are too high, some tests might take a lot longer.
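
      To make the trade-off concrete, here is a rough sketch in Java. It assumes a deliberately simplified model in which a node is dropped once it stays silent for longer than a single detection timeout; the real SWIM suspicion mechanism is more involved. The class, the method names and the pause values are illustrative assumptions, not ScaleCube or Ignite configuration.

```java
import java.time.Duration;
import java.util.List;

/** Illustrative model only: a node is dropped if it stays silent longer than the detection timeout. */
final class TimeoutTradeoff {
    /** Returns true if a pause of the given length would get the node dropped. */
    static boolean wouldBeDropped(Duration pause, Duration detectionTimeout) {
        return pause.compareTo(detectionTimeout) > 0;
    }

    /** Counts how many of the observed pauses would cause a false removal. */
    static long falseRemovals(List<Duration> observedPauses, Duration detectionTimeout) {
        return observedPauses.stream()
                .filter(p -> wouldBeDropped(p, detectionTimeout))
                .count();
    }

    public static void main(String[] args) {
        // Hypothetical pauses (GC, safepoints) observed on a healthy node.
        List<Duration> pauses = List.of(
                Duration.ofMillis(50), Duration.ofMillis(400), Duration.ofSeconds(3));

        // A larger timeout tolerates more pauses but delays detection of real failures.
        for (Duration timeout : List.of(
                Duration.ofMillis(200), Duration.ofSeconds(1), Duration.ofSeconds(5))) {
            System.out.println("timeout=" + timeout
                    + " falseRemovals=" + falseRemovals(pauses, timeout));
        }
    }
}
```

      In this model, every increase in the timeout that removes a false removal also adds the same amount to the time it takes to notice a node that has really failed, which is where the slower tests come from.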

      Also, we should account for the possibility of a node being dropped from the cluster. This means that:

      1. When we get a node from the physical topology, we must always check that a node was actually returned and handle 'no such node in the topology' gracefully.
      2. Long message exchanges (for instance, streaming a RAFT snapshot) must be robust enough to survive short disappearances of nodes (it would be bad to waste a snapshot installation that has already taken 30 minutes just because of a transient failure).
      3. We need to avoid hangs in cases where someone waits for a message that never gets delivered due to such a transient failure (could this be the cause of IGNITE-18506, where sometimes a message never arrives at an internal cursor?).
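
      The three points above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a proposed implementation: `UnreliableMessagingSketch`, `findNode`, `sendResumable` and `withTimeout` are hypothetical names, the topology is modelled as a plain map, and a chunk send is modelled as a predicate that returns whether the chunk was acknowledged. Only `CompletableFuture.orTimeout` (Java 9+) is a real library call.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.IntPredicate;

/** Hypothetical helpers; none of these names are Ignite or ScaleCube APIs. */
final class UnreliableMessagingSketch {
    /** Stand-in for the physical topology: node ID -> address. */
    private final Map<String, String> topology = new ConcurrentHashMap<>();

    void nodeAppeared(String id, String address) { topology.put(id, address); }

    void nodeDisappeared(String id) { topology.remove(id); }

    /** (1) The lookup makes "no such node in the topology" an explicit case. */
    Optional<String> findNode(String id) {
        return Optional.ofNullable(topology.get(id));
    }

    /**
     * (2) Sends chunks in order and retries a failed chunk instead of
     * restarting the whole transfer. Returns the number of chunks delivered.
     */
    static int sendResumable(int totalChunks, int maxAttemptsPerChunk, IntPredicate sendChunk) {
        int next = 0;
        while (next < totalChunks) {
            boolean delivered = false;
            for (int attempt = 0; attempt < maxAttemptsPerChunk && !delivered; attempt++) {
                delivered = sendChunk.test(next);
            }
            if (!delivered) {
                break; // Give up, but the caller knows how far the transfer got.
            }
            next++;
        }
        return next;
    }

    /** (3) Bounds the wait so that a lost reply fails the future instead of hanging forever. */
    static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> reply, long timeoutMillis) {
        return reply.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

      A caller would then branch on `findNode(id).isPresent()` rather than assume the node exists, resume a transfer from the returned chunk index, and treat a `TimeoutException` from the bounded future as a signal to retry or fail the operation.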

            People

              Assignee: Unassigned
              Reporter: Roman Puchkovskiy (rpuch)
