  1. Ignite
  2. IGNITE-752

Speed up failure detection

Details

    • Bug
    • Status: Closed
    • Blocker
    • Resolution: Fixed
    • None
    • sprint-7
    • None
    • None

    Description

      I think we can (1) make grid configuration significantly easier and (2) speed up failure detection.

      Here are the discovery SPI configuration properties that are responsible for failure detection:

      1. reconnectCount,
      2. sockTimeout,
      3. networkTimeout,
      4. ackTimeout,
      5. maxAckTimeout,
      6. heartbeatFrequency,
      7. maxMissedHeartbeats

      The same applies to the communication SPI:

      1. reconnectCount,
      2. maxConnTimeout,
      3. connTimeout

      So, we have 10 or even more properties.

      We introduced them to address the half-open socket problem (which is pretty common in cloud environments) and GC pauses that may happen on cluster nodes: we can increase ack timeouts to prevent paused nodes from being kicked out of the topology.
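      To illustrate why these knobs are hard to reason about together, here is a rough sketch of how a few of them might combine into an effective detection time. It assumes (this is not stated in the issue) that the ack timeout doubles on each retry up to maxAckTimeout and that the heartbeat window is roughly heartbeatFrequency multiplied by maxMissedHeartbeats. The class and method names are hypothetical and are not part of Ignite.

```java
/** Hypothetical model of how several timeouts add up; not Ignite code. */
public class AckBudgetSketch {
    /**
     * Worst-case time spent waiting for acks across all attempts.
     * Assumption: the timeout doubles after each attempt, capped at maxAckTimeoutMs.
     */
    public static long worstCaseAckWaitMs(long ackTimeoutMs, long maxAckTimeoutMs, int reconnectCount) {
        long total = 0;
        long timeout = ackTimeoutMs;

        for (int attempt = 0; attempt < reconnectCount; attempt++) {
            total += Math.min(timeout, maxAckTimeoutMs);
            timeout = Math.min(timeout * 2, maxAckTimeoutMs);
        }

        return total;
    }

    /** Assumption: a node is suspected after this many heartbeat intervals pass silently. */
    public static long heartbeatWindowMs(long heartbeatFrequencyMs, int maxMissedHeartbeats) {
        return heartbeatFrequencyMs * maxMissedHeartbeats;
    }

    public static void main(String[] args) {
        // Example values chosen for illustration only.
        System.out.println("ack wait, ms: " + worstCaseAckWaitMs(200, 600, 3));
        System.out.println("heartbeat window, ms: " + heartbeatWindowMs(2000, 1));
    }
}
```

      Even in this simplified model the effective detection time depends on five independent settings, which is the configuration burden described above.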

      By setting values for these properties I am effectively setting a timeout for failure detection. Why do we need so many parameters instead of a single one on IgniteConfiguration: nodeResponseThreshold (or failureDetectionThreshold; can anyone propose a better name)?

      All other parameters would be calculated automatically. (I think the user could still set some of them explicitly for full control over the situation; we need to decide whether this is needed.)
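      The proposal can be sketched as a single threshold from which the remaining settings are derived. The split used below (an equal share of the budget per attempt, with the maximums set to the full threshold) is purely an illustrative assumption, as are the class and method names; the issue does not specify how the values would actually be calculated.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: derive SPI timeouts from one failure detection threshold. */
public class FailureDetectionSketch {
    /**
     * Splits a single threshold (in ms) into derived settings.
     * The ratios are illustrative assumptions, not Ignite's actual rules.
     */
    public static Map<String, Long> derive(long thresholdMs, int reconnectCount) {
        if (thresholdMs <= 0 || reconnectCount <= 0)
            throw new IllegalArgumentException("thresholdMs and reconnectCount must be positive");

        // Assumption: each attempt gets an equal share of the total budget.
        long perAttempt = Math.max(1, thresholdMs / reconnectCount);

        Map<String, Long> derived = new LinkedHashMap<>();

        // Discovery SPI settings named in the issue.
        derived.put("sockTimeout", perAttempt);
        derived.put("ackTimeout", perAttempt);
        derived.put("maxAckTimeout", thresholdMs);

        // Communication SPI settings named in the issue.
        derived.put("connTimeout", perAttempt);
        derived.put("maxConnTimeout", thresholdMs);

        return derived;
    }

    public static void main(String[] args) {
        derive(10_000, 2).forEach((name, ms) -> System.out.println(name + " = " + ms + " ms"));
    }
}
```

      With this shape the user reasons about one number, and explicit overrides of individual settings could still be layered on top if full control is needed.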

      Attachments

        1. 882.patch
          41 kB
          Denis A. Magda
        2. ignite-752.patch
          151 kB
          Denis A. Magda
        3. failure_detection_timeout_node_left.zip
          142 kB
          Denis A. Magda
        4. 475-2.patch
          6 kB
          Denis A. Magda

            People

              yzhdanov Yakov Zhdanov
              yzhdanov Yakov Zhdanov
