[IGNITE-752] Speed up failure detection - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: sprint-7
Component/s: None
Labels: None

Description

I think we can (1) make grid configuration significantly easier and (2) speed up failure detection.

Here are the discovery SPI configuration properties that are responsible for failure detection:

reconnectCount,
sockTimeout,
networkTimeout,
ackTimeout,
maxAckTimeout,
heartbeatFrequency,
maxMissedHeartbeats

The same applies to the communication SPI:

reconnectCount,
maxConnTimeout,
connTimeout

So we have 10 properties, or even more.

We introduced them to address the half-open socket problem (which is pretty common in cloud environments) and GC pauses that may happen on cluster nodes: we can increase ack timeouts to prevent paused nodes from being kicked out of the topology.

By setting values for these properties I am effectively setting a timeout for failure detection. Why do we need so many parameters instead of a single one on IgniteConfiguration, nodeResponseThreshold (or failureDetectionThreshold; can anyone propose a better name)?

All other parameters would be calculated automatically. (I think the user could still set some of them for full control over the situation; we need to decide whether this is needed.)
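
To make the proposal concrete, here is a minimal Java sketch of how a single threshold could fan out into the individual properties. The class name FailureDetectionSketch, the method deriveTimeouts, and the split rules (divide the budget evenly across reconnect attempts, cap the "max" values at the full threshold) are hypothetical and chosen only for illustration; the ticket does not say how the values would actually be calculated.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FailureDetectionSketch {
    /**
     * Derives illustrative timeout values (in milliseconds) from one threshold.
     * The derivation rules are assumptions made for this sketch, not Ignite behavior.
     */
    static Map<String, Long> deriveTimeouts(long thresholdMs, int reconnectCount) {
        if (thresholdMs <= 0 || reconnectCount <= 0)
            throw new IllegalArgumentException("thresholdMs and reconnectCount must be positive");

        // Assumption: each reconnect attempt gets an equal share of the total budget.
        long perAttempt = thresholdMs / reconnectCount;

        Map<String, Long> props = new LinkedHashMap<>();
        props.put("reconnectCount", (long) reconnectCount);
        props.put("sockTimeout", perAttempt);
        props.put("ackTimeout", perAttempt);
        // Assumption: upper bounds never exceed the overall threshold.
        props.put("maxAckTimeout", thresholdMs);
        props.put("networkTimeout", thresholdMs);
        props.put("connTimeout", perAttempt);
        props.put("maxConnTimeout", thresholdMs);
        return props;
    }

    public static void main(String[] args) {
        // The user supplies one number; everything else is computed from it.
        deriveTimeouts(10_000L, 2).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

With this shape the user reasons about one value (how long a node may stay unresponsive), while explicit per-property overrides could still be layered on top if full control turns out to be needed.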

Attachments

475-2.patch | 29/Jul/15 08:54 | 6 kB | Denis A. Magda
failure_detection_timeout_node_left.zip | 29/Jul/15 08:54 | 142 kB | Denis A. Magda
ignite-752.patch | 23/Jul/15 07:49 | 151 kB | Denis A. Magda
882.patch | 02/Jul/15 12:52 | 41 kB | Denis A. Magda

Issue Links

is related to

IGNITE-7704 Document IgniteConfiguration, TcpDiscoverySpi, TcpCommunicationSpi timeouts and their relations (Open)

People

Assignee: Yakov Zhdanov

Reporter: Yakov Zhdanov

Votes: 0

Watchers: 3

Dates

Created: 14/Apr/15 20:24

Updated: 14/Feb/18 14:06

Resolved: 24/Jul/15 12:27