Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
In XceiverServerRatis#newRaftProperties, setSyncTimeoutRetry is set twice.
First, it is set to
(int) nodeFailureTimeoutMs / dataSyncTimeout.toIntExact(TimeUnit.MILLISECONDS)
which by default evaluates to 300_000 ms / 10_000 ms = 30 retries.
From the comment, the intention of setting a finite number of retries is:
"Even if the leader is not able to complete write calls within the timeout seconds, it should just fail the operation and trigger a pipeline close. Failing the writeStateMachine call with limited retries will ensure that even the leader initiates a pipeline close if it is not able to complete the write within the configured timeout."
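For reference, the arithmetic behind that first assignment can be sketched as follows. This is an illustrative snippet, not the actual XceiverServerRatis code; the 300 s node failure timeout and 10 s data sync timeout are the defaults mentioned above.
import java.util.concurrent.TimeUnit;
import org.apache.ratis.util.TimeDuration;

public class SyncTimeoutRetryMath {
  public static void main(String[] args) {
    // Assumed defaults taken from the description: 300 s node failure timeout,
    // 10 s statemachine data sync timeout.
    long nodeFailureTimeoutMs = 300_000L;
    TimeDuration dataSyncTimeout =
        TimeDuration.valueOf(10_000, TimeUnit.MILLISECONDS);

    // Same formula as above: node failure timeout divided by the sync timeout.
    int syncTimeoutRetry = (int) nodeFailureTimeoutMs
        / dataSyncTimeout.toIntExact(TimeUnit.MILLISECONDS);

    System.out.println(syncTimeoutRetry); // prints 30 (300_000 / 10_000)
  }
}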
However, it is then overridden by
int numSyncRetries = conf.getInt(
    OzoneConfigKeys.DFS_CONTAINER_RATIS_STATEMACHINEDATA_SYNC_RETRIES,
    OzoneConfigKeys.DFS_CONTAINER_RATIS_STATEMACHINEDATA_SYNC_RETRIES_DEFAULT);
RaftServerConfigKeys.Log.StateMachineData.setSyncTimeoutRetry(properties,
    numSyncRetries);
which sets it back to the default value of -1 (retry indefinitely).
This might cause the leader to never initiate a pipeline close when its writeStateMachine calls time out (e.g. a write chunk timeout due to an I/O issue).
I propose we use the finite timeout retry calculation as the default configuration.
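A minimal sketch of that proposal is below. It assumes the surrounding fields of XceiverServerRatis#newRaftProperties (conf, properties, nodeFailureTimeoutMs, dataSyncTimeout) are in scope, and the helper name is hypothetical; this is not the actual patch. The finite value derived from the node failure timeout becomes the fallback default, while an explicitly configured retry count still takes precedence.
// Hypothetical helper: compute the finite retry count and pass it to
// conf.getInt as the default instead of the hard-coded -1.
private void setStateMachineDataSyncRetry(RaftProperties properties) {
  int defaultSyncRetries = (int) nodeFailureTimeoutMs
      / dataSyncTimeout.toIntExact(TimeUnit.MILLISECONDS);
  int numSyncRetries = conf.getInt(
      OzoneConfigKeys.DFS_CONTAINER_RATIS_STATEMACHINEDATA_SYNC_RETRIES,
      defaultSyncRetries);
  RaftServerConfigKeys.Log.StateMachineData.setSyncTimeoutRetry(properties,
      numSyncRetries);
}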
This is also a good opportunity to re-evaluate the state machine data policy in ContainerStateMachine.
Issue Links
- causes
  - HDDS-10717 nodeFailureTimeoutMs should be initialized before syncTimeoutRetry (Resolved)
- is related to
  - HDDS-4388 Make writeStateMachineTimeout retry count proportional to node failure timeout (Resolved)
  - RATIS-1947 TimeoutIOException in WriteLog might not release Pending Requests (Resolved)
- relates to
  - HDDS-1595 Handling IO Failures on the Datanode (Resolved)