Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Description
Recently there have been some I/O issues on our Ozone cluster's datanodes that cause WriteLog to time out with this exception:
2023-11-27 18:05:36,593 [dac3eec1-7886-4278-b0db-d6b62a046d44@group-2BEC613EA29B-SegmentedRaftLogWorker] WARN org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: Timeout 0/~
org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s: WriteLog:337065866: (t:7, i:337065866), STATEMACHINELOGENTRY, cmdType: WriteChunk traceID: "" containerID: 3470188 datanodeUuid: "18b90e19-6a7d-4858-a907-4acdf04a3949" pipelineID: "9de85e00-968d-4b32-a589-2bec613ea29b" writeChunk { blockID { containerID: 3470188 localID: 107544269209449942 blockCommitSequenceId: 337065456 } chunkData { chunkName: "107544269209449942_chunk_9" offset: 33554432 len: 4194304 checksumData { type: CRC32 bytesPerChecksum: 1048576 checksums: "\224\316\216\327" checksums: "\326\223/\201" checksums: "O\3374\336" checksums: "\313\231Bf" } } }, container path=/mnt/disk/3/ozone/hdds/CID-354bf147-15b8-4d31-9ff0-c0bce64451cf/current/containerDir121/3470188-writeStateMachineData
    at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:89)
    at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StateMachineDataPolicy.getFromFuture(SegmentedRaftLogWorker.java:89)
    at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$WriteLog.execute(SegmentedRaftLogWorker.java:561)
    at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:321)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
    at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:80)
    ... 4 more
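For context, the exception above is the log worker giving up on a bounded wait for the state machine data write: the write is submitted as a CompletableFuture, and the raw TimeoutException from the bounded get is wrapped into TimeoutIOException. The sketch below is a hedged approximation of that pattern; the helper class names and the TimeoutIOException stand-in are my simplifications, not the actual Ratis sources.
{code:java}
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Simplified stand-in for org.apache.ratis.protocol.exceptions.TimeoutIOException.
class TimeoutIOException extends IOException {
  TimeoutIOException(String message, Throwable cause) {
    super(message, cause);
  }
}

final class FutureWaitSketch {
  // Approximation of the IOUtils.getFromFuture(...) call in the stack trace:
  // wait a bounded amount of time for the state machine data write future to
  // complete, and surface a timeout as an IOException subtype instead of the
  // raw java.util.concurrent.TimeoutException.
  static <T> T getFromFuture(CompletableFuture<T> future, String name, long timeoutSeconds)
      throws IOException, InterruptedException, ExecutionException {
    try {
      return future.get(timeoutSeconds, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      // Produces a message of the same shape as "Timeout 10s: WriteLog:..." above.
      throw new TimeoutIOException("Timeout " + timeoutSeconds + "s: " + name, e);
    }
  }
}
{code}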
These timeouts show up as state machine data write timeouts in the metrics dashboard.
After these exceptions, the Pending Requests count in the leader queue is stuck and does not return to normal, even after I/O performance recovers.
Meanwhile, the leader pending-request limit hits did not increase, which means no client requests failed with ResourceUnavailableException (although there were earlier instances when these limits were hit). It seems that client operations are blocked instead.
My expectation is that once I/O performance has recovered, the leader's pending requests should be removed, which would also release the RequestLimits semaphore. Looking at LeaderStateImpl, there are several pending-request-related methods (a simplified sketch follows the list below):
- Acquiring pending requests: called in RaftServerImpl#appendTransaction after the StateMachine (e.g. ContainerStateMachine) finishes building the TransactionContext in StateMachine#startTransaction
  - tryAcquirePendingRequest: acquires the permit based on the current size and number of pending requests
    - If the pending request size or count limit is hit, ResourceUnavailableException should be thrown
  - addPendingRequest: using the permit, adds the client request to the pending requests
- Releasing pending requests: called when applying the log to the state machine (meaning the transaction / log entry has been successfully committed)
  - replyPendingRequest: releases the pending request element and size
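To make that lifecycle concrete, here is a minimal, hypothetical model of the bookkeeping described above. The class, fields, and counter are my own simplification for illustration, not the actual LeaderStateImpl / RequestLimits code; the property it mirrors is that a permit taken in tryAcquirePendingRequest is only given back when replyPendingRequest completes the request.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical, simplified model of the leader's pending-request admission control.
// Method names mirror the list above, but this is NOT the actual Ratis implementation.
final class PendingRequestsModel {
  private final Semaphore elementLimit;                       // stand-in for the RequestLimits element semaphore
  private final Map<Long, Object> pending = new ConcurrentHashMap<>();

  PendingRequestsModel(int maxPendingRequests) {
    this.elementLimit = new Semaphore(maxPendingRequests);
  }

  // Modeled after tryAcquirePendingRequest: take a permit, or fail fast.
  // In Ratis, failing here is what surfaces to the client as ResourceUnavailableException.
  boolean tryAcquirePendingRequest() {
    return elementLimit.tryAcquire();
  }

  // Modeled after addPendingRequest: the acquired permit is now tied to this log index.
  void addPendingRequest(long logIndex, Object clientRequest) {
    pending.put(logIndex, clientRequest);
  }

  // Modeled after replyPendingRequest: runs when the log entry is applied to the
  // state machine, and is the only place in this model where the permit is released.
  void replyPendingRequest(long logIndex) {
    if (pending.remove(logIndex) != null) {
      elementLimit.release();
    }
  }
}
{code}
The property to notice in this model: if a request makes it through tryAcquirePendingRequest and addPendingRequest but its log entry is never applied, nothing ever calls replyPendingRequest for that index and the permit is held forever.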
After looking at the code for a bit, I suspect that the TimeoutIOException from WriteLog does not release the leader's pending requests, which seems to cause some client operations to be blocked. However, I'm not entirely sure what the expected behavior is when TimeoutIOException is thrown (will it be retried indefinitely?).
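As a toy illustration of why that would matter (plain java.util.concurrent.Semaphore, nothing Ratis-specific): if the permit tied to a timed-out WriteLog entry is never released on the failure path, the leader's available capacity never recovers, no matter how healthy the disks are afterwards.
{code:java}
import java.util.concurrent.Semaphore;

// Toy illustration of the suspected leak; purely illustrative, not Ratis code.
public class PendingRequestLeakDemo {
  public static void main(String[] args) {
    Semaphore limit = new Semaphore(2);   // pretend the leader allows 2 pending requests

    limit.tryAcquire();                   // request A admitted
    limit.tryAcquire();                   // request B admitted

    limit.release();                      // request A is applied and replied -> permit returned

    // Request B's WriteLog timed out. If no failure path ever releases its permit,
    // the leader stays one permit short from now on, even after I/O recovers.
    System.out.println("available permits = " + limit.availablePermits());   // prints 1, not 2
  }
}
{code}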
Our Ozone deployment is still using a Ratis version based on the 2.4.1 release.
Attachments
Issue Links
- is related to HDDS-9959 Propagate group remove to other datanodes during pipeline close (Resolved)