Description
The RPC client has a default timeout set to 0 when no timeout is passed in. This means that the network connection created will not time out when used to write data. The issue has shown up in YARN-2578 and HDFS-4858. Timeouts for writes then fall back to TCP-level retries (configured via tcp_retries2), which only kick in after 15-30 minutes. That is far too long for a default behaviour.
Using 0 as the default value for the timeout is incorrect. We should use a sane default, and the "ipc.ping.interval" configuration value is a logical choice for it. The default behaviour should be changed from 0 to the ping interval value read from the Configuration.
Fixing it in common makes more sense than finding and changing all other points in the code that do not pass in a timeout.
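A minimal sketch of the proposed fallback, with illustrative names (the helper class and method below are not the actual patch, and the 60-second default simply mirrors the documented default of ipc.ping.interval): when the caller supplies no timeout, resolve it from the ping interval in the Configuration instead of leaving it at 0.

import org.apache.hadoop.conf.Configuration;

public final class RpcTimeoutDefaults {
  // Configuration key used by the IPC client; 60 seconds is its documented default.
  private static final String IPC_PING_INTERVAL_KEY = "ipc.ping.interval";
  private static final int IPC_PING_INTERVAL_DEFAULT_MS = 60000;

  /**
   * Resolve the effective RPC timeout. A requested timeout of 0 (the old
   * "no timeout" default) falls back to the configured ping interval, so the
   * connection no longer relies on TCP-level retries (tcp_retries2) that can
   * take 15-30 minutes to fire.
   */
  public static int effectiveRpcTimeout(int requestedTimeoutMs, Configuration conf) {
    if (requestedTimeoutMs > 0) {
      return requestedTimeoutMs; // caller supplied an explicit timeout
    }
    return conf.getInt(IPC_PING_INTERVAL_KEY, IPC_PING_INTERVAL_DEFAULT_MS);
  }

  private RpcTimeoutDefaults() {}
}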
Offending code lines:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
and
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350
Attachments
Issue Links
- breaks
  - HADOOP-14958 CLONE - Fix source-level compatibility after HADOOP-11252 (Resolved)
  - HADOOP-13579 Fix source-level compatibility after HADOOP-11252 (Closed)
- duplicates
  - HADOOP-9654 IPC timeout doesn't seem to be kicking in (Resolved)
- is depended upon by
  - HADOOP-11574 Uber-JIRA: improve Hadoop network resilience & diagnostics (Open)
- is duplicated by
  - YARN-2578 NM does not failover timely if RM node network connection fails (Resolved)
  - YARN-5119 Timeout HA issue: Enable IPC ping for all calls by default (Resolved)
- is related to
  - YARN-2714 Localizer thread might stuck if NM is OOM (Open)
- relates to
  - HADOOP-12672 RPC timeout should not override IPC ping interval (Resolved)