[YARN-11355] YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 - ASF JIRA

Details

Type: Bug
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.4.0
Fix Version/s: None
Component/s: client
Labels: None

Description

The YARN client fails over to rm2 immediately, but the subsequent failover to rm3 takes about 30000 ms during the initial retry.

Repro:

1. Set up a YARN cluster with three master nodes: rm1, rm2, and rm3.
2. Make rm3 the active ResourceManager.
3. Run yarn node -list or any other YARN client call. The call takes more than 30 seconds.

The initial failover to rm2 is immediate, but the failover to rm3 happens only after ~30000 ms. The current RetryPolicy does not take the number of master nodes into account. It should perform at least one immediate failover to every RM.

2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From local to remote:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 failover attempts. Trying to failover after sleeping for 21139ms.
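
The behaviour requested above can be sketched in plain Java. This is only an illustration of the idea, not the attached patch and not Hadoop's actual RetryPolicy code: the class FailoverDelaySketch, the method failoverDelayMillis, and its parameters are hypothetical names, and the sleep is modelled as a fixed interval, whereas the log above shows the real client sleeping for a value (21139ms) that differs from the configured 30000.

```java
public class FailoverDelaySketch {

    /**
     * Hypothetical helper: how long to sleep before the next failover attempt.
     *
     * @param failoversSoFar  failovers already performed for this call
     * @param rmCount         number of configured ResourceManagers
     * @param retryIntervalMs configured sleep between retries
     */
    static long failoverDelayMillis(int failoversSoFar, int rmCount, long retryIntervalMs) {
        // Until every other RM has been tried once, fail over without sleeping.
        if (failoversSoFar < rmCount - 1) {
            return 0L;
        }
        // Every RM has been tried; back off before starting another round.
        return retryIntervalMs;
    }

    public static void main(String[] args) {
        int rmCount = 3;                // rm1, rm2, rm3 as in the repro
        long retryIntervalMs = 30_000L; // interval quoted in the report
        for (int failovers = 0; failovers < 4; failovers++) {
            System.out.println("after " + failovers + " failover(s): sleep "
                    + failoverDelayMillis(failovers, rmCount, retryIntervalMs) + " ms");
        }
    }
}
```

With three RMs, the sketch returns 0 for the first two failovers (rm1 to rm2, then rm2 to rm3) and the retry interval only once all three have been tried. With two RMs it sleeps after the first failover, which is the behaviour the report describes today.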

Workaround:

Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to a small value such as 100. The failover to rm3 then happens almost immediately, but the client retries far too often when no ResourceManager is active.
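
To give a rough sense of the cost of this workaround, the sketch below estimates how many retries fit into a wait window when no ResourceManager is active. The 15-minute window and the names RetryCountEstimate and estimatedRetries are assumptions made for illustration. The estimate also ignores jitter, backoff, and connection time, so it is an upper bound rather than a measurement.

```java
public class RetryCountEstimate {

    /**
     * Rough upper bound on the number of retries that fit into a wait window,
     * assuming a fixed sleep between retries and no time spent connecting.
     */
    static long estimatedRetries(long waitWindowMs, long retryIntervalMs) {
        if (retryIntervalMs <= 0) {
            throw new IllegalArgumentException("retryIntervalMs must be positive");
        }
        return waitWindowMs / retryIntervalMs;
    }

    public static void main(String[] args) {
        long waitWindowMs = 900_000L; // assumed 15-minute window, for illustration only
        System.out.println("interval 30000 ms: about "
                + estimatedRetries(waitWindowMs, 30_000L) + " retries");
        System.out.println("interval 100 ms: about "
                + estimatedRetries(waitWindowMs, 100L) + " retries");
    }
}
```

Under these assumptions the retry count rises from about 30 to about 9000, which is the "too many retries" trade-off of the workaround.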

Attachments

YARN-11355.diff (4 kB), 06/Jan/23 20:06, Vineeth Naroju

People

Assignee: Vineeth Naroju
Reporter: Prabhu Joseph
Votes: 0
Watchers: 3

Dates

Created: 20/Oct/22 07:33
Updated: 13/Jan/23 07:11