[YARN-11494] Acquired Containers are killed when the node is reconnected - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.3.3
Fix Version/s: None
Component/s: resourcemanager
Labels: None

Description

When a NodeManager reconnects, the ResourceManager marks the acquired containers on that node as LOST, which leads to job failure.

2023-04-10 02:57:16,412 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService (IPC Server handler 41 on 8025): Reconnect from the node at: node1
2023-04-10 02:57:16,412 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService (IPC Server handler 41 on 8025): NodeManager from node node1(cmPort: 8041 httpPort: 8042) registered with capability: <memory:122880, vCores:16>, assigned nodeId node1:8041, node labels { CORE } 
2023-04-10 02:57:16,413 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_e15_1677844874019_238016_01_000002 Container Transitioned from ACQUIRED to KILLED
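To check whether other containers were affected in the same way, the ResourceManager log can be scanned for the pattern shown above. The following is a minimal sketch, not part of YARN: the function name `acquired_killed_after_reconnect` and the sample list are hypothetical, and the regular expressions assume the message wording seen in this excerpt. Because the transition line does not name a node, pairing a container with the most recent reconnect message is only a heuristic.

```python
import re

# Assumed message formats, copied from the log excerpt in this report.
RECONNECT = re.compile(r"Reconnect from the node at: (?P<node>\S+)")
TRANSITION = re.compile(
    r"(?P<container>container_\S+) Container Transitioned from "
    r"(?P<old>\w+) to (?P<new>\w+)"
)


def acquired_killed_after_reconnect(lines):
    """Return (node, container_id) pairs for ACQUIRED -> KILLED transitions
    that appear after a reconnect message.

    Heuristic: a transition is attributed to the most recently
    reconnected node, since the transition line itself names no node.
    """
    hits = []
    last_node = None
    for line in lines:
        reconnect = RECONNECT.search(line)
        if reconnect:
            last_node = reconnect.group("node")
            continue
        transition = TRANSITION.search(line)
        if (
            transition
            and last_node is not None
            and transition.group("old") == "ACQUIRED"
            and transition.group("new") == "KILLED"
        ):
            hits.append((last_node, transition.group("container")))
    return hits


# Sample input: shortened copies of two lines from the excerpt above.
SAMPLE_LOG = [
    "2023-04-10 02:57:16,412 INFO ResourceTrackerService: "
    "Reconnect from the node at: node1",
    "2023-04-10 02:57:16,413 INFO RMContainerImpl: "
    "container_e15_1677844874019_238016_01_000002 "
    "Container Transitioned from ACQUIRED to KILLED",
]

if __name__ == "__main__":
    for node, container in acquired_killed_after_reconnect(SAMPLE_LOG):
        print(node, container)
```

On a real log, pass the open file object instead of `SAMPLE_LOG`; the function only iterates over lines.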

People

Assignee: Prabhu Joseph
Reporter: Prabhu Joseph
Votes: 0
Watchers: 1

Dates

Created: 12/May/23 06:21
Updated: 12/May/23 06:21