Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.1.3
- Fix Version/s: None
- Component/s: None
Description
When inserting data into Hive, the insert occasionally fails with messages like
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1605060173780_0039_2_00, diagnostics=[Task failed, taskId=task_1605060173780_0039_2_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Container container_1605060173780_0039_01_000002 finished with diagnostics set to [Container failed, exitCode=-104. [2020-11-11 02:35:11.768]Container [pid=16810,containerID=container_1605060173780_0039_01_000002] is running 7729152B beyond the 'PHYSICAL' memory limit. Current usage: 1.0 GB of 1 GB physical memory used; 2.5 GB of 2.1 GB virtual memory used. Killing container.
Specifically, the TezChild container used slightly more physical memory than its limit (here, roughly 7 MB over the 1 GB cap), so the container was killed.
Identifying how to resolve this is somewhat fraught:
- Our docs offer no clear troubleshooting advice for this error. Googling led to several forums with a mix of good and awful advice; https://community.cloudera.com/t5/Community-Articles/Demystify-Apache-Tez-Memory-Tuning-Step-by-Step/ta-p/245279 is probably the best one.
- The issue itself comes down to Tez allocating 80% of the container memory limit to the Java heap (-Xmx); depending on other memory usage (thread stacks, JIT, other JVM overhead), the remaining 20% can be too little headroom. By comparison, when running in a cgroup, the JVM defaults -Xmx to 25% of the memory limit.
- Identifying the right parameters to tune, and verifying that they had been set correctly, was a bit challenging. We ended up experimenting with tez.container.max.java.heap.fraction, hive.tez.container.size, and yarn.scheduler.minimum-allocation-mb, and verified that each change took effect by watching the process arguments (with htop) for a change in -Xmx. We definitely had some missteps figuring out when a property is hive.tez.container.* versus tez.container.*.
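To make the heap-fraction arithmetic above concrete, here is a small sketch. The 1 GB container and the 80% default come from the log and discussion above; the exact defaults on any given cluster are an assumption and can vary by Tez/Hive version.

```python
# Rough heap-vs-headroom arithmetic for a Tez container.
# Assumes the 80% default heap fraction discussed above (an assumption;
# actual defaults depend on the Tez/Hive version and cluster config).

def heap_and_headroom(container_mb, heap_fraction):
    """Return (-Xmx in MB, non-heap headroom in MB) for a container."""
    xmx = int(container_mb * heap_fraction)
    return xmx, container_mb - xmx

# The failing case: 1 GB container, 0.8 heap fraction.
print(heap_and_headroom(1024, 0.8))   # (819, 205) -> ~205 MB for everything non-heap

# Either fix widens the non-heap headroom for stacks, JIT, and JVM overhead:
print(heap_and_headroom(1024, 0.75))  # lower tez.container.max.java.heap.fraction
print(heap_and_headroom(2048, 0.8))   # raise hive.tez.container.size
```

This shows why the container dies by only a few MB: ~205 MB of non-heap headroom is easily exhausted by JVM overhead, while either tuning knob roughly doubles it.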
In the end, any one of the following seems to have worked for us:
- SET yarn.scheduler.minimum-allocation-mb=2048
- SET tez.container.max.java.heap.fraction=0.75
- SET hive.tez.container.size=2048
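The "watch the process arguments for a change in -Xmx" check mentioned above can be sketched as follows. The sample command line below is illustrative, not captured from a real cluster; on a live worker node you would pull the real arguments from ps or htop (e.g. `ps -ef | grep TezChild`).

```python
import re

# Extract the -Xmx flag from a TezChild command line to confirm a
# config change took effect. The string below is a hypothetical example.
cmdline = "java -Xmx1536m -server org.apache.tez.runtime.task.TezChild"

match = re.search(r"-Xmx(\d+)([mg])", cmdline)
if match:
    print(f"heap limit: {match.group(1)}{match.group(2)}")  # heap limit: 1536m
```

If -Xmx does not change after you alter tez.container.max.java.heap.fraction or hive.tez.container.size, the property likely was not picked up (or you used the wrong prefix, per the hive.tez.container.* vs tez.container.* confusion above).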
Issue Links
- is related to
  - HIVE-18308 Error inserting data into many partitions (Open)
  - IMPALA-10316 load_nested.py failed due to out of memory during Jenkins GVO (Resolved)
  - HIVE-22172 I have an external table and I am trying to insert data into to it, i have checked the mappings and even compared the script to a similar one and everything looks ok, but I keep having the error message below (Resolved)
  - HIVE-22171 Issues while trying to insert into a table (Open)