[SPARK-10494] Multiple Python UDFs together with aggregation or sort merge join may cause OOM (failed to acquire memory) - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.5.0
Fix Version/s: 1.5.1, 1.6.0
Component/s: PySpark, SQL
Labels: None
Target Version/s: 1.5.1, 1.6.0

Description

The RDD cache for Python UDFs was removed in 1.4, so N Python UDFs in one query stage can end up evaluating the upstream SparkPlan 2^N times.

In 1.5, if the upstream SparkPlan contains an aggregation or a sort merge join, this can cause an OOM (failed to acquire memory).
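
The 2^N growth can be illustrated with a toy model in plain Python. This is only a sketch and not Spark's actual code: it assumes each UDF stage computes its child once to feed the UDF and a second time to pair the UDF output back with the original rows, with no cache in between. The names `Source`, `UdfStage`, and `count_source_evaluations` are hypothetical and exist only for this illustration.

```python
class Source:
    """Stand-in for the upstream plan; counts how often it is computed."""

    def __init__(self, rows):
        self.rows = rows
        self.evaluations = 0

    def compute(self):
        self.evaluations += 1
        return list(self.rows)


class UdfStage:
    """Hypothetical UDF stage that has no cache for its child's output."""

    def __init__(self, child, func):
        self.child = child
        self.func = func

    def compute(self):
        # First pass over the child: the rows to which the result is appended.
        inputs = self.child.compute()
        # Second pass over the child: the rows fed to the UDF.
        outputs = [self.func(row) for row in self.child.compute()]
        return [row + (out,) for row, out in zip(inputs, outputs)]


def count_source_evaluations(n_udfs):
    """Stack n_udfs stages on one source and count source evaluations."""
    source = Source([(1,), (2,), (3,)])
    plan = source
    for _ in range(n_udfs):
        plan = UdfStage(plan, lambda row: row[0] + 1)
    plan.compute()
    return source.evaluations


if __name__ == "__main__":
    for n in range(5):
        print(n, count_source_evaluations(n))
```

Under this assumption every stage doubles the work of the stage below it, so an expensive upstream operator such as an aggregation or a sort merge join would be instantiated many times within the same task.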

Issue Links

duplicates

SPARK-8632 Poor Python UDF performance because of RDD caching (Resolved)
SPARK-10685 Misaligned data with RDD.zip and DataFrame.withColumn after repartition (Resolved)
SPARK-10714 Refactor PythonRDD to decouple iterator computation from PythonRDD (Resolved)

People

Assignee: Reynold Xin
Reporter: Davies Liu
Votes: 0
Watchers: 2

Dates

Created: 08/Sep/15 22:21
Updated: 23/Sep/15 19:06
Resolved: 23/Sep/15 18:05