Description
The RDD cache for Python UDFs was removed in 1.4, so N Python UDFs in one query stage can end up evaluating the upstream SparkPlan 2^N times.
In 1.5, if the upstream SparkPlan contains an aggregation or a sort merge join, those operators can cause an OOM (failed to acquire memory).
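The 2^N growth can be illustrated with a small pure-Python model. This is only a sketch, not Spark's code: it assumes that, without the cache, each Python UDF stage reads its child twice (once to feed the UDF and once to supply the rows the results are zipped with). The names `count_upstream_evaluations`, `source`, and `wrap` are hypothetical and exist only for this illustration.

```python
def count_upstream_evaluations(num_udfs):
    """Count how often the source is evaluated under the assumed model."""
    counter = {"evals": 0}

    def source():
        # Stands in for the upstream SparkPlan; every call is a full re-evaluation.
        counter["evals"] += 1
        return [1, 2, 3]

    def wrap(child, udf):
        # Hypothetical UDF stage with no cache: the child is computed twice.
        def stage():
            udf_out = [udf(x) for x in child()]  # first pass feeds the UDF
            rows = child()                       # second pass supplies the rows to zip
            return [r + o for r, o in zip(rows, udf_out)]
        return stage

    plan = source
    for _ in range(num_udfs):
        plan = wrap(plan, lambda x: x + 1)

    plan()  # run the whole chain once
    return counter["evals"]


if __name__ == "__main__":
    for n in range(5):
        print(n, count_upstream_evaluations(n))
```

Under this assumption every additional UDF doubles the number of upstream evaluations, so an expensive upstream operator such as an aggregation is re-run exponentially often.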
Issue Links
- duplicates
- SPARK-8632 Poor Python UDF performance because of RDD caching (Resolved)
- SPARK-10685 Misaligned data with RDD.zip and DataFrame.withColumn after repartition (Resolved)
- SPARK-10714 Refactor PythonRDD to decouple iterator computation from PythonRDD (Resolved)