Description
In PySpark, combining the 'first' aggregate with two or more 'countDistinct' aggregates in a single agg() call on the same groupBy object raises a Py4JJavaError. Either aggregate works on its own, and 'first' together with a single 'countDistinct' also works, as the examples below show.
from pyspark.sql import SparkSession
import pyspark.sql.functions as sfn

sparkSession = SparkSession.builder.master('local').getOrCreate()

df = sparkSession.createDataFrame([
    (1, 'a', 'z'),
    (1, 'b', 'x'),
    (1, 'a', 'y'),
    (1, 'a', 'x'),
    (2, 'b', 'z'),
    (2, 'b', 'z')
], ['id', 'var1', 'var2'])

## Using two 'first' and one 'countDistinct' aggregation works
df.groupby('id') \
    .agg(sfn.first('var1'),
         sfn.first('var2'),
         sfn.countDistinct('var1')).show()

## Using one 'max' with both 'countDistinct' works:
df.groupby('id') \
    .agg(sfn.max('var2'),
         sfn.countDistinct('var1'),
         sfn.countDistinct('var2')).show()

## But using both 'countDistinct' with at least one 'first' crashes
df.groupby('id') \
    .agg(sfn.first('var1'),
         sfn.first('var2'),
         sfn.countDistinct('var1'),
         sfn.countDistinct('var2')) \
    .show()
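A possible workaround (a sketch, not verified against the affected build; the 'first_var1'/'n_var1' aliases are illustrative) is to compute the distinct counts in a separate aggregation and join the result back on 'id', so that 'first' never shares an agg() call with multiple 'countDistinct' aggregates:

## Compute the 'first' aggregates and the distinct counts separately
firsts = df.groupby('id') \
    .agg(sfn.first('var1').alias('first_var1'),
         sfn.first('var2').alias('first_var2'))
distincts = df.groupby('id') \
    .agg(sfn.countDistinct('var1').alias('n_var1'),
         sfn.countDistinct('var2').alias('n_var2'))

## Joining on 'id' yields all four columns without mixing the aggregates
firsts.join(distincts, on='id').show()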
Issue Links
- duplicates SPARK-16648: LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException (Resolved)