Description
In PySpark, combining the 'first' aggregate with two or more 'countDistinct' aggregates in a single agg() call on the same groupBy object raises a Py4JJavaError. Either aggregate works on its own, and 'first' together with a single 'countDistinct' also works, as the examples below show.
from pyspark.sql import SparkSession
import pyspark.sql.functions as sfn

sparkSession = SparkSession.builder.master('local').getOrCreate()

df = sparkSession.createDataFrame([
    (1, 'a', 'z'),
    (1, 'b', 'x'),
    (1, 'a', 'y'),
    (1, 'a', 'x'),
    (2, 'b', 'z'),
    (2, 'b', 'z')
], ['id', 'var1', 'var2'])

## Using two 'first' and one 'countDistinct' aggregation works
df.groupby('id') \
    .agg(sfn.first('var1'),
         sfn.first('var2'),
         sfn.countDistinct('var1')).show()

## Using one 'max' with both 'countDistinct' works:
df.groupby('id') \
    .agg(sfn.max('var2'),
         sfn.countDistinct('var1'),
         sfn.countDistinct('var2')).show()

## But using both 'countDistinct' with at least one 'first' crashes
df.groupby('id') \
    .agg(sfn.first('var1'),
         sfn.first('var2'),
         sfn.countDistinct('var1'),
         sfn.countDistinct('var2')) \
    .show()
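A possible workaround (a sketch, not verified against the affected build; the 'first_var1'/'n_var1' aliases are illustrative) is to compute the distinct counts in a separate aggregation and join the result back on 'id', so that 'first' never shares an agg() call with multiple 'countDistinct' aggregates:

## Compute the 'first' aggregates and the distinct counts separately
firsts = df.groupby('id') \
    .agg(sfn.first('var1').alias('first_var1'),
         sfn.first('var2').alias('first_var2'))
distincts = df.groupby('id') \
    .agg(sfn.countDistinct('var1').alias('n_var1'),
         sfn.countDistinct('var2').alias('n_var2'))

## Joining on 'id' yields all four columns without mixing the aggregates
firsts.join(distincts, on='id').show()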
Issue Links
- duplicates SPARK-16648: LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException (Resolved)