Spark / SPARK-39962

Global aggregation against pandas aggregate UDF does not take the column order into account

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.3, 3.3.0, 3.2.2, 3.4.0
    • Fix Version/s: 3.1.4, 3.3.1, 3.2.3, 3.4.0
    • Component/s: PySpark
    • Labels: None

    Description

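      import pandas as pd
      from pyspark.sql import functions as f

      # Grouped-aggregate pandas UDF: reduces a pd.Series to a single float.
      @f.pandas_udf("double")
      def AVG(x: pd.Series) -> float:
          return x.mean()


      # `spark` is an existing SparkSession (for example, the one in the pyspark shell).
      abc = spark.createDataFrame([(1.0, 5.0, 17.0)], schema=["a", "b", "c"])
      abc.agg(AVG("a"), AVG("c")).show()
      # Same global aggregation after reordering the input columns.
      abc.select("c", "a").agg(AVG("a"), AVG("c")).show()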
      
      +------+------+
      |AVG(a)|AVG(c)|
      +------+------+
      |   1.0|  17.0|
      +------+------+
      
      +------+------+
      |AVG(a)|AVG(c)|
      +------+------+
      |  17.0|   1.0|
      +------+------+
      

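      Both calls should return the same result (AVG(a) = 1.0, AVG(c) = 17.0). The second call, which reorders the input columns with select("c", "a"), returns the two values swapped.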
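
      The expected invariant can be pictured with a minimal plain-Python sketch. It is an illustration only and not Spark's implementation: the dictionaries and the helpers agg_by_name and agg_by_position are hypothetical, and treating the swap as a positional-binding problem is an assumption based on the observed output rather than something stated in this report.

```python
from statistics import mean

# Hypothetical stand-in for the one-row DataFrame: column name -> values.
row_abc = {"a": [1.0], "b": [5.0], "c": [17.0]}
# Same data with the columns reordered, mirroring abc.select("c", "a").
row_ca = {"c": row_abc["c"], "a": row_abc["a"]}


def agg_by_name(columns, names):
    """Bind each aggregate to its input column by name (order-independent)."""
    return [mean(columns[n]) for n in names]


def agg_by_position(columns, offsets):
    """Bind each aggregate to its input column by position (order-sensitive)."""
    ordered = list(columns.values())
    return [mean(ordered[i]) for i in offsets]


# Name-based binding gives the same answer for both column layouts.
by_name_abc = agg_by_name(row_abc, ["a", "c"])
by_name_ca = agg_by_name(row_ca, ["a", "c"])

# Offsets that assume the layout (a, c) pick the wrong columns once the
# input is laid out as (c, a), which reproduces the swapped values.
positional_ca = agg_by_position(row_ca, [0, 1])

print(by_name_abc, by_name_ca, positional_ca)
```

      In this sketch the name-based results match for both layouts, while the positional result for the reordered input comes out swapped, matching the second table above.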


          People

            Assignee: Hyukjin Kwon (gurwls223)
            Reporter: Hyukjin Kwon (gurwls223)
            Votes: 0
            Watchers: 1
