Spark / SPARK-34830

Some UDF calls inside transform are broken


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.1.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Let's say I want to create a UDF to do a simple lookup on a string:

      import org.apache.spark.sql.{functions => f}
      val M = Map("a" -> "abc", "b" -> "defg")
      val BM = spark.sparkContext.broadcast(M)
      val LOOKUP = f.udf((s: String) => BM.value.get(s))
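
      As a side note (not from the report itself): `Map.get` returns an
      `Option[String]`, and Spark's `udf` wrapper treats `Option` as a
      nullable column, so missing keys become NULL. The lookup logic on its
      own, in plain Scala:

      ```scala
      val M = Map("a" -> "abc", "b" -> "defg")

      // Map.get returns Option[String]: Some(value) on a hit, None on a miss.
      // Spark encodes None from a UDF as NULL in the result column.
      assert(M.get("a") == Some("abc"))
      assert(M.get("z") == None)
      assert(Seq("a", "b").map(M.get) == Seq(Some("abc"), Some("defg")))
      ```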
      

      Now, given the following DataFrame:

      val df = Seq(
          Tuple1(Seq("a", "b"))
      ).toDF("arr")
      

      and I want to run this UDF over each element in the array, I can do:

      df.select(f.transform($"arr", i => LOOKUP(i)).as("arr")).show(false)
      

      This should show:

      +-----------+
      |arr        |
      +-----------+
      |[abc, defg]|
      +-----------+
      

      However, it actually shows:

      +-----------+
      |arr        |
      +-----------+
      |[def, defg]|
      +-----------+
      

      It's also broken in plain SQL, without the DataFrame DSL. This gives the same result:

      spark.udf.register("LOOKUP",(s: String) => BM.value.get(s))
      df.selectExpr("TRANSFORM(arr, a -> LOOKUP(a)) AS arr").show(false)
      

      Note that "def" does not even appear in the map I'm using.

      This is a serious problem because it silently breaks existing code and UDFs. I noticed it because a job I ported from 2.4.5 to 3.1.1 appeared to be working but was actually producing corrupted data.
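
      One possible workaround (my own sketch, not from this report, and based on the assumption that the bug is specific to UDF calls inside `transform`): invoke the UDF once on the whole array, doing the per-element lookup in plain Scala inside the UDF, so `transform` is never involved. This assumes an active `SparkSession` named `spark`:

      ```scala
      import org.apache.spark.sql.{functions => f}
      import spark.implicits._

      val M = Map("a" -> "abc", "b" -> "defg")
      val BM = spark.sparkContext.broadcast(M)

      // One UDF invocation per row; the per-element mapping runs in plain
      // Scala over the Seq, bypassing the transform() code path entirely.
      val LOOKUP_ALL = f.udf((xs: Seq[String]) => xs.map(BM.value.get))

      val df = Seq(Tuple1(Seq("a", "b"))).toDF("arr")
      df.select(LOOKUP_ALL($"arr").as("arr")).show(false)
      ```

      The trade-off is losing the per-element composability of `transform`, but the broadcast lookup itself behaves identically.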

            People

              Assignee: Unassigned
              Reporter: Daniel Solow (dsolow1)
