Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 3.1.1
Fix Version/s: None
Component/s: None
Description
Let's say I want to create a UDF to do a simple lookup on a string:
import org.apache.spark.sql.{functions => f}

val M = Map("a" -> "abc", "b" -> "defg")
val BM = spark.sparkContext.broadcast(M)
val LOOKUP = f.udf((s: String) => BM.value.get(s))
Now if I have the following dataframe:
val df = Seq(Tuple1(Seq("a", "b"))).toDF("arr")
and I want to run this UDF over each element in the array, I can do:
df.select(f.transform($"arr", i => LOOKUP(i)).as("arr")).show(false)
This should show:
+-----------+
|arr        |
+-----------+
|[abc, defg]|
+-----------+
However, it actually shows:

+-----------+
|arr        |
+-----------+
|[def, defg]|
+-----------+
It's also broken in plain SQL (even without the DataFrame DSL); the following gives the same result:
spark.udf.register("LOOKUP", (s: String) => BM.value.get(s))
df.selectExpr("TRANSFORM(arr, a -> LOOKUP(a)) AS arr").show(false)
Note that "def" is not even in the map I'm using.
This is a big problem because it silently breaks existing code/UDFs. I noticed it because a job I ported from 2.4.5 to 3.1.1 seemed to be working, but was actually producing broken data.
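Until this is fixed, one possible workaround (a sketch only, assuming the problem is confined to the higher-order-function path; LOOKUP_ALL is a name invented for this example) is to do the per-element lookup inside a single UDF that takes the whole array:

// Workaround sketch: map over the array inside one UDF so that TRANSFORM
// (and its per-element UDF invocation) is never involved.
val LOOKUP_ALL = f.udf((arr: Seq[String]) => arr.map(s => BM.value.get(s)))
df.select(LOOKUP_ALL($"arr").as("arr")).show(false)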
Issue Links

Duplicates SPARK-34829: "transform_values return identical values when it's used with udf that returns reference type" (Resolved)