Description
This sample works in spark-shell 2.3.1 but throws an exception in 2.4.0:
{code:scala}
import java.util.Arrays.asList
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

spark.createDataFrame(
  asList(
    Row("1-1", "sp", 6), Row("1-1", "pc", 5), Row("1-2", "pc", 4),
    Row("2-1", "sp", 3), Row("2-2", "pc", 2), Row("2-2", "sp", 1)
  ),
  StructType(List(
    StructField("id", StringType),
    StructField("layout", StringType),
    StructField("n", IntegerType)
  ))
).createOrReplaceTempView("cc")

spark.createDataFrame(
  asList(
    Row("sp", 1), Row("sp", 1), Row("sp", 2), Row("sp", 3),
    Row("sp", 3), Row("sp", 4), Row("sp", 5), Row("sp", 5),
    Row("pc", 1), Row("pc", 2), Row("pc", 2), Row("pc", 3),
    Row("pc", 4), Row("pc", 4), Row("pc", 5)
  ),
  StructType(List(
    StructField("layout", StringType),
    StructField("ts", IntegerType)
  ))
).createOrReplaceTempView("p")

spark.createDataFrame(
  asList(
    Row("1-1", "sp", 1), Row("1-1", "sp", 2), Row("1-1", "pc", 3),
    Row("1-2", "pc", 3), Row("1-2", "pc", 4),
    Row("2-1", "sp", 4), Row("2-1", "sp", 5),
    Row("2-2", "pc", 6), Row("2-2", "sp", 6)
  ),
  StructType(List(
    StructField("id", StringType),
    StructField("layout", StringType),
    StructField("ts", IntegerType)
  ))
).createOrReplaceTempView("c")

spark.sql("""
  SELECT cc.id, cc.layout, count(*) AS m
  FROM cc JOIN p USING(layout)
  WHERE EXISTS(SELECT 1 FROM c
               WHERE c.id = cc.id AND c.layout = cc.layout AND c.ts > p.ts)
  GROUP BY cc.id, cc.layout
""").createOrReplaceTempView("pcc")

spark.sql("SELECT * FROM pcc ORDER BY id, layout").show

spark.sql("""
  SELECT cc.id, cc.layout, n, m
  FROM cc LEFT OUTER JOIN pcc ON pcc.id = cc.id AND pcc.layout = cc.layout
""").createOrReplaceTempView("k")

spark.sql("SELECT * FROM k ORDER BY id, layout").show
{code}
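The part of the repro that seems to matter (my reading of the exception, not a confirmed root cause) is the EXISTS subquery: it is correlated with cc and also references p.ts from the other side of the USING join, so attribute binding has to reach across both join inputs. Distilled to just that shape:

{code:scala}
// Distilled shape of the failing query (an assumption based on the
// binding exception, not confirmed): the correlated EXISTS references
// columns of both join sides, cc.id/cc.layout and p.ts.
spark.sql("""
  SELECT cc.id, cc.layout
  FROM cc JOIN p USING(layout)
  WHERE EXISTS(SELECT 1 FROM c
               WHERE c.id = cc.id AND c.layout = cc.layout AND c.ts > p.ts)
""").show
{code}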
I was actually trying to track down a different bug (similar computations with joins and nested queries return different results in Spark 2.3.1 and 2.4.0), but while reducing it to a minimal example I hit this exception instead:

{noformat}
java.lang.RuntimeException: Couldn't find id#0 in [id#38,layout#39,ts#7,id#10,layout#11,ts#12]
{noformat}
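A possible workaround, sketched under the assumption that the outer reference p.ts inside the EXISTS is what trips the binder (untested against 2.4.0): express the EXISTS as a LEFT SEMI JOIN, which keeps the same "at least one match" semantics without a correlated subquery.

{code:scala}
// Hedged workaround sketch, not a confirmed fix: LEFT SEMI JOIN keeps each
// base row at most once when a match exists, matching the EXISTS semantics
// (so count(*) is unchanged), while avoiding outer references in a subquery.
spark.sql("""
  SELECT id, layout, count(*) AS m
  FROM (
    SELECT cc.id, cc.layout, p.ts
    FROM cc JOIN p USING(layout)
  ) base
  LEFT SEMI JOIN c
    ON c.id = base.id AND c.layout = base.layout AND c.ts > base.ts
  GROUP BY id, layout
""").createOrReplaceTempView("pcc")
{code}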
Issue Links
- is related to: SPARK-26041 "catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute" (Resolved)