Description
I suspect this impacts many more versions of Spark, but I don't know for sure because it takes a long time to test. While doing corner-case validation testing for spark-rapids, I found that if a window function sees more than Int.MaxValue + 1 rows, the result is silently truncated to Int.MaxValue + 1 rows. I have only tested this on 3.0.2 with row_number, but I suspect it impacts other window functions as well. This is a very rare corner case, but because it is silent data corruption I personally consider it quite serious.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy("a").orderBy("b")
val df = spark.range(Int.MaxValue.toLong + 100).selectExpr("1 as a", "id as b")

spark.time(df.select(col("a"), col("b"), row_number().over(windowSpec).alias("rn"))
  .orderBy(desc("a"), desc("b"))
  .select((col("rn") < 0).alias("dir"))
  .groupBy("dir").count.show(20))

+-----+----------+
|  dir|     count|
+-----+----------+
|false|2147483647|
| true|         1|
+-----+----------+
Time taken: 1139089 ms

Int.MaxValue.toLong + 100
res15: Long = 2147483747

2147483647L + 1
res16: Long = 2147483648
I had to make sure I ran the above with at least 64 GiB of heap for the executor (I ran it in local mode and it worked, but it took a very long time to finish).
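For what it's worth, the single dir=true row in the output is consistent with a 32-bit row counter wrapping past Int.MaxValue; row_number does return an IntegerType result. The snippet below is only a minimal standalone sketch of that interpretation (plain Scala, not Spark internals, and I have not confirmed where the overflow actually happens in the Spark source), showing why the (col("rn") < 0) check in the repro catches exactly one row.

object RowNumberOverflowSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical 32-bit counter, mirroring what the (rn < 0) check above suggests.
    val lastGoodRowNumber: Int = Int.MaxValue   // 2147483647, the dir=false count above
    val wrapped: Int = lastGoodRowNumber + 1    // Int overflow: wraps to -2147483648
    println(wrapped)                            // prints -2147483648, i.e. rn < 0
    // The two counts above sum to 2147483648 (Int.MaxValue + 1), which is fewer than
    // the 2147483747 input rows, so the remaining 99 rows appear to be silently dropped.
  }
}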