Description
Running the following SparkSQL query over JDBC:
SELECT * FROM FAA WHERE Year >= 1998 AND Year <= 1999 ORDER BY RAND () LIMIT 100000
This results in one or more workers throwing the following exception, with variations for mergeLo and mergeHi.
java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at java.util.TimSort.mergeHi(TimSort.java:868)
    at java.util.TimSort.mergeAt(TimSort.java:485)
    at java.util.TimSort.mergeCollapse(TimSort.java:410)
    at java.util.TimSort.sort(TimSort.java:214)
    at java.util.Arrays.sort(Arrays.java:727)
    at org.spark-project.guava.common.collect.Ordering.leastOf(Ordering.java:708)
    at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
    at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1138)
    at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1135)
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
We have tested with both Spark 1.2.0 and Spark 1.2.1 and have seen the same error in both. The query sometimes succeeds, but fails more often than not. Whilst this sounds similar to bugs 3032 and 3656, we believe it is not the same issue.
The ORDER BY RAND () uses TimSort to produce the random ordering by sorting the rows on a random value. Having spent some time looking at the issue with jdb, it appears that the problem is triggered by the random values changing during the sort. The code which triggers this is in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Row.scala (class RowOrdering, function compare, line 250), where a new random number is taken for the same row each time it is compared. Because the sort key of a row is not stable, the comparator can return contradictory results for the same rows, which is the contract violation TimSort detects.
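
To illustrate the suspected mechanism outside Spark, here is a minimal Java sketch. It is not Spark's code: the class RandomOrderSketch and its members are hypothetical names invented for this example. UnstableComparator mimics a comparator that draws fresh random values on every call, and shuffleByMaterializedKey shows the alternative of evaluating one random key per row before sorting, so that the comparator only ever sees fixed values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; none of these names exist in Spark.
class RandomOrderSketch {

    /** Broken pattern: each call draws new random keys, so the same pair can compare differently. */
    static final class UnstableComparator implements Comparator<Integer> {
        private final Random rng;

        UnstableComparator(long seed) {
            this.rng = new Random(seed);
        }

        @Override
        public int compare(Integer a, Integer b) {
            // The arguments are ignored: the result depends only on fresh random draws.
            return Double.compare(rng.nextDouble(), rng.nextDouble());
        }
    }

    /** Returns true if comparing (a, b) yields the same sign on every one of the given trials. */
    static <T> boolean looksConsistent(Comparator<T> cmp, T a, T b, int trials) {
        int first = Integer.signum(cmp.compare(a, b));
        for (int i = 1; i < trials; i++) {
            if (Integer.signum(cmp.compare(a, b)) != first) {
                return false;
            }
        }
        return true;
    }

    /** Safer pattern: draw one random key per row up front, then sort on the stored keys. */
    static <T> List<T> shuffleByMaterializedKey(List<T> rows, long seed) {
        Random rng = new Random(seed);
        final double[] keys = new double[rows.size()];
        Integer[] order = new Integer[rows.size()];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = rng.nextDouble(); // evaluated exactly once per row
            order[i] = i;
        }
        // The comparator reads fixed keys, so it is consistent across calls.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> keys[i]));
        List<T> out = new ArrayList<>(rows.size());
        for (int i : order) {
            out.add(rows.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
        System.out.println("unstable comparator consistent? "
                + looksConsistent(new UnstableComparator(7L), 1, 2, 100));
        System.out.println("shuffled with materialized keys: "
                + shuffleByMaterializedKey(rows, 42L));
    }
}
```

Sorting with a comparator like UnstableComparator does not fail on every run, since TimSort only raises the exception when it happens to detect an inconsistency during a merge. That would be consistent with the query sometimes succeeding.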
Issue Links
- duplicates
  - SPARK-9083 If order by clause has non-deterministic expressions, we should add a project to materialize results of these expressions (Resolved)
  - SPARK-8428 TimSort Comparison method violates its general contract with CLUSTER BY (Resolved)
- relates to
  - SPARK-8428 TimSort Comparison method violates its general contract with CLUSTER BY (Resolved)