Spark / SPARK-6009

IllegalArgumentException thrown by TimSort when SQL ORDER BY RAND ()

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.2.0, 1.2.1, 1.3.0, 1.4.0
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels: None
    • Environment: CentOS 7, Hadoop 2.6.0, Hive 0.15.0
      java version "1.7.0_75"
      OpenJDK Runtime Environment (rhel-2.5.4.2.el7_0-x86_64 u75-b13)
      OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

    Description

      Running the following SparkSQL query over JDBC:

          SELECT *
          FROM FAA
          WHERE Year >= 1998 AND Year <= 1999
          ORDER BY RAND () LIMIT 100000
      

      This results in one or more workers throwing the following exception, with variations for mergeLo and mergeHi.

          java.lang.IllegalArgumentException: Comparison method violates its general contract!
              at java.util.TimSort.mergeHi(TimSort.java:868)
              at java.util.TimSort.mergeAt(TimSort.java:485)
              at java.util.TimSort.mergeCollapse(TimSort.java:410)
              at java.util.TimSort.sort(TimSort.java:214)
              at java.util.Arrays.sort(Arrays.java:727)
              at org.spark-project.guava.common.collect.Ordering.leastOf(Ordering.java:708)
              at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
              at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1138)
              at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1135)
              at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
              at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
              at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
              at org.apache.spark.scheduler.Task.run(Task.scala:56)
              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:745)
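
For readers unfamiliar with the message in the trace: the "general contract" of a comparator requires, among other things, that sgn(compare(a, b)) equals -sgn(compare(b, a)) for every pair. The sketch below is only an illustration of that rule, not Spark code. The class `ComparatorContractCheck` and both of its methods are hypothetical names, and `freshRandomComparator` is an assumed stand-in for an ordering that evaluates a random expression on every call.

```java
import java.util.Comparator;
import java.util.Random;

class ComparatorContractCheck {

    // Hypothetical stand-in: ignores its arguments and compares two
    // freshly drawn random numbers, so repeated calls can disagree.
    static Comparator<Integer> freshRandomComparator(final Random rng) {
        return (a, b) -> Double.compare(rng.nextDouble(), rng.nextDouble());
    }

    // Counts ordered pairs (a, b) for which
    // sgn(compare(a, b)) != -sgn(compare(b, a)).
    static <T> int antisymmetryViolations(Comparator<T> cmp, T[] items) {
        int violations = 0;
        for (T a : items) {
            for (T b : items) {
                int ab = Integer.signum(cmp.compare(a, b));
                int ba = Integer.signum(cmp.compare(b, a));
                if (ab != -ba) {
                    violations++;
                }
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        Integer[] rows = new Integer[50];
        for (int i = 0; i < rows.length; i++) {
            rows[i] = i;
        }
        System.out.println("natural order violations: "
                + antisymmetryViolations(Comparator.<Integer>naturalOrder(), rows));
        System.out.println("fresh-random comparator violations: "
                + antisymmetryViolations(freshRandomComparator(new Random(1)), rows));
    }
}
```

TimSort does not check the contract on every comparison; it only throws when a merge step reaches a state that a consistent comparator could not have produced, so a comparator like this one may or may not fail on any given run.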
      

      We have tested with both Spark 1.2.0 and Spark 1.2.1 and have seen the same error in both. The query sometimes succeeds, but fails more often than not. Whilst this sounds similar to bugs 3032 and 3656, we believe it is not the same issue.

      ORDER BY RAND () uses TimSort to produce the random ordering by sorting a list of random values. Having spent some time looking at the issue with jdb, it appears that the problem is triggered by the random values changing during the sort. The code which triggers this is in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Row.scala (class RowOrdering, function compare, line 250), where a new random number is taken each time the same row is compared, so the same pair of rows can compare differently from one call to the next.
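
One general way to avoid this class of failure is to draw each row's random value exactly once and then sort on the stored value. The sketch below illustrates that idea in plain Java under stated assumptions; it is not the change made in Spark, and `StableRandomOrder`, `Keyed`, `shuffleBySortingKeys` and `sortWithFreshRandomComparator` are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

class StableRandomOrder {

    // Pairs a row with a random key that is drawn once and never changes.
    static final class Keyed<T> {
        final double key;
        final T row;

        Keyed(double key, T row) {
            this.key = key;
            this.row = row;
        }
    }

    // Random ordering by sorting on precomputed keys: the comparator only
    // reads stored values, so it answers consistently during the sort.
    static <T> List<T> shuffleBySortingKeys(List<T> rows, Random rng) {
        List<Keyed<T>> keyed = new ArrayList<>(rows.size());
        for (T row : rows) {
            keyed.add(new Keyed<>(rng.nextDouble(), row));
        }
        keyed.sort(Comparator.comparingDouble((Keyed<T> k) -> k.key));
        List<T> out = new ArrayList<>(rows.size());
        for (Keyed<T> k : keyed) {
            out.add(k.row);
        }
        return out;
    }

    // For contrast: draws new random numbers inside every comparison.
    // Returns false if the sort rejected the comparator; whether that
    // happens depends on the data and the random sequence.
    static boolean sortWithFreshRandomComparator(List<Integer> rows, Random rng) {
        try {
            rows.sort((a, b) -> Double.compare(rng.nextDouble(), rng.nextDouble()));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100000; i++) {
            rows.add(i);
        }
        List<Integer> shuffled = shuffleBySortingKeys(rows, new Random());
        System.out.println("precomputed keys: sorted " + shuffled.size() + " rows");
        boolean ok = sortWithFreshRandomComparator(new ArrayList<>(rows), new Random());
        System.out.println("fresh random per comparison: "
                + (ok ? "completed this time" : "IllegalArgumentException"));
    }
}
```

The trade-off is one extra stored value per row, in exchange for a comparator that is consistent across repeated calls.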

            People

              Assignee: Unassigned
              Reporter: Paul Barber (paulbarber)
              Votes: 2
              Watchers: 4