Uploaded image for project: 'SystemDS'
  1. SystemDS
  2. SYSTEMDS-1025

Perftest: Large performance variability on scenario L dense (80GB)

Details

    • Bug
    • Status: Closed
    • Blocker
    • Resolution: Fixed
    • None
    • SystemML 0.11
    • None
    • None

    Description

      During many runs of our entire performance testsuite, we've seen quite some performance variability, especially for scenario L dense (80GB) where spark operations are the dominating factor for end-to-end performance. These issues showed up over all algorithms and configurations but especially for multinomial classification and parfor scripts.

      Let's take for example Naive Bayes over the dense 10M x 1K input with 20 classes. Below are the results of 7 consecutive runs:

      NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 67
      NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 362
      NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 484
      NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 64
      NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 310
      NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 91
      NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 68
      

      After a detailed investigation, it seems that imbalance, garbage collection, and poor data locality are the reasons:

      • First, we generated the inputs with our Spark backend. Apparently, the rand operation caused imbalance due to garbage collection of some nodes. However, this is a very realistic scenario as we cannot always assume perfect balance.
      • Second, especially for multinomial classification and parfor scripts, the intermediates are not just vectors but larger matrices or there are simply more intermediates. This led again to more garbage collection.
      • Third, the scheduler delay of 3s for pending tasks was exceeded due to garbage collection, leading to remote execution which significantly slowed down the overall execution.

      To resolve these issues, we should make the following two changes:

      • (1) More conservative configuration of spark.locality.wait in systemml's preferred spark configuration, where we did not consider this at all so far.
      • (2) Improvements of reduce-all operations which current unnecessarily create intermediate pair outputs and hence unnecessary Tuple2 and MatrixIndexes objects.

      With a default scheduler delay of 5s instead of the default 3s as well as improved reduce-all for mapmm, groupedagg, tsmm, tsmm2, zipmm, and uagg, we got the following promising results (which include spark context creation and initial read):

      NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 52
      NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 45
      NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 44
      NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 44
      NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 51
      NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 50
      NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 47
      

      cc reinwald niketanpansare freiss

      Attachments

        Issue Links

          Activity

            mboehm7 Matthias Boehm added a comment -

            just to clarify: if the stated assumption of many input partitions and few output partitions holds, considering the size is actually not really helpful as it balances out with many partitions. In our case, we had less than 2x more input partitions than output partitions and hence considering the size would have helped.

            mboehm7 Matthias Boehm added a comment - just to clarify: if the stated assumption of many input partitions and few output partitions holds, considering the size is actually not really helpful as it balances out with many partitions. In our case, we had less than 2x more input partitions than output partitions and hence considering the size would have helped.
            mboehm7 Matthias Boehm added a comment -

            thanks freiss - it's certainly good to think about the big picture here. I like coalesce as a "local" operation. However, a "partition-size-aware coalesce" would have helped here - looking over CoalescedRDD.scala #https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L124 Line 124, it seems that right now coalesce only considers an even distribution of the number of partitions, assuming that all partitions are of equal size.

            mboehm7 Matthias Boehm added a comment - thanks freiss - it's certainly good to think about the big picture here. I like coalesce as a "local" operation. However, a "partition-size-aware coalesce" would have helped here - looking over CoalescedRDD.scala #https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L124 Line 124 , it seems that right now coalesce only considers an even distribution of the number of partitions, assuming that all partitions are of equal size.

            Thanks for the analysis, Matthias! Would a more robust version of RDD.coalesce() (i.e. keep partitions local when possible and shuffle the remainder) have helped in this case?

            freiss Frederick Reiss added a comment - Thanks for the analysis, Matthias! Would a more robust version of RDD.coalesce() (i.e. keep partitions local when possible and shuffle the remainder) have helped in this case?
            mboehm7 Matthias Boehm added a comment -

            with regenerated data and the full patch, the numbers now look very good and most importantly don't show any variability any longer:

            NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 42
            NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 41
            NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 41
            NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 41
            NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 41
            NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 40
            NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 41
            
            mboehm7 Matthias Boehm added a comment - with regenerated data and the full patch, the numbers now look very good and most importantly don't show any variability any longer: NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 42 NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 41 NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 41 NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 41 NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 41 NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 40 NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 41
            mboehm7 Matthias Boehm added a comment -

            just a quick update: here is what really cause the rand imbalance. Our spark rand instruction generates seeds per block and parallelizes these pairs to totalsize/hdfs_blocksize partitions in order to ensure that no partition exceeds the 2GB limitation. As it turned out, we underestimated the totalsize leading to each partition writing one full hdfs block (~ 16/17 matrix blocks) and a very small hdfs block (of just 1 or 2 matrix blocks). When this dataset was read in, Spark creates a partition per hdfs block and we performed a coalesce to a number of partitions that would match hdfs blocksize. Since coalesce avoids shuffle, we ended up with some partitions of 34 matrix blocks, some 16-19, and some 2-4. The nodes that have these very small partitions finish early, and pull partitions from the other nodes, which in turn triggers shuffle and most additional garbage collection due to deserialization of shuffle matrix blocks.

            The bottom line is, I'll create a patch including (1) the handling of the scheduler delay (which helps in terms of more robust performance), (2) better handling of reduce-all operations, and (3) changed rand/seq instructions to create balanced outputs.

            mboehm7 Matthias Boehm added a comment - just a quick update: here is what really cause the rand imbalance. Our spark rand instruction generates seeds per block and parallelizes these pairs to totalsize/hdfs_blocksize partitions in order to ensure that no partition exceeds the 2GB limitation. As it turned out, we underestimated the totalsize leading to each partition writing one full hdfs block (~ 16/17 matrix blocks) and a very small hdfs block (of just 1 or 2 matrix blocks). When this dataset was read in, Spark creates a partition per hdfs block and we performed a coalesce to a number of partitions that would match hdfs blocksize. Since coalesce avoids shuffle, we ended up with some partitions of 34 matrix blocks, some 16-19, and some 2-4. The nodes that have these very small partitions finish early, and pull partitions from the other nodes, which in turn triggers shuffle and most additional garbage collection due to deserialization of shuffle matrix blocks. The bottom line is, I'll create a patch including (1) the handling of the scheduler delay (which helps in terms of more robust performance), (2) better handling of reduce-all operations, and (3) changed rand/seq instructions to create balanced outputs.

            People

              mboehm7 Matthias Boehm
              mboehm7 Matthias Boehm
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: