Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-21799

KMeans performance regression (5-6x slowdown) in Spark 2.2

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Duplicate
    • 2.2.0
    • None
    • MLlib
    • None

    Description

      I've been running KMeans performance tests using spark-sql-perf and have noticed a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 2.2 vs 2.1.

      The test params are:

      • Cluster: 510 GB RAM, 16 workers
      • Data: 1000000 examples, 10000 features

      After talking to josephkb, the issue seems related to the changes in SPARK-18356 introduced in this PR.

      It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so `handlePersistence` is true even when KMeans is run on a cached DataFrame. This unnecessarily causes another copy of the input dataset to be persisted.

      As of Spark 2.1 (JIRA link) `df.storageLevel` returns the correct result after calling `df.cache()`, so I'd suggest replacing instances of `df.rdd.getStorageLevel` with df.storageLevel` in MLlib algorithms (the same pattern shows up in LogisticRegression, LinearRegression, and others). I've verified this behavior in this notebook

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              Siddharth Murching Siddharth Murching
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: