Description
Some algorithms in Spark ML (e.g. LogisticRegression, LinearRegression, and I believe now KMeans) handle persistence internally: they check whether the input dataset is already cached, and if not, they cache it for performance.
However, the check is done using dataset.rdd.getStorageLevel == StorageLevel.NONE. This condition is always true, because dataset.rdd produces a newly derived RDD from the Dataset's plan, and that RDD is never cached even when the Dataset itself is.
Hence if the input dataset is cached, the data will end up being cached twice, which is wasteful.
To see this:
scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel

scala> val df = spark.range(10).toDF("num")
df: org.apache.spark.sql.DataFrame = [num: bigint]

scala> df.storageLevel == StorageLevel.NONE
res0: Boolean = true

scala> df.persist
res1: df.type = [num: bigint]

scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
res2: Boolean = true

scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
res3: Boolean = false

scala> df.rdd.getStorageLevel == StorageLevel.NONE
res4: Boolean = true
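
For context, here is a hedged sketch of the internal pattern described above; the method and variable names (fitSketch, handlePersistence, instances) are illustrative and not taken from any particular estimator:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.storage.StorageLevel

// Illustrative stand-in for an estimator's internal train/fit method.
def fitSketch(dataset: Dataset[Row]): Unit = {
  // dataset.rdd yields a freshly derived RDD that is never cached,
  // so this condition is always true, whether or not `dataset` is cached.
  val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE

  // Stand-in for the usual per-row conversion into the algorithm's
  // working representation (e.g. labeled instances).
  val instances: RDD[Double] = dataset.rdd.map(row => row.length.toDouble)

  // If the caller already persisted `dataset`, the same data is now
  // cached a second time here.
  if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)

  // ... iterative optimization over `instances` ...

  if (handlePersistence) instances.unpersist()
}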
Before SPARK-16063, there was no way to check the storage level of the input Dataset, but now there is, so these checks should be migrated to use dataset.storageLevel.
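
A minimal sketch of the migrated check, under the same illustrative setup as above:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.storage.StorageLevel

// Same illustrative pattern, with the check migrated to the Dataset-level
// storage level introduced by SPARK-16063.
def fitSketchFixed(dataset: Dataset[Row]): Unit = {
  // True only when the caller has not already persisted the input Dataset.
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE

  val instances: RDD[Double] = dataset.rdd.map(row => row.length.toDouble)

  // Cache the working RDD only when the input is not already cached.
  if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)

  // ... iterative optimization over `instances` ...

  if (handlePersistence) instances.unpersist()
}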
Issue Links
- blocks SPARK-19422 Cache input data in algorithms (Resolved)
- is duplicated by SPARK-21799 KMeans performance regression (5-6x slowdown) in Spark 2.2 (Closed)