Description
Some algorithms in Spark ML (e.g. LogisticRegression, LinearRegression, and I believe now KMeans) handle persistence internally: they check whether the input dataset is already cached, and if not, they cache it for performance.
However, the check is done using dataset.rdd.getStorageLevel == StorageLevel.NONE. This condition is always true, because dataset.rdd produces a newly derived RDD from the Dataset's plan, and that RDD is never cached even when the Dataset itself is.
Hence if the input dataset is cached, the data will end up being cached twice, which is wasteful.
To see this:
scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel

scala> val df = spark.range(10).toDF("num")
df: org.apache.spark.sql.DataFrame = [num: bigint]

scala> df.storageLevel == StorageLevel.NONE
res0: Boolean = true

scala> df.persist
res1: df.type = [num: bigint]

scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
res2: Boolean = true

scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
res3: Boolean = false

scala> df.rdd.getStorageLevel == StorageLevel.NONE
res4: Boolean = true
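
For context, here is a hedged sketch of the internal pattern described above; the method and variable names (fitSketch, handlePersistence, instances) are illustrative and not taken from any particular estimator:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.storage.StorageLevel

// Illustrative stand-in for an estimator's internal train/fit method.
def fitSketch(dataset: Dataset[Row]): Unit = {
  // dataset.rdd yields a freshly derived RDD that is never cached,
  // so this condition is always true, whether or not `dataset` is cached.
  val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE

  // Stand-in for the usual per-row conversion into the algorithm's
  // working representation (e.g. labeled instances).
  val instances: RDD[Double] = dataset.rdd.map(row => row.length.toDouble)

  // If the caller already persisted `dataset`, the same data is now
  // cached a second time here.
  if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)

  // ... iterative optimization over `instances` ...

  if (handlePersistence) instances.unpersist()
}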
Before SPARK-16063, there was no way to check the storage level of the input Dataset, but now there is, so these checks should be migrated to use dataset.storageLevel.
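
A minimal sketch of the migrated check, under the same illustrative setup as above:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.storage.StorageLevel

// Same illustrative pattern, with the check migrated to the Dataset-level
// storage level introduced by SPARK-16063.
def fitSketchFixed(dataset: Dataset[Row]): Unit = {
  // True only when the caller has not already persisted the input Dataset.
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE

  val instances: RDD[Double] = dataset.rdd.map(row => row.length.toDouble)

  // Cache the working RDD only when the input is not already cached.
  if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)

  // ... iterative optimization over `instances` ...

  if (handlePersistence) instances.unpersist()
}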
Issue Links
- blocks SPARK-19422 Cache input data in algorithms (Resolved)
- is duplicated by SPARK-21799 KMeans performance regression (5-6x slowdown) in Spark 2.2 (Closed)