[SPARK-19422] Cache input data in algorithms - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: ML
Labels: None

Description

Currently, some algorithms cache the input dataset if it is not already cached, that is, if its storage level is StorageLevel.NONE:
FeedForwardTrainer, LogisticRegression, OneVsRest, KMeans, AFTSurvivalRegression, IsotonicRegression, LinearRegression with non-WLS solver
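
The check-then-cache pattern described above could be sketched roughly as follows. This is an illustration only, not the code of any of the listed estimators: `is_cached` and `fit_with_input_caching` are hypothetical names, and `train` stands in for whatever multi-pass training routine an algorithm runs. The sketch assumes a dataset object that exposes `storageLevel`, `persist()` and `unpersist()` the way a PySpark DataFrame does.

```python
def is_cached(dataset):
    """Return True if the dataset has any storage level other than NONE.

    Assumption: `dataset.storageLevel` exposes the boolean fields
    useDisk, useMemory and useOffHeap, as PySpark's StorageLevel does.
    """
    level = dataset.storageLevel
    return bool(level.useDisk or level.useMemory or level.useOffHeap)


def fit_with_input_caching(dataset, train):
    """Run `train(dataset)`, caching the input only if the caller has not.

    `train` is a hypothetical callable for an iterative algorithm that
    makes several passes over the data.
    """
    handle_persistence = not is_cached(dataset)
    if handle_persistence:
        dataset.persist()
    try:
        return train(dataset)
    finally:
        # Release only what this function cached; a cache set up by the
        # caller is left untouched.
        if handle_persistence:
            dataset.unpersist()
```

With a real DataFrame `df`, a call would look like `fit_with_input_caching(df, some_training_function)`. If `df` was already persisted by the user, the helper neither re-caches nor unpersists it. Note that this sketch inspects the DataFrame-level storage level; the linked SPARK-18608 concerns implementations that inspect the underlying RDD's level instead.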

It may be reasonable to cache the input for others as well:
DecisionTreeClassifier, GBTClassifier, RandomForestClassifier, LinearSVC
BisectingKMeans, GaussianMixture, LDA
DecisionTreeRegressor, GBTRegressor, GeneralizedLinearRegression with IRLS solver, RandomForestRegressor

NaiveBayes is not included since it makes only one pass over the data.
MultilayerPerceptronClassifier is not included since its data is already cached in FeedForwardTrainer.train.

Issue Links

is blocked by

SPARK-18608 Spark ML algorithms that check RDD cache level for internal caching double-cache data

Resolved

is related to

SPARK-21972 Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

Resolved

links to

[Github] Pull Request #16763 (zhengruifeng)

People

Assignee: Unassigned

Reporter: Ruifeng Zheng

Votes: 0

Watchers: 5

Dates

Created: 01/Feb/17 08:36

Updated: 14/Jun/18 09:14

Resolved: 14/Jun/18 09:14