Details
- Improvement
- Status: Resolved
- Major
- Resolution: Not A Problem
- 2.2.0
- None
- None
Description
Currently, some algorithms cache the input dataset if it is not already cached (that is, its storage level is StorageLevel.NONE):
FeedForwardTrainer, LogisticRegression, OneVsRest, KMeans, AFTSurvivalRegression, IsotonicRegression, LinearRegression with non-WLS solver
It may be reasonable to cache the input for others as well:
DecisionTreeClassifier, GBTClassifier, RandomForestClassifier, LinearSVC
BisectingKMeans, GaussianMixture, LDA
DecisionTreeRegressor, GBTRegressor, GeneralizedLinearRegression with IRLS solver, RandomForestRegressor
NaiveBayes is not included since it makes only one pass over the data.
MultilayerPerceptronClassifier is not included since the data is already cached in FeedForwardTrainer.train.
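
The "cache only if the caller has not cached" pattern described above can be sketched roughly as follows. This is a minimal pure-Python model, not Spark code: ToyDataset, this StorageLevel enum, fit_iteratively, and the handle_persistence flag are hypothetical stand-ins invented for illustration. The sketch assumes that an uncached dataset is recomputed on every pass, which is why multi-pass algorithms benefit from caching and a single-pass one such as NaiveBayes does not.

```python
from enum import Enum


class StorageLevel(Enum):
    """Toy stand-in for a storage level; only two levels are modeled."""
    NONE = "none"
    MEMORY_AND_DISK = "memory_and_disk"


class ToyDataset:
    """Hypothetical dataset that counts how often its source is recomputed."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._cache = None
        self.storage_level = StorageLevel.NONE
        self.computations = 0

    def persist(self, level=StorageLevel.MEMORY_AND_DISK):
        self.storage_level = level
        return self

    def unpersist(self):
        self.storage_level = StorageLevel.NONE
        self._cache = None
        return self

    def collect(self):
        """One pass over the data; recomputes unless the data is cached."""
        if self._cache is not None:
            return self._cache
        self.computations += 1
        rows = list(self._rows)
        if self.storage_level is not StorageLevel.NONE:
            self._cache = rows
        return rows


def fit_iteratively(dataset, num_passes):
    """Multi-pass 'training' that caches the input only if the caller has not."""
    handle_persistence = dataset.storage_level is StorageLevel.NONE
    if handle_persistence:
        dataset.persist()
    try:
        total = 0.0
        for _ in range(num_passes):
            total += sum(dataset.collect())
        return total / num_passes
    finally:
        # Release only what this function cached itself.
        if handle_persistence:
            dataset.unpersist()
```

In this model, five passes over an uncached ToyDataset trigger a single computation, and the dataset returns to StorageLevel.NONE afterwards. A dataset the caller persisted beforehand is left persisted, since the function releases only the cache it created.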
Attachments
Issue Links
- is blocked by
-
SPARK-18608 Spark ML algorithms that check RDD cache level for internal caching double-cache data
- Resolved
- is related to
-
SPARK-21972 Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
- Resolved
- links to