Description
This JIRA is for discussing a potential change for the spark.ml package.
Issue: When an Estimator runs, it often computes helpful side information which is not stored in the returned Model. (E.g., linear methods have RDDs of residuals.) It would be nice to have this information by default, rather than having to recompute it.
Suggestion: Introduce a DistributedModel trait. Every Estimator in the spark.ml package should be able to return a distributed model with extra info computed during training.
Motivation: This kind of info is one of the most useful aspects of R. E.g., when you train a linear model, you can immediately summarize or plot information about the residuals. For MLlib, the user currently has to take extra steps (and computation time) to recompute this info.
API: My general idea is as follows.
trait Model
trait LocalModel extends Model
trait DistributedModel[LocalModelType <: LocalModel] extends Model {
  /** Convert to a local model. */
  def toLocal: LocalModelType
}
class LocalLDAModel extends LocalModel
class DistributedLDAModel extends DistributedModel[LocalLDAModel] {
  def toLocal: LocalLDAModel = ???  // conversion left unspecified in this sketch
}
Issues with this API:
- API stability: To keep the API stable in the future, either (a) all models should return DistributedModels, or (b) all models should return Models which can then be tested for the LocalModel or DistributedModel trait.
- memory “leaks”: Users may not expect models to store references to RDDs, so they may be surprised by how much storage is being used.
- naturally distributed models: Some models will simply be too large to be converted into LocalModels. It is unclear what to do here.
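To make option (b) under "API stability" concrete, here is a minimal sketch of how a caller might test a returned Model for the LocalModel or DistributedModel trait. Everything here is hypothetical: none of these types exist in spark.ml, LocalLinearModel and DistributedLinearModel are made-up toy models, and a plain Seq stands in for the RDD of residuals so the snippet runs without Spark.

```scala
// Hypothetical sketch only: these traits and classes are not part of spark.ml.
trait Model
trait LocalModel extends Model
trait DistributedModel[L <: LocalModel] extends Model {
  /** Convert to a local model, dropping the distributed side information. */
  def toLocal: L
}

// Toy linear model. `residuals` is a plain Seq standing in for an RDD.
final case class LocalLinearModel(weights: Vector[Double]) extends LocalModel

final case class DistributedLinearModel(weights: Vector[Double], residuals: Seq[Double])
    extends DistributedModel[LocalLinearModel] {
  def toLocal: LocalLinearModel = LocalLinearModel(weights)
}

object ModelKinds {
  /** Option (b): the caller receives a plain Model and tests which trait it carries. */
  def describe(model: Model): String = model match {
    case _: DistributedModel[_] => "distributed"
    case _: LocalModel          => "local"
    case _                      => "unknown"
  }

  /** Obtain a local model when one is available, regardless of the concrete kind. */
  def toLocalOption(model: Model): Option[LocalModel] = model match {
    case d: DistributedModel[_] => Some(d.toLocal)
    case l: LocalModel          => Some(l)
    case _                      => None
  }
}
```

Under option (a) the pattern match would be unnecessary, since every Estimator would return a DistributedModel and callers could call toLocal directly. The sketch does not address the third issue: a naturally distributed model would have no sensible toLocal implementation.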
Is this worthwhile?
Pros:
- Saving computation
- Easier for users (one fewer step, since they no longer have to recompute this info)
Cons:
- API issues
- Limited savings on computation. In general, computing this info may take much less time than model training (e.g., computing residuals vs. training a GLM).
Issue Links
- relates to SPARK-7412 Designing distributed prediction model abstractions for spark.ml (Resolved)