[SPARK-6233] Should spark.ml Models be distributed by default? - ASF JIRA

Details

Type: Brainstorming
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 1.4.0
Fix Version/s: None
Component/s: ML
Labels: None
Target Version/s: 1.4.0

Description

This JIRA is for discussing a potential change for the spark.ml package.

Issue: When an Estimator runs, it often computes helpful side information which is not stored in the returned Model. (E.g., linear methods have RDDs of residuals.) It would be nice to have this information by default, rather than having to recompute it.

Suggestion: Introduce a DistributedModel trait. Every Estimator in the spark.ml package should be able to return a distributed model with extra info computed during training.

Motivation: This kind of info is one of the most useful aspects of R. E.g., when you train a linear model, you can immediately summarize or plot information about the residuals. For MLlib, the user currently has to take extra steps (and computation time) to recompute this info.

API: My general idea is as follows.

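trait Model
trait LocalModel extends Model
// An upper bound (<:) is needed here; traits cannot take context bounds.
trait DistributedModel[LocalModelType <: LocalModel] extends Model {
  /** Convert to a local model. */
  def toLocal: LocalModelType
}

class LocalLDAModel extends LocalModel
// The local model type is a type argument to DistributedModel, not a type parameter of this class.
class DistributedLDAModel extends DistributedModel[LocalLDAModel] {
  def toLocal: LocalLDAModel = ???  // conversion logic left unspecified
}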
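To make the idea concrete, here is a minimal sketch of how a distributed model could carry training-time side information and still convert to a local model. It is only an illustration under stated assumptions: `LocalLinearModel`, `DistributedLinearModel`, and `meanSquaredResidual` are hypothetical names, and a plain `Seq[Double]` stands in for the RDD of residuals so that the example runs without Spark.

```scala
object DistributedModelSketch {
  trait Model
  trait LocalModel extends Model
  trait DistributedModel[L <: LocalModel] extends Model {
    /** Convert to a local model. */
    def toLocal: L
  }

  // Hypothetical local linear model: holds only the coefficients.
  final case class LocalLinearModel(weights: Vector[Double]) extends LocalModel {
    def predict(x: Vector[Double]): Double =
      weights.zip(x).map { case (w, xi) => w * xi }.sum
  }

  // Hypothetical distributed variant: also keeps the residuals computed
  // during training. Seq[Double] is an assumption standing in for RDD[Double].
  final case class DistributedLinearModel(
      weights: Vector[Double],
      residuals: Seq[Double]
  ) extends DistributedModel[LocalLinearModel] {
    // Summary statistic available without a second pass over the data.
    def meanSquaredResidual: Double =
      residuals.map(r => r * r).sum / residuals.size

    // Dropping the residuals yields the lightweight local model.
    def toLocal: LocalLinearModel = LocalLinearModel(weights)
  }
}
```

In this sketch the caller can summarize residuals straight from the returned model and call `toLocal` once the extra data is no longer needed.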

Issues with this API:

API stability: To keep the API stable in the future, either (a) every Estimator should return a DistributedModel, or (b) every Estimator should return a Model, which the caller can then test for the LocalModel or DistributedModel trait.
Memory “leaks”: Users may not expect models to hold references to RDDs, so they may be surprised by how much storage is used.
Naturally distributed models: Some models will simply be too large to convert into LocalModels. It is unclear what to do in that case.
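Option (b) under API stability could look roughly like the following from the caller's side. This is a sketch only: `TinyLocalModel`, `TinyDistributedModel`, and `storageKind` are hypothetical names introduced for illustration, and the traits are redeclared so the example is self-contained.

```scala
object ModelKindSketch {
  trait Model
  trait LocalModel extends Model
  trait DistributedModel[L <: LocalModel] extends Model {
    def toLocal: L
  }

  // Hypothetical concrete models used only to exercise the match below.
  final class TinyLocalModel extends LocalModel
  final class TinyDistributedModel extends DistributedModel[TinyLocalModel] {
    def toLocal: TinyLocalModel = new TinyLocalModel
  }

  // The Estimator returns a plain Model; the caller tests which trait it has.
  def storageKind(model: Model): String = model match {
    case _: LocalModel          => "local"
    case _: DistributedModel[_] => "distributed"
    case _                      => "unknown"
  }
}
```

The trade-off is that every caller who wants the extra information has to perform this kind of type test, whereas option (a) makes the return type uniform at the cost of always producing a distributed model.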

Is this worthwhile?
Pros:

Saving computation
Easier for users (one fewer step, since they no longer have to recompute this info themselves)

Cons:

API issues
Limited savings on computation. In general, computing this info may take much less time than model training (e.g., computing residuals vs. training a GLM).
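To illustrate why the savings may be limited, the sketch below recomputes residuals for a linear model in a single pass over the data, whereas training typically needs many passes. The names `Example`, `predict`, and `residuals` are hypothetical, and a local `Seq` stands in for a distributed dataset.

```scala
object ResidualSketch {
  // Hypothetical labeled example; not the MLlib class of a similar name.
  final case class Example(label: Double, features: Vector[Double])

  // Linear prediction: dot product of weights and features.
  def predict(weights: Vector[Double], x: Vector[Double]): Double =
    weights.zip(x).map { case (w, xi) => w * xi }.sum

  // One map over the data: residual = label - prediction.
  def residuals(weights: Vector[Double], data: Seq[Example]): Seq[Double] =
    data.map(e => e.label - predict(weights, e.features))
}
```

On an RDD the same computation would be a single `map`, which is the extra step a user currently has to write after training.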

Issue Links

relates to: SPARK-7412 Designing distributed prediction model abstractions for spark.ml (Resolved)

People

Assignee: Joseph K. Bradley
Reporter: Joseph K. Bradley
Votes: 0
Watchers: 1

Dates

Created: 09/Mar/15 21:39
Updated: 06/May/15 23:34
Resolved: 27/Mar/15 19:32