Spark / SPARK-6233

Should spark.ml Models be distributed by default?


Details

    • Type: Brainstorming
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      This JIRA is for discussing a potential change for the spark.ml package.

      Issue: When an Estimator runs, it often computes helpful side information which is not stored in the returned Model. (E.g., linear methods produce RDDs of residuals.) It would be nice to have this information available by default, rather than having to recompute it.

      Suggestion: Introduce a DistributedModel trait. Every Estimator in the spark.ml package should be able to return a distributed model with extra info computed during training.

      Motivation: This kind of info is one of the most useful aspects of R. E.g., when you train a linear model, you can immediately summarize or plot information about the residuals. For MLlib, the user currently has to take extra steps (and computation time) to recompute this info.
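To make the recomputation step concrete, here is a minimal sketch in plain Scala (collections stand in for RDDs, and `predict` is a hypothetical fitted linear model, not a spark.ml API) of the extra pass over the data the user currently performs:

```scala
object ResidualsSketch {
  // Hypothetical fitted linear model: yHat = 2*x + 1.
  def predict(x: Double): Double = 2.0 * x + 1.0

  def main(args: Array[String]): Unit = {
    // (feature, label) pairs; in Spark this would be an RDD of training data.
    val data = Seq((1.0, 3.5), (2.0, 4.5), (3.0, 7.5))

    // The residuals were already available during training but were discarded,
    // so the user must recompute them with an extra pass over the data.
    val residuals = data.map { case (x, y) => y - predict(x) }

    println(residuals) // prints List(0.5, -0.5, 0.5)
  }
}
```

If the Model returned by training retained these residuals, this pass would be unnecessary.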

      API: My general idea is as follows.

      trait Model
      trait LocalModel extends Model
      trait DistributedModel[LocalModelType <: LocalModel] extends Model {
        /** Convert to the corresponding local model. */
        def toLocal: LocalModelType
      }
      
      class LocalLDAModel extends LocalModel
      class DistributedLDAModel extends DistributedModel[LocalLDAModel] {
        def toLocal: LocalLDAModel
      }
      

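A compile-checked, self-contained version of this sketch (illustrative names only, not the actual spark.ml API; model bodies are stubs), including a caller that tests the returned Model for the LocalModel or DistributedModel trait:

```scala
object DistributedModelSketch {
  trait Model
  trait LocalModel extends Model
  trait DistributedModel[L <: LocalModel] extends Model {
    /** Convert to a local model, dropping distributed side information. */
    def toLocal: L
  }

  // Stub models standing in for real LDA implementations.
  class LocalLDAModel extends LocalModel
  class DistributedLDAModel extends DistributedModel[LocalLDAModel] {
    def toLocal: LocalLDAModel = new LocalLDAModel
  }

  // A caller can inspect which kind of Model it received via pattern matching.
  def describe(m: Model): String = m match {
    case _: DistributedModel[_] => "distributed"
    case _: LocalModel          => "local"
    case _                      => "unknown"
  }

  def main(args: Array[String]): Unit = {
    println(describe(new DistributedLDAModel)) // prints distributed
    println(describe(new LocalLDAModel))       // prints local
  }
}
```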
      Issues with this API:

      • API stability: To keep the API stable in the future, either (a) all models should return DistributedModels, or (b) all models should return Models which can then be tested for the LocalModel or DistributedModel trait.
      • memory “leaks”: Users may not expect models to store references to RDDs, so they may be surprised by how much storage is being used.
      • naturally distributed models: Some models will simply be too large to be converted into LocalModels. It is unclear what to do here.

      Is this worthwhile?
      Pros:

      • Saving computation
      • Easier for users (saves one extra step of computing this info)

      Cons:

      • API issues
      • Limited savings on computation. In general, computing this info may take much less time than model training (e.g., computing residuals vs. training a GLM).

            People

              Assignee: josephkb (Joseph K. Bradley)
              Reporter: josephkb (Joseph K. Bradley)
              Votes: 0
              Watchers: 1
