  1. Spark
  2. SPARK-7412

Designing distributed prediction model abstractions for spark.ml

Details

    • Type: Brainstorming
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML

    Description

      The Pipelines API (spark.ml package) now includes abstractions for single-label prediction: Predictor, Classifier, Regressor. These abstractions assume that models are local, so their single-Row prediction methods can be used as UDFs. We need to think about how to support distributed models in these abstractions.
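
      To make the local-model assumption concrete, here is a minimal sketch in plain Python. It is not the spark.ml API: the class `LocalLinearModel`, its `predict_row` method, and the list-of-dicts dataset are hypothetical stand-ins. The point is only that when the whole model fits in one object, dataset-level prediction reduces to mapping a single-row function over the rows, which is roughly what applying it as a UDF amounts to.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LocalLinearModel:
    """Hypothetical local model: all weights live in one in-memory object."""
    weights: List[float]
    intercept: float

    def predict_row(self, features: List[float]) -> float:
        # Single-row prediction needs only this object and one row.
        return self.intercept + sum(w * x for w, x in zip(self.weights, features))

    def transform(self, rows: List[Dict]) -> List[Dict]:
        # Dataset-level prediction is the single-row function mapped over
        # every row, analogous to applying it as a UDF on a column.
        return [{**row, "prediction": self.predict_row(row["features"])} for row in rows]


model = LocalLinearModel(weights=[0.5, -1.0], intercept=2.0)
rows = [{"features": [2.0, 1.0]}, {"features": [0.0, 3.0]}]
predictions = [r["prediction"] for r in model.transform(rows)]
```

      Here `transform` never needs more than `predict_row`, which is the property a distributed model would break.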

      Should the abstractions be modified somehow? Or should there be parallel (or inheriting) abstractions, or a mix-in?

      Motivation: We may start supporting distributed models since linear models, random forests, and other models can get large enough to merit distributed storage and computation.
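
      As a rough illustration of why a distributed model does not fit the single-row pattern, consider the following plain-Python sketch. It assumes, hypothetically, a linear model whose weight vector is split into shards by feature index. `WeightShard`, `ShardedLinearModel`, and the sparse row representation are illustrative names only, not anything in spark.ml. No single shard can score a row by itself, so prediction becomes a dataset-level operation that combines partial results from every shard.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WeightShard:
    """Hypothetical shard holding weights for a subset of feature indices."""
    weights: Dict[int, float]

    def partial_scores(self, rows: List[Dict[int, float]]) -> Dict[int, float]:
        # A shard can only compute the part of each dot product that
        # involves the feature indices it stores.
        return {
            row_id: sum(self.weights[i] * x for i, x in features.items() if i in self.weights)
            for row_id, features in enumerate(rows)
        }


@dataclass
class ShardedLinearModel:
    """Hypothetical model whose weights are spread over several shards."""
    shards: List[WeightShard]
    intercept: float

    def transform(self, rows: List[Dict[int, float]]) -> List[float]:
        # Prediction is a dataset-level operation: gather the partial
        # scores from every shard and sum them per row.
        totals = defaultdict(float)
        for shard in self.shards:
            for row_id, score in shard.partial_scores(rows).items():
                totals[row_id] += score
        return [self.intercept + totals[i] for i in range(len(rows))]


sharded = ShardedLinearModel(
    shards=[WeightShard({0: 0.5, 1: -1.0}), WeightShard({2: 2.0})],
    intercept=1.0,
)
sparse_rows = [{0: 2.0, 2: 1.0}, {1: 3.0}]  # feature index -> value
sharded_predictions = sharded.transform(sparse_rows)
```

      In a real cluster the per-shard step and the per-row sum would correspond to distributed joins or aggregations, which is why a per-row UDF closure over the model is not enough.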

            People

              Assignee: Unassigned
              Reporter: Joseph K. Bradley (josephkb)
              Votes: 0
              Watchers: 4
