SPARK-7131: Move tree, forest implementation from spark.mllib to spark.ml

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.5.0
    • Component/s: ML, MLlib
    • Labels: None

    Description

      We want to change and improve the spark.ml API for trees and ensembles, but we cannot change the old API in spark.mllib. To support the changes we want to make, we should move the implementation from spark.mllib to spark.ml. We will generalize and modify it, but will also ensure that we do not change the behavior of the old API.

      There are several steps to this:
      1. Copy the implementation over to spark.ml and change the spark.ml classes to use that copy instead of calling the spark.mllib implementation. The current spark.ml tests will ensure that the two implementations learn exactly the same models. Note: This should include performance testing to make sure the updated code does not introduce any regressions. UPDATE: I have run tests using spark-perf, and there were no regressions.
      2. Remove the spark.mllib implementation, and make the spark.mllib APIs wrappers around the spark.ml implementation. The spark.ml tests will again ensure that we do not change any behavior.
      3. Move the unit tests to spark.ml, and change the spark.mllib unit tests to verify model equivalence.
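
      As a rough illustration of the migration pattern in the steps above, here is a toy sketch in plain Python with no Spark dependency. It is not Spark code: every name in it (`Stump`, `legacy_train`, `new_train`, `NewEstimator`, `LegacyApi`, `same_model`) is hypothetical, and the one-split "tree" stands in for the real tree and forest learners. It shows a copied implementation behind the new API (step 1), the old API reduced to a wrapper (step 2), and a model-equivalence check of the kind the tests would perform (step 3).

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, int]  # (feature value, binary label)


@dataclass(frozen=True)
class Stump:
    """Toy one-split tree: predict `left` if x <= threshold, else `right`."""
    threshold: float
    left: int
    right: int

    def predict(self, x: float) -> int:
        return self.left if x <= self.threshold else self.right


def _majority(labels: List[int], default: int = 0) -> int:
    # An empty side falls back to `default`; ties go to label 1.
    return int(2 * sum(labels) >= len(labels)) if labels else default


def _candidates(data: List[Point]) -> List[Stump]:
    # One candidate split per distinct feature value, in ascending order.
    out = []
    for t in sorted({x for x, _ in data}):
        left = _majority([y for x, y in data if x <= t])
        right = _majority([y for x, y in data if x > t], default=left)
        out.append(Stump(t, left, right))
    return out


def _errors(stump: Stump, data: List[Point]) -> int:
    # Number of training points the stump misclassifies.
    return sum(stump.predict(x) != y for x, y in data)


def legacy_train(data: List[Point]) -> Stump:
    """Stand-in for the original implementation (the 'spark.mllib' side)."""
    best, best_err = None, None
    for stump in _candidates(data):
        err = _errors(stump, data)
        if best_err is None or err < best_err:  # strict '<' keeps the first minimum
            best, best_err = stump, err
    return best


def new_train(data: List[Point]) -> Stump:
    """Stand-in for the copied implementation (the 'spark.ml' side)."""
    # min() also returns the first minimum, so tie-breaking matches legacy_train.
    return min(_candidates(data), key=lambda s: _errors(s, data))


class NewEstimator:
    """Step 1: the new API calls the copied implementation directly."""
    def fit(self, data: List[Point]) -> Stump:
        return new_train(data)


class LegacyApi:
    """Step 2: the old API keeps its signature but delegates to the new code."""
    @staticmethod
    def train(data: List[Point]) -> Stump:
        return NewEstimator().fit(data)


def same_model(data: List[Point]) -> bool:
    """Step 3: every entry point must learn the identical model."""
    reference = legacy_train(data)
    return reference == NewEstimator().fit(data) == LegacyApi.train(data)


data = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]
print(same_model(data))
```

      The point of the sketch is the ordering: the legacy implementation stays around as a reference until the equivalence check passes, and only then does the old entry point become a thin wrapper. Details such as tie-breaking have to match exactly, which is why "learn exactly the same models" is a stricter requirement than "reach similar accuracy".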

      This JIRA is now for step 1 only. Steps 2 and 3 will be in separate JIRAs.

      After these updates, we can more safely generalize and improve the spark.ml implementation.

            People

              Assignee: Joseph K. Bradley (josephkb)
              Reporter: Joseph K. Bradley (josephkb)
              Shepherd: Xiangrui Meng
              Votes: 0
              Watchers: 6

                Time Tracking

                  Original Estimate: 168h
                  Remaining Estimate: 168h
                  Time Spent: Not Specified