Spark / SPARK-14831

Make ML APIs in SparkR consistent


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: ML, SparkR
    • Labels: None

    Description

      In current master, we have 4 ML methods in SparkR:

      glm(formula, family, data, ...)
      kmeans(data, centers, ...)
      naiveBayes(formula, data, ...)
      survreg(formula, data, ...)
      

      We tried to keep each signature close to its existing counterpart in R. Put side by side, however, they are not consistent. One example is k-means, which doesn't accept a formula. Rather than treating each method independently, we might want to update the signature of kmeans to

      kmeans(formula, data, centers, ...)
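      To make the formula-first k-means signature concrete, here is a minimal sketch in base R on a local data.frame. It is not SparkR code: `kmeans_formula` is a hypothetical name, and the assumption is that a one-sided formula only selects numeric feature columns.

```r
# Hypothetical helper (not part of SparkR): k-means driven by a formula.
# Assumes every column named in the formula is numeric.
kmeans_formula <- function(formula, data, centers, ...) {
  # A one-sided formula such as ~ x + y names the feature columns.
  features <- as.matrix(stats::model.frame(formula, data = data))
  stats::kmeans(features, centers = centers, ...)
}

set.seed(1)  # k-means picks random starting centers
fit <- kmeans_formula(~ Petal.Length + Petal.Width, data = iris, centers = 3)
table(fit$cluster)  # cluster sizes
```

      With this shape, k-means reads like the other three methods: the formula says which columns to use and `data` says where to find them.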
      

      We can also discuss possible global changes here. For example, `glm` puts `family` before `data`, while `kmeans` puts `centers` after `data`. This is not consistent. Logically, a formula doesn't mean anything unless it is associated with a DataFrame, so it makes more sense to me to have the following signature:

      algorithm(df, formula, [required params], [optional params])
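      A rough sketch of what a data-first signature could look like, written against base R rather than SparkR. The names `fit_glm` and `fit_kmeans` are made up for illustration, and the local `iris` data.frame stands in for a Spark DataFrame.

```r
# Hypothetical wrappers over base R (not SparkR code).
# Both follow algorithm(df, formula, [required params], [optional params]).
fit_glm <- function(df, formula, family = gaussian(), ...) {
  stats::glm(formula, family = family, data = df, ...)
}

fit_kmeans <- function(df, formula, centers, ...) {
  # One-sided formula selects the feature columns (assumed numeric).
  features <- as.matrix(stats::model.frame(formula, data = df))
  stats::kmeans(features, centers = centers, ...)
}

set.seed(42)  # k-means picks random starting centers
glm_fit <- fit_glm(iris, Sepal.Length ~ Sepal.Width + Petal.Length)
km_fit  <- fit_kmeans(iris, ~ Sepal.Length + Sepal.Width, centers = 3)
```

      Both calls start with the same two arguments, so only the algorithm-specific parameters differ from one method to the next.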
      

      If we make this change, we might want to avoid name collisions with the existing R functions, because the signatures would differ. We could use `ml.kmeans`, `ml.glm`, etc.
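      The point of a prefixed name can be illustrated in plain R. In the sketch below only the name `ml.kmeans` comes from the proposal. The body is an assumption built on `stats::kmeans` with a local data.frame, not the SparkR implementation.

```r
# Hypothetical prefixed wrapper; the body is illustrative only.
ml.kmeans <- function(df, formula, centers, ...) {
  # One-sided formula selects the feature columns (assumed numeric).
  features <- as.matrix(stats::model.frame(formula, data = df))
  stats::kmeans(features, centers = centers, ...)
}

set.seed(7)  # k-means picks random starting centers
# New data-first entry point.
new_fit <- ml.kmeans(iris, ~ Petal.Length + Petal.Width, centers = 3)
# The existing base R call still works, because nothing was masked.
old_fit <- kmeans(as.matrix(iris[, 3:4]), centers = 3)
```

      Because the new function has its own name, code written against the original `kmeans(x, centers, ...)` argument order keeps working alongside it.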

      Sorry for raising API changes at the last minute, but I think it would be better to have consistent signatures in SparkR.

      cc: shivaram josephkb yanboliang

            People

              Assignee: timhunter Timothy Hunter
              Reporter: mengxr Xiangrui Meng
              Votes: 0
              Watchers: 6
