[SPARK-14831] Make ML APIs in SparkR consistent - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.0.0
Component/s: ML, SparkR
Labels: None
Target Version/s: 2.0.0

Description

In current master, we have 4 ML methods in SparkR:

glm(formula, family, data, ...)
kmeans(data, centers, ...)
naiveBayes(formula, data, ...)
survreg(formula, data, ...)

We tried to keep each signature similar to its existing counterpart in R. However, when we put them side by side, they are not consistent with each other. One example is k-means, which doesn't accept a formula. Instead of looking at each method independently, we might want to update the signature of `kmeans` to

kmeans(formula, data, centers, ...)

We can also discuss possible global changes here. For example, `glm` puts `family` before `data` while `kmeans` puts `centers` after `data`. This is not consistent. And logically, the formula doesn't mean anything without being associated with a DataFrame. So it makes more sense to me to have the following signature:

algorithm(df, formula, [required params], [optional params])

If we make this change, we might want to avoid name collisions with the existing R functions, because the new methods would have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
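As a rough illustration of the data-first ordering, here is a minimal sketch in plain R. It is not the SparkR implementation: it assumes a local `data.frame` instead of a Spark DataFrame, delegates to `stats::kmeans` and `stats::glm`, and the wrapper bodies are hypothetical. Only the names `ml.kmeans` and `ml.glm` and the argument order come from the proposal.

```r
# Hypothetical wrappers showing algorithm(df, formula, [required], [optional]).
# Assumption: `df` is a local data.frame, not a Spark DataFrame.

ml.kmeans <- function(df, formula, centers, ...) {
  # The formula only has meaning once it is resolved against the data:
  # here it selects the feature columns to cluster on.
  features <- model.frame(formula, data = df)
  stats::kmeans(features, centers = centers, ...)
}

ml.glm <- function(df, formula, family = gaussian(), ...) {
  # Same ordering: data first, then formula, then algorithm parameters.
  stats::glm(formula, family = family, data = df, ...)
}

set.seed(42)  # k-means uses random initial centers
km_fit  <- ml.kmeans(iris, ~ Sepal.Length + Sepal.Width, centers = 3)
glm_fit <- ml.glm(iris, Sepal.Length ~ Sepal.Width + Petal.Length)
```

Because the wrappers use the `ml.` prefix, they do not mask `stats::kmeans` or `stats::glm`, so the base R functions stay available under their usual names and signatures.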

Sorry for discussing API changes at the last minute. But I think it would be better to have consistent signatures in SparkR.

cc: shivaram josephkb yanboliang

Issue Links

is related to

SPARK-14311 Model persistence in SparkR 2.0

Resolved

links to

[Github] Pull Request #12789 (thunterdb)

[Github] Pull Request #12807 (mengxr)

People

Assignee: Timothy Hunter

Reporter: Xiangrui Meng

Votes: 0

Watchers: 6

Dates

Created: 22/Apr/16 00:45

Updated: 30/Apr/16 06:28

Resolved: 30/Apr/16 06:13