[SPARK-4081] Categorical feature indexing - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.1.0
Fix Version/s: 1.4.0
Component/s: MLlib
Labels: None
Target Version/s: 1.4.0

Description

*Updated Description*

Decision trees and tree ensembles require categorical features to be indexed from 0 (0, 1, 2, ...). There is currently no code to help index a dataset. This issue proposes a helper class that computes these indices and also decides which features to treat as categorical.

Proposed functionality:

The helper converts a dataset of vectors whose feature types are unknown into a dataset with some continuous features and some categorical features. A maxCategories parameter determines whether each feature is treated as continuous or categorical.
It can also map categorical feature values to 0-based indices.

This is implemented in the spark.ml package for the Pipelines API, and it stores the indices as column metadata.
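The proposed behaviour can be illustrated with a small, Spark-free sketch in plain Python. It is not the code from the pull request. The function names `fit_category_maps` and `transform` are hypothetical, and two details are assumptions made only for illustration: a feature is treated as categorical when it has at most `max_categories` distinct values, and indices are assigned in ascending order of the original values.

```python
from typing import Dict, List, Sequence


def fit_category_maps(
    rows: Sequence[Sequence[float]], max_categories: int
) -> Dict[int, Dict[float, int]]:
    """Return {feature index: {original value: 0-based category index}}.

    Assumption: a feature is categorical when its number of distinct
    values is <= max_categories; all other features stay continuous.
    """
    if not rows:
        return {}
    num_features = len(rows[0])
    # Collect the distinct values observed for each feature.
    distinct = [set() for _ in range(num_features)]
    for row in rows:
        for j, value in enumerate(row):
            distinct[j].add(value)
    category_maps: Dict[int, Dict[float, int]] = {}
    for j, values in enumerate(distinct):
        if len(values) <= max_categories:
            # Assumption: indices follow the ascending order of the values.
            category_maps[j] = {v: i for i, v in enumerate(sorted(values))}
    return category_maps


def transform(
    rows: Sequence[Sequence[float]],
    category_maps: Dict[int, Dict[float, int]],
) -> List[List[float]]:
    """Replace categorical values with their indices; keep continuous ones.

    A categorical value not seen during fitting raises KeyError here.
    """
    return [
        [
            float(category_maps[j][value]) if j in category_maps else value
            for j, value in enumerate(row)
        ]
        for row in rows
    ]


# Feature 0 has 3 distinct values, feature 1 has 4.
data = [[-1.0, 0.5], [0.0, 1.7], [3.0, 2.9], [-1.0, 4.1]]
maps = fit_category_maps(data, max_categories=3)
indexed = transform(data, maps)
```

With `max_categories=3`, only feature 0 is treated as categorical in this sketch, so its values -1.0, 0.0 and 3.0 become 0.0, 1.0 and 2.0, while feature 1 is left unchanged. The real implementation additionally records the mapping as column metadata, which this sketch does not model.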

Issue Links

blocks

SPARK-6113 Stabilize DecisionTree and ensembles APIs (Resolved)

relates to

SPARK-5886 Add StringIndexer (Resolved)
SPARK-1216 Add a OneHotEncoder for handling categorical features (Resolved)
SPARK-6915 VectorIndexer improvements (Resolved)
SPARK-7585 User guide update for VectorIndexer (Resolved)

links to

[Github] Pull Request #3000 (jkbradley)

People

Assignee: Joseph K. Bradley
Reporter: Joseph K. Bradley
Votes: 2
Watchers: 9

Dates

Created: 24/Oct/14 20:33
Updated: 12/May/15 21:48
Resolved: 13/Apr/15 05:38