Description
*Updated Description*
Decision trees and tree ensembles require that the values of each categorical feature be indexed from 0 (0, 1, 2, ...). There is currently no code to help index a dataset this way. This is a proposal for a helper class that computes these indices and also decides which features to treat as categorical.
Proposed functionality:
- This helps convert a dataset of vectors whose feature types are unknown into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based on a maxCategories parameter: a feature with at most maxCategories distinct values is treated as categorical.
- This can also map categorical feature values to 0-based indices.
This is implemented in the spark.ml package for the Pipelines API. It stores the computed indices as column metadata.
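The two steps in the proposed functionality can be sketched in plain Python. This is only an illustration of the idea, not the spark.ml implementation: the function names `fit_category_maps` and `transform`, the sorted ordering of category values, and the sample data are all assumptions made for this example.

```python
from typing import Dict, List, Sequence


def fit_category_maps(
    rows: Sequence[Sequence[float]], max_categories: int
) -> Dict[int, Dict[float, int]]:
    """Return {feature index: {original value: 0-based category index}}.

    A feature is treated as categorical when it has at most
    max_categories distinct values; other features stay continuous
    and do not appear in the result. (Hypothetical helper.)
    """
    if not rows:
        return {}
    num_features = len(rows[0])
    distinct: List[set] = [set() for _ in range(num_features)]
    for row in rows:
        for j, value in enumerate(row):
            # Stop collecting once a feature has exceeded the limit.
            if len(distinct[j]) <= max_categories:
                distinct[j].add(value)
    return {
        j: {value: idx for idx, value in enumerate(sorted(values))}
        for j, values in enumerate(distinct)
        if len(values) <= max_categories
    }


def transform(
    rows: Sequence[Sequence[float]],
    category_maps: Dict[int, Dict[float, int]],
) -> List[List[float]]:
    """Replace categorical values by their indices; keep continuous values.

    A categorical value not seen during fitting raises KeyError in this sketch.
    """
    return [
        [
            float(category_maps[j][v]) if j in category_maps else v
            for j, v in enumerate(row)
        ]
        for row in rows
    ]


data = [
    [0.0, 10.5, -1.0],
    [1.0, 11.2, 5.0],
    [0.0, 12.9, -1.0],
    [1.0, 13.1, 5.0],
]
maps = fit_category_maps(data, max_categories=2)
indexed = transform(data, maps)
```

With `max_categories=2`, features 0 and 2 (two distinct values each) are treated as categorical and re-indexed, while feature 1 (four distinct values) is left as a continuous feature.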
Issue Links
- blocks
  - SPARK-6113 Stabilize DecisionTree and ensembles APIs (Resolved)
- relates to
  - SPARK-5886 Add StringIndexer (Resolved)
  - SPARK-1216 Add a OneHotEncoder for handling categorical features (Resolved)
  - SPARK-6915 VectorIndexer improvements (Resolved)
  - SPARK-7585 User guide update for VectorIndexer (Resolved)
- links to