[SPARK-6915] VectorIndexer improvements - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 1.4.0
Fix Version/s: None
Component/s: ML
Labels:
- bulk-closed

Description

This issue covers several improvements to VectorIndexer. They could be handled separately or in a single PR.

Preserving metadata

Currently, VectorIndexer preserves non-ML metadata, which differs from StringIndexer's behavior. We should change it so that it does not maintain non-ML metadata.

Currently, it does not preserve ML-specific input metadata in the output column. If a feature is already marked as categorical or continuous, we should preserve that metadata (rather than recomputing it). We should also check that the input data is valid for that metadata.
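The "preserve, then validate" behavior proposed above could look roughly like the following plain-Python sketch. It is not Spark code: `FeatureAttr`, `resolve_attr`, and the rule that a categorical feature with k categories must hold integer values in [0, k) are assumptions made for illustration. Only the idea of a maximum-categories threshold for inferring categorical features comes from VectorIndexer itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FeatureAttr:
    """Hypothetical stand-in for ML attribute metadata on one feature."""
    kind: str                            # "categorical" or "continuous"
    num_categories: Optional[int] = None


def resolve_attr(values: List[float],
                 existing: Optional[FeatureAttr],
                 max_categories: int) -> FeatureAttr:
    """Keep existing metadata if present; otherwise infer it from the data."""
    if existing is None:
        # No input metadata: infer from the number of distinct values.
        distinct = set(values)
        if len(distinct) <= max_categories:
            return FeatureAttr("categorical", len(distinct))
        return FeatureAttr("continuous")
    if existing.kind == "categorical":
        # Assumed validity rule: values are integer indices in [0, num_categories).
        for v in values:
            if not float(v).is_integer() or not (0 <= v < existing.num_categories):
                raise ValueError(f"value {v} is invalid for {existing}")
    # Metadata is preserved as-is rather than recomputed.
    return existing
```

In this sketch, a feature already marked continuous stays continuous even if it happens to have few distinct values, which is the point of preserving the metadata.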

Allow unknown categories

Add an option for allowing unknown categories, probably via a parameter like "allowUnknownCategories".
If true, unknown categories encountered during transform are assigned to an extra category index.
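As a rough illustration of the proposed semantics, here is a minimal plain-Python sketch. It is not Spark's implementation; the function names are hypothetical, and placing the extra index right after the last known category is an assumption, since the issue does not say which index the extra category gets.

```python
from typing import Dict, List


def fit_category_map(column_values: List[float]) -> Dict[float, int]:
    """Map each distinct value seen during fit to a 0-based index (sorted order)."""
    return {v: i for i, v in enumerate(sorted(set(column_values)))}


def transform_value(value: float,
                    category_map: Dict[float, int],
                    allow_unknown_categories: bool) -> int:
    """Index one value, optionally routing unseen values to an extra category."""
    if value in category_map:
        return category_map[value]
    if allow_unknown_categories:
        # Assumption: the extra category takes the next free index.
        return len(category_map)
    raise ValueError(f"unknown category: {value}")
```

For example, after fitting on the values 0.0, 2.0, and 5.0, an unseen value such as 7.0 would map to index 3 when the flag is set and raise an error otherwise.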

Index particular features

Add an option for limiting indexing to particular features.
This could be specified by an explicit option, or we could handle it via the "Preserving metadata" task above, where users would mark features as continuous so that VectorIndexer ignores them.
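The explicit-option variant might behave like the following plain-Python sketch. This is illustrative only: `features_to_index` is a hypothetical parameter name, rows are plain lists rather than Spark vectors, and the distinct-value threshold mirrors VectorIndexer's maximum-categories setting.

```python
from typing import Dict, List, Optional, Set


def fit_category_maps(rows: List[List[float]],
                      max_categories: int,
                      features_to_index: Optional[Set[int]] = None
                      ) -> Dict[int, Dict[float, int]]:
    """Build value-to-index maps for eligible features only."""
    num_features = len(rows[0]) if rows else 0
    # Hypothetical option: None means "consider every feature".
    candidates = range(num_features) if features_to_index is None else sorted(features_to_index)
    maps: Dict[int, Dict[float, int]] = {}
    for j in candidates:
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            maps[j] = {v: i for i, v in enumerate(distinct)}
    return maps


def transform_row(row: List[float], maps: Dict[int, Dict[float, int]]) -> List[float]:
    """Replace values of indexed features; pass all other features through."""
    return [float(maps[j][v]) if j in maps else v for j, v in enumerate(row)]
```

Features left out of `features_to_index` pass through unchanged, even when they have few enough distinct values to be treated as categorical.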

Performance optimizations

See the TODO items in VectorIndexer.scala.

Issue Links

is related to: SPARK-4081 Categorical feature indexing (Resolved)

People

Assignee: Unassigned

Reporter: Joseph K. Bradley

Votes: 1

Watchers: 3

Dates

Created: 14/Apr/15 22:55

Updated: 21/May/19 04:34

Resolved: 21/May/19 04:34