Description
StringIndexer maps labels to numeric indices in descending order of label frequency. Other orderings (e.g., alphabetical) may be needed in feature ETL; for example, the ordering affects the output of one-hot encoding and RFormula. We propose supporting other ordering methods by adding a parameter, stringOrderType, that accepts the following four options:
- 'freq_desc': descending order by label frequency (most frequent label assigned 0)
- 'freq_asc': ascending order by label frequency (least frequent label assigned 0)
- 'alphabet_desc': descending alphabetical order
- 'alphabet_asc': ascending alphabetical order
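
To illustrate how the four options differ, here is a minimal plain-Python sketch of the index assignment. It is not Spark's implementation: the function name build_index is hypothetical, the option strings are the ones proposed in this ticket, and breaking frequency ties alphabetically is an assumption made only so the output is deterministic.

```python
from collections import Counter


def build_index(labels, order_type="freq_desc"):
    """Map each distinct label to an integer index under the given ordering.

    Hypothetical helper for illustration only; frequency ties are broken
    alphabetically (an assumption, not something the ticket specifies).
    """
    counts = Counter(labels)
    if order_type == "freq_desc":
        # Most frequent label gets index 0.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
    elif order_type == "freq_asc":
        # Least frequent label gets index 0.
        ordered = sorted(counts, key=lambda label: (counts[label], label))
    elif order_type == "alphabet_desc":
        ordered = sorted(counts, reverse=True)
    elif order_type == "alphabet_asc":
        ordered = sorted(counts)
    else:
        raise ValueError(f"unsupported order type: {order_type!r}")
    return {label: index for index, label in enumerate(ordered)}


labels = ["b", "a", "b", "c", "b", "a"]
for option in ("freq_desc", "freq_asc", "alphabet_desc", "alphabet_asc"):
    print(option, build_index(labels, option))
```

With these sample labels, "b" occurs three times, "a" twice, and "c" once, so each of the four options assigns a different label to index 0 or orders the remainder differently. That is why the choice changes which category a downstream one-hot encoder treats as first or last.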
Issue Links
- is related to: SPARK-20899 PySpark supports stringIndexerOrderType in RFormula (Resolved)
- relates to: SPARK-23231 Add doc for string indexer ordering to user guide (also to RFormula guide) (Resolved)
- links to