[SPARK-17219] QuantileDiscretizer should handle NaN values gracefully - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.1.0
Component/s: ML
Labels: None
Target Version/s: 2.1.0

Description

How is the QuantileDiscretizer supposed to handle null values?
Actual nulls are not allowed, so I replace them with Double.NaN.
However, when you run the QuantileDiscretizer on a column that contains NaNs, it creates one or more NaN splits before the final PositiveInfinity value.
I am using the attached Titanic CSV data and trying to bin the "age" column using the QuantileDiscretizer with 10 bins specified. The age column has a lot of null values.
These are the splits that I get:

-Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity

Is that expected? It seems to imply that NaN is larger than any positive number and less than infinity.
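One plausible way such splits can arise is sketched below in plain Python. This is not Spark's code: `naive_quantile_splits` is a hypothetical exact-quantile stand-in for the approximate quantile computation, the sample ages are made up (not the Titanic data), and the sketch assumes that NaN orders after every other value. Under those assumptions, once enough rows are NaN, the upper quantile positions land on NaN.

```python
import math

def naive_quantile_splits(values, num_bins):
    """Hypothetical exact-quantile stand-in; not Spark's approximate algorithm."""
    # Assumption: NaN orders after every other value, so it sorts to the end.
    ordered = sorted(
        values,
        key=lambda v: (math.isnan(v), 0.0 if math.isnan(v) else v),
    )
    n = len(ordered)
    # Take the value at each interior quantile position.
    inner = [ordered[(i * n) // num_bins] for i in range(1, num_bins)]
    # Sentinel infinities bracket the interior splits.
    return [-math.inf] + inner + [math.inf]

nan = float("nan")
# Made-up sample: 8 known ages and 4 missing ones encoded as NaN.
ages = [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0, nan, nan, nan, nan]
splits = naive_quantile_splits(ages, num_bins=4)
print(splits)  # [-inf, 26.0, 38.0, nan, inf]
```

With a third of the rows missing, the last interior split is NaN, which mirrors the shape of the splits reported above.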
I'm not sure of the best way to handle nulls, but I think they need a bucket all their own. My suggestion would be to include an initial NaN split value that is always there, just like the sentinel Infinities are. If that were the case, then the splits for the example above might look like this:

NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity

This does not seem great either, because a bucket that is [NaN, -Inf] doesn't make much sense. I'm not sure whether the NaN bucket should count toward numBins or not. I do think it should always be there, though, in case future data has nulls even though the fit data did not. Thoughts?
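An alternative to putting NaN into the splits array is to keep the splits purely numeric and route NaN to an extra bucket at assignment time. The sketch below is only an illustration of that idea in plain Python, not the behavior of any Spark release: `bucket_index` is a hypothetical helper, and placing the NaN bucket after the regular buckets (rather than counting it toward numBins) is an assumption.

```python
import bisect
import math

def bucket_index(value, splits):
    """Hypothetical helper: map a value to a bucket, with NaN in its own bucket."""
    num_regular = len(splits) - 1
    if math.isnan(value):
        # Assumption: the NaN bucket is one extra bucket after the regular ones.
        return num_regular
    if value == splits[-1]:
        # The last regular bucket is closed on the right.
        return num_regular - 1
    # Regular buckets are half-open: [splits[i], splits[i + 1]).
    return bisect.bisect_right(splits, value) - 1

# Splits from the example above, with the NaN entries left out.
splits = [-math.inf, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, math.inf]

for age in (10.0, 15.0, 50.0, float("nan")):
    print(age, "->", bucket_index(age, splits))
```

Because the NaN bucket exists regardless of the fit data, NaN values seen later still have somewhere to go, and no bucket has NaN as a boundary.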

Attachments

Issue Links

relates to

SPARK-17498 StringIndexer.setHandleInvalid should have another option 'new'

Resolved

links to

[Github] Pull Request #14858 (VinceShieh)

[Github] Pull Request #15428 (VinceShieh)

People

Assignee: Vincent

Reporter: Barry Becker

Shepherd: Joseph K. Bradley

Votes: 1

Watchers: 7

Dates

Created: 24/Aug/16 17:57

Updated: 07/Feb/17 22:17

Resolved: 27/Oct/16 18:52