Description
Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to DenseVectors. If the data are numpy arrays with dtype float64, this works. If the data are numpy arrays of lower precision (e.g. float16 or float32), they should be upcast to float64, but due to a small bug on this line the upcast currently doesn't happen (the cast is not performed in place):
if ar.dtype != np.float64: ar.astype(np.float64)  # result of astype is discarded, so ar keeps its original dtype
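For reference, numpy's astype always returns a new array rather than modifying its argument, so the line above is effectively a no-op with respect to ar. A minimal standalone illustration in plain numpy (independent of the Spark code path):

import numpy as np

ar = np.array([1.0, 2.0, 3.0], dtype=np.float32)
ar.astype(np.float64)   # returns a new float64 array, which is immediately discarded
print(ar.dtype)         # float32 -- unchanged, mirroring the bug above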
Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results:
from numpy import random
from pyspark.mllib.clustering import KMeans

data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 5  # should be 10!
But this works fine:
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 10  # this is correct
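One plausible reading of the halved dimension in the first example (an illustration only, not the actual SerDe code path): 10 float32 values occupy 40 bytes, and 40 bytes reinterpreted as float64 yield only 5 values. In plain numpy:

import numpy as np

row = np.random.randn(10).astype(np.float32)  # 10 values = 40 bytes
reinterpreted = row.view(np.float64)          # 40 bytes / 8 bytes per float64 = 5 "values"
print(len(reinterpreted))                     # 5 -- garbage values, halved length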
The fix is trivial; I'll submit a PR shortly.
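For the record, the presumably intended one-line change (a sketch only; the actual PR may differ):

if ar.dtype != np.float64:
    ar = ar.astype(np.float64)  # assign the result back so the upcast actually takes effect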