Description
Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to DenseVectors. If the data are numpy arrays with dtype float64, this works. If the data are numpy arrays of lower precision (e.g. float16 or float32), they should be upcast to float64, but due to a small bug on this line the upcast currently doesn't happen (the cast is not performed in place):
if ar.dtype != np.float64: ar.astype(np.float64)  # result of astype is discarded, so ar keeps its original dtype
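For reference, numpy's astype always returns a new array rather than modifying its argument, so the line above is effectively a no-op with respect to ar. A minimal standalone illustration in plain numpy (independent of the Spark code path):

import numpy as np

ar = np.array([1.0, 2.0, 3.0], dtype=np.float32)
ar.astype(np.float64)   # returns a new float64 array, which is immediately discarded
print(ar.dtype)         # float32 -- unchanged, mirroring the bug above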
Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results:
from numpy import random
from pyspark.mllib.clustering import KMeans

data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 5  # should be 10!
But this works fine:
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 10  # this is correct
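One plausible reading of the halved dimension in the first example (an illustration only, not the actual SerDe code path): 10 float32 values occupy 40 bytes, and 40 bytes reinterpreted as float64 yield only 5 values. In plain numpy:

import numpy as np

row = np.random.randn(10).astype(np.float32)  # 10 values = 40 bytes
reinterpreted = row.view(np.float64)          # 40 bytes / 8 bytes per float64 = 5 "values"
print(len(reinterpreted))                     # 5 -- garbage values, halved length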
The fix is trivial; I'll submit a PR shortly.
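For the record, the presumably intended one-line change (a sketch only; the actual PR may differ):

if ar.dtype != np.float64:
    ar = ar.astype(np.float64)  # assign the result back so the upcast actually takes effect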