Description
In VectorAssembler, when input column lengths cannot be inferred and handleInvalid = "keep", a runtime exception is thrown with a message like the one below:
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [column1, column2].
However, even if you set a vector size hint for column1, the message stays the same and still lists [column1, column2] instead of only [column2]. This is inconsistent with what the error message itself says.
This makes the exception hard to resolve, because I cannot tell which columns still require a VectorSizeHint. It is especially troublesome when there is a large number of columns to deal with.
Here is a simple example:
// run in spark-shell, where spark.implicits._ is already imported
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
import org.apache.spark.ml.linalg.Vectors

// create a df without vector size metadata
val df = Seq(
  (Vectors.dense(1.0), Vectors.dense(2.0))
).toDF("n1", "n2")

// only set the vector size hint for the n1 column
val hintedDf = new VectorSizeHint()
  .setInputCol("n1")
  .setSize(1)
  .transform(df)

// assemble n1 and n2
val output = new VectorAssembler()
  .setInputCols(Array("n1", "n2"))
  .setOutputCol("features")
  .setHandleInvalid("keep")
  .transform(hintedDf)

// because only n1 has a vector size, the error message should tell us to set the vector size for n2 too
output.show()
Expected error message:
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
Actual error message:
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
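Until the message is accurate, the columns that still lack size metadata can be found by inspecting the schema directly. The following is only a sketch of such a workaround: columnsMissingSize is a hypothetical helper name, not part of Spark, and it assumes that AttributeGroup.size reports -1 for a vector column whose size is unknown.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not a Spark API): lists the vector-typed input columns
// whose schema metadata does not record a vector size.
def columnsMissingSize(df: DataFrame, inputCols: Seq[String]): Seq[String] =
  inputCols
    .map(c => df.schema(c))
    // only vector columns carry an attribute group; other types are skipped
    .filter(_.dataType == SQLDataTypes.VectorType)
    // assumption: size is -1 when no size metadata is attached
    .filter(f => AttributeGroup.fromStructField(f).size == -1)
    .map(_.name)
```

Called as columnsMissingSize(hintedDf, Seq("n1", "n2")) on the example above, this should leave only n2, since n1 already has a size hint.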
I changed one line in VectorAssembler.scala so that the error message lists only the columns that are actually missing size metadata.
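The idea behind the change can be illustrated with a simplified, Spark-free model. This is not the actual code in VectorAssembler.scala: keepInvalidMessage and its input shape are hypothetical, and -1 is assumed to stand for "size unknown". The point is only that the message is built from the columns with unknown size rather than from all input columns.

```scala
// Simplified model of the check; all names here are hypothetical.
// Each pair is (column name, known vector size), with -1 meaning "unknown".
def keepInvalidMessage(sizes: Seq[(String, Int)]): Option[String] = {
  // keep only the columns whose size could not be read from metadata
  val missingColumns = sizes.collect { case (name, -1) => name }
  if (missingColumns.isEmpty) None
  else Some(
    "Can not infer column lengths with handleInvalid = \"keep\". " +
      "Consider using VectorSizeHint to add metadata for columns: " +
      missingColumns.mkString("[", ", ", "]") + ".")
}

// n1 has a size hint, n2 does not, so only n2 is reported
keepInvalidMessage(Seq("n1" -> 1, "n2" -> -1)).foreach(println)
```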