Description
A. DenseVector-based BinaryObjectVectorizer
When using existing caches as a source of Datasets, the BinaryObjectVectorizer is used.
The existing BinaryObjectVectorizer only supports the creation of a SparseVector.
The LUDecomposition utility, which supports Gaussian factorization for models like GMM, has a "singularity indicator". Because a SparseVector treats a null value as zero (0.0), the corresponding matrix column calculation yields 0.0, which is below the minimum check value (1e-11), so the matrix is flagged as singular.
This null handling in SparseVector restricts the use of algorithms like Gaussian Mixture Models, where any Vector dimension that is null will incorrectly signal that the matrix is singular.
It would be great if we could:
- Have a BinaryObjectVectorizer that produces a DenseVector, to eliminate this singularity trigger and enable use of the GMM Trainer.
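To illustrate the singularity trigger described above, here is a minimal, self-contained sketch in plain Java. It is not Ignite's LUDecomposition. The class name `SingularityDemo` and the method `isSingular` are hypothetical; only the 1e-11 threshold comes from the description. It runs Gaussian elimination with partial pivoting and reports a matrix as singular as soon as the best available pivot falls below the threshold, which is what happens when a feature column is entirely 0.0.

```java
final class SingularityDemo {
    /** Minimum pivot magnitude, taken from the value quoted in this issue. */
    static final double SINGULARITY_THRESHOLD = 1e-11;

    /**
     * Hypothetical helper: returns true if elimination with partial pivoting
     * meets a pivot whose magnitude is below the threshold.
     */
    static boolean isSingular(double[][] input) {
        int n = input.length;

        // Work on a copy so the caller's matrix is left untouched.
        double[][] a = new double[n][];
        for (int i = 0; i < n; i++)
            a[i] = input[i].clone();

        for (int col = 0; col < n; col++) {
            // Pick the row with the largest magnitude in this column.
            int pivotRow = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivotRow][col]))
                    pivotRow = r;
            }

            // A column that is (numerically) all zeros has no usable pivot.
            if (Math.abs(a[pivotRow][col]) < SINGULARITY_THRESHOLD)
                return true;

            double[] tmp = a[col];
            a[col] = a[pivotRow];
            a[pivotRow] = tmp;

            // Eliminate the entries below the pivot.
            for (int r = col + 1; r < n; r++) {
                double factor = a[r][col] / a[col][col];
                for (int c = col; c < n; c++)
                    a[r][c] -= factor * a[col][c];
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Second column is all zeros, as if a null feature had been mapped to 0.0.
        double[][] withZeroColumn = {{2, 0, 1}, {1, 0, 3}, {4, 0, 5}};
        double[][] fullRank = {{2, 1, 1}, {1, 3, 2}, {1, 0, 5}};

        System.out.println(isSingular(withZeroColumn));
        System.out.println(isSingular(fullRank));
    }
}
```

In this sketch a single all-zero column is enough to trip the check, regardless of the values in the other columns.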
B. CacheBasedDatasets not treated as Temporary Cache
When using a cache-based dataset, the close() method destroys the Ignite cache. This means that the data loaded into the dataset cannot be reused.
It would be great if we could:
- Not destroy the Ignite Cache holding the dataset on close (of one step in an ML processing flow)
- Allow for "attaching" to this prior, pre-calculated dataset in subsequent use.
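The two requests above can be sketched in plain Java as follows. This is only an illustration of the intended semantics, not Ignite code: a `ConcurrentHashMap` stands in for the cluster's named caches, and `DatasetCacheSketch`, `CachedDataset`, `destroyOnClose` and `attach` are all hypothetical names that do not exist in the Ignite ML API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class DatasetCacheSketch {
    /** Stand-in for named caches in a cluster (assumption, not an Ignite API). */
    static final Map<String, Map<Integer, double[]>> CACHES = new ConcurrentHashMap<>();

    /** Hypothetical dataset whose backing cache may outlive close(). */
    static final class CachedDataset implements AutoCloseable {
        final String cacheName;
        final boolean destroyOnClose;

        CachedDataset(String cacheName, boolean destroyOnClose) {
            this.cacheName = cacheName;
            this.destroyOnClose = destroyOnClose;
            // Create the backing cache if it does not exist yet.
            CACHES.computeIfAbsent(cacheName, k -> new ConcurrentHashMap<>());
        }

        void put(int key, double[] features) {
            CACHES.get(cacheName).put(key, features);
        }

        int size() {
            Map<Integer, double[]> cache = CACHES.get(cacheName);
            return cache == null ? 0 : cache.size();
        }

        /** Destroys the backing cache only when the dataset is temporary. */
        @Override public void close() {
            if (destroyOnClose)
                CACHES.remove(cacheName);
        }
    }

    /** Hypothetical "attach": reuse a cache left behind by an earlier step. */
    static CachedDataset attach(String cacheName) {
        if (!CACHES.containsKey(cacheName))
            throw new IllegalStateException("No dataset cache named " + cacheName);
        return new CachedDataset(cacheName, false);
    }
}
```

A first step in a processing flow would create the dataset with `destroyOnClose` set to `false`, load it, and close it. A later step would then call `attach` with the same cache name instead of reloading the data.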
C. Vector Visibility
Vectors (unlike other value types, e.g. BinaryObjects) are not readable in standard tools such as the Ignite Web Console, because their toString() method does not present any information about the embedded vector values.
It would be great if we could:
- Have a Vector.toString() implementation that presents some information about what is actually in the Vector.
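As a rough sketch of what such a toString() could look like, the class below prints the vector size and its first few values. `ReadableVector` and `MAX_SHOWN` are hypothetical names used for illustration. This is not the Ignite Vector implementation, and the output format is only one possible choice.

```java
import java.util.Arrays;

final class ReadableVector {
    /** Cap on printed values so large vectors stay readable (assumed limit). */
    private static final int MAX_SHOWN = 5;

    private final double[] values;

    ReadableVector(double... values) {
        this.values = values.clone();
    }

    /** Shows the size and the leading values, marking truncation with "...". */
    @Override public String toString() {
        int shown = Math.min(values.length, MAX_SHOWN);
        String head = Arrays.toString(Arrays.copyOf(values, shown));

        // Replace the closing bracket with an ellipsis when values were cut off.
        if (values.length > shown)
            head = head.substring(0, head.length() - 1) + ", ...]";

        return "ReadableVector [size=" + values.length + ", values=" + head + "]";
    }
}
```

Truncating the output keeps console views usable for high-dimensional vectors while still showing enough to sanity-check the data.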
I have implemented the above items and have used them at a customer where I needed these capabilities (or at least, they dramatically reduced the cost and increased the value of the solution).
It would be great if the community was supportive of this expansion/improvement of the Ignite ML library.
Thanks,
Glenn