Details
- Story
- Status: Resolved
- Major
- Resolution: Fixed
- SystemDS 3.1
- None
Description
The current unique() builtin is implemented as a DML script, unique.dml, which is not optimized for the Spark backend. We should therefore create an optimized implementation of unique().
The new function should behave similarly to that in R: https://www.geeksforgeeks.org/unique-function-in-r/
Discussion of the requirements can be found here:
https://github.com/apache/systemds/pull/1714
API Design
1. Row aggregation: remove duplicate rows
* R: unique() removes duplicate rows, e.g.
> df <- matrix(rep(1:6,length.out=9),nrow = 3,ncol=3,byrow = T)
> df
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 1 2 3
> unique(df)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
* SystemDS: the same can be achieved using the unique() sketch like so:
unique(X, dir="r")
2. Col aggregation: remove duplicate cols
* R: unique() removes duplicate rows, so we can obtain the desired result using transpose, like so:
> df <- matrix(rep(1:6,length.out=9),nrow = 3,ncol=3)
> df
[,1] [,2] [,3]
[1,] 1 4 1
[2,] 2 5 2
[3,] 3 6 3
> df_t <- t(df)
> df_t
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 1 2 3
> df_t_ = unique(df_t)
> df_t_
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
> df_t_t = t(df_t_)
> df_t_t
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
* SystemDS: the same can be achieved using the unique() sketch like so:
unique(X, dir="c")
3. RowCol aggregation: return only unique values in given matrix
* SystemDS
X = [[1, 1], [2, 2], [3, 3]]
unique(X) will return X' = [[1], [2], [3]]
* R
This is similar to how unique() operates on vectors in R:
> df <- c(1, 1, 2, 2, 3, 3)
> df
[1] 1 1 2 2 3 3
> unique(df)
[1] 1 2 3
The difference is that SystemDS's unique() will support this not only for vectors but also for matrices.
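The three modes above can be summarized in a small reference model. This is only a sketch of the intended semantics in plain Python, with nested lists standing in for matrices; it is not SystemDS code and says nothing about how the Spark implementation should work. The function names are hypothetical, and keeping the first occurrence of each row, column, or value in its original order is an assumption borrowed from R's behavior, not something this ticket specifies.

```python
# Reference semantics only: nested lists stand in for matrices.
# Function names are hypothetical; output order (first occurrence kept)
# is an assumption taken from R, not a stated SystemDS guarantee.

def unique_rows(X):
    """Mode 1, dir="r": drop duplicate rows."""
    seen = set()
    out = []
    for row in X:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            out.append(list(row))
    return out


def unique_cols(X):
    """Mode 2, dir="c": transpose, drop duplicate rows, transpose back."""
    transposed = [list(col) for col in zip(*X)]
    return [list(row) for row in zip(*unique_rows(transposed))]


def unique_values(X):
    """Mode 3, no dir: distinct values of the whole matrix as a column vector."""
    seen = set()
    out = []
    for row in X:
        for value in row:
            if value not in seen:
                seen.add(value)
                out.append([value])
    return out


# The inputs below are the examples used in this ticket.
print(unique_rows([[1, 2, 3], [4, 5, 6], [1, 2, 3]]))  # [[1, 2, 3], [4, 5, 6]]
print(unique_cols([[1, 4, 1], [2, 5, 2], [3, 6, 3]]))  # [[1, 4], [2, 5], [3, 6]]
print(unique_values([[1, 1], [2, 2], [3, 3]]))         # [[1], [2], [3]]
```

A model like this could serve as an oracle when writing tests for the optimized implementation, provided the tests compare results in a way that does not depend on output order if the final implementation does not guarantee one.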