[SYSTEMDS-3463] Add unique() built-in function - ASF JIRA

Details

Type: Story
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: SystemDS 3.1
Fix Version/s: SystemDS 3.1
Component/s: Builtins
Labels: None

Description

The current unique() builtin is implemented as a script, unique.dml, which is not optimized for the Spark backend. We should therefore create an optimized implementation of unique().

The new function should behave similarly to that in R: https://www.geeksforgeeks.org/unique-function-in-r/

Discussion of the requirements can be found here:

https://github.com/apache/systemds/pull/1714

API Design

1. Row aggregation: remove duplicate rows

* R: unique() removes duplicate rows, e.g.:

> df <- matrix(rep(1:6, length.out = 9), nrow = 3, ncol = 3, byrow = TRUE)
> df
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    1    2    3
> unique(df)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

* SystemDS: the same can be achieved using the unique() sketch like so:

unique(X, dir="r")
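
For the R side of the row case, the filtering that unique() performs can be made explicit with duplicated(). This is only a sketch of the semantics on the same example matrix; the names is_dup and unique_rows are illustrative and do not come from the issue.

```r
# Same 3x3 matrix as in the R example: rows 1 and 3 are identical.
df <- matrix(rep(1:6, length.out = 9), nrow = 3, ncol = 3, byrow = TRUE)

# duplicated() on a matrix compares whole rows (MARGIN = 1 by default) and
# flags every repeat of an earlier row.
is_dup <- duplicated(df)

# Keep the first occurrence of each row; drop = FALSE keeps the result a
# matrix even if only one row survives.
unique_rows <- df[!is_dup, , drop = FALSE]
```

Here only the third row is flagged, so unique_rows holds the same two rows that unique(df) returns.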

2. Col aggregation: remove duplicate cols

* R: unique() removes duplicate rows by default, so we can obtain the desired result by transposing, removing duplicate rows, and transposing back, like so:

> df <- matrix(rep(1:6, length.out = 9), nrow = 3, ncol = 3)
> df
     [,1] [,2] [,3]
[1,]    1    4    1
[2,]    2    5    2
[3,]    3    6    3
> df_t <- t(df)
> df_t
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    1    2    3
> df_t_ = unique(df_t)
> df_t_
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
> df_t_t = t(df_t_)
> df_t_t
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

* SystemDS: the same can be achieved using the unique() sketch like so:

unique(X, dir="c")
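
As a side note on the R behavior for the column case, the transpose round trip can be written in one expression, and base R's unique() for matrices also accepts a MARGIN argument that selects columns directly. The sketch below compares the two on the same example matrix; the variable names via_transpose and via_margin are illustrative only.

```r
# Same 3x3 matrix as in the R example: columns 1 and 3 are identical.
df <- matrix(rep(1:6, length.out = 9), nrow = 3, ncol = 3)

# Transpose, drop duplicate rows, transpose back.
via_transpose <- t(unique(t(df)))

# Equivalent for this input: let unique() compare columns (MARGIN = 2).
via_margin <- unique(df, MARGIN = 2)
```

Both results are 3x2 matrices that keep the first occurrence of each distinct column, which is the behavior dir="c" is meant to mirror.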

3. RowCol aggregation: return only the unique values in the given matrix

* SystemDS

X = [[1, 1], [2, 2], [3, 3]]
unique(X) will return X' = [[1], [2], [3]]

* R

This is similar to how unique() operates on vectors in R:

> df <- c(1, 1, 2, 2, 3, 3)
> df
[1] 1 1 2 2 3 3
> unique(df)
[1] 1 2 3

The difference is that SystemDS' unique() will support this not only for vectors but also for matrices.
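
To illustrate that difference, here is a rough R sketch of the RowCol case on the matrix X from the SystemDS example. Calling unique() on this matrix in R would leave it unchanged, because all three rows are distinct, so the matrix is flattened to a vector first. The ordering and exact output shape of the SystemDS result are assumptions here; the issue only shows a single-column result.

```r
# X from the SystemDS example: [[1, 1], [2, 2], [3, 3]].
X <- matrix(c(1, 1, 2, 2, 3, 3), nrow = 3, byrow = TRUE)

# Flatten the matrix to a plain vector, then deduplicate its values.
vals <- unique(as.vector(X))

# Return the values as a single-column matrix, mirroring X' in the example
# (assumption: one unique value per row, in order of first occurrence).
X_unique <- matrix(vals, ncol = 1)
```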

Issue Links

links to

GitHub Pull Request #1740

People

Assignee: Unassigned

Reporter: Badrul Chowdhury

Votes: 0

Watchers: 2

Dates

Created: 08/Nov/22 18:06

Updated: 15/Mar/23 19:13

Resolved: 27/Jan/23 14:26