  1. SystemDS
  2. SYSTEMDS-3463

Add unique() built-in function

Details

    • Type: Story
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: SystemDS 3.1
    • Fix Version/s: SystemDS 3.1
    • Component/s: Builtins
    • Labels: None

    Description

      The current unique() builtin is implemented as a DML script, unique.dml, which is not optimized for the Spark backend. We should therefore create an optimized implementation of unique().

      The new function should behave similarly to unique() in R: https://www.geeksforgeeks.org/unique-function-in-r/

      Discussion of the requirements can be found here: 

      https://github.com/apache/systemds/pull/1714

      API Design

      1. Row aggregation: remove duplicate rows

          * R: unique() removes duplicate rows, e.g.

              > df <- matrix(rep(1:6,length.out=9),nrow = 3,ncol=3,byrow = T)
              > df
                  [,1] [,2] [,3]
              [1,]    1    2    3
              [2,]    4    5    6
              [3,]    1    2    3
              > unique(df)
                  [,1] [,2] [,3]
              [1,]    1    2    3
              [2,]    4    5    6

          * SystemDS: the same can be achieved using the unique() sketch like so:

              unique(X, dir="r")

      2. Col aggregation: remove duplicate cols

          * R: unique() removes duplicate rows, so we can obtain the desired result using transpose, like so:

              > df <- matrix(rep(1:6,length.out=9),nrow = 3,ncol=3)
              > df
                  [,1] [,2] [,3]
              [1,]    1    4    1
              [2,]    2    5    2
              [3,]    3    6    3
              > df_t <- t(df)
              > df_t
                  [,1] [,2] [,3]
              [1,]    1    2    3
              [2,]    4    5    6
              [3,]    1    2    3
              > df_t_ = unique(df_t)
              > df_t_
                  [,1] [,2] [,3]
              [1,]    1    2    3
              [2,]    4    5    6
              > df_t_t = t(df_t_)
              > df_t_t
                  [,1] [,2]
              [1,]    1    4
              [2,]    2    5
              [3,]    3    6

          * SystemDS: the same can be achieved using the unique() sketch like so:

              unique(X, dir="c")

      3. RowCol aggregation: return only the unique values in the given matrix

          * SystemDS: given

              X = [[1, 1], [2, 2], [3, 3]]

            unique(X) will return X' = [[1], [2], [3]]

          * R: this is similar to how unique() operates on vectors in R:

              > df <- c(1, 1, 2, 2, 3, 3)
              > df
              [1] 1 1 2 2 3 3
              > unique(df)
              [1] 1 2 3

          The difference is that SystemDS' unique() will support this not only for vectors but also for matrices.
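
      The three modes above can be summarised as a small reference sketch. This is plain Python for illustration only, not the SystemDS implementation and not DML: the function names unique_rows, unique_cols and unique_values are hypothetical, matrices are modelled as lists of rows, and keeping first-occurrence order is an assumption taken from R's behaviour (this issue does not specify the output order of the SystemDS builtin).

```python
# Illustrative reference semantics only; not the SystemDS implementation.
# Matrices are plain lists of rows. All function names are hypothetical.
# Assumption: first-occurrence order is kept, as R's unique() does.


def unique_rows(matrix):
    """Drop duplicate rows, keeping the first occurrence of each."""
    seen = set()
    result = []
    for row in matrix:
        key = tuple(row)  # lists are unhashable, so compare rows as tuples
        if key not in seen:
            seen.add(key)
            result.append(list(row))
    return result


def transpose(matrix):
    """Swap rows and columns of a rectangular matrix."""
    return [list(col) for col in zip(*matrix)]


def unique_cols(matrix):
    """Drop duplicate columns: transpose, deduplicate rows, transpose back."""
    return transpose(unique_rows(transpose(matrix)))


def unique_values(matrix):
    """Collect the distinct cell values into a single-column matrix."""
    seen = set()
    result = []
    for row in matrix:
        for value in row:
            if value not in seen:
                seen.add(value)
                result.append([value])
    return result


if __name__ == "__main__":
    by_row = [[1, 2, 3], [4, 5, 6], [1, 2, 3]]   # example from item 1
    by_col = [[1, 4, 1], [2, 5, 2], [3, 6, 3]]   # example from item 2
    print(unique_rows(by_row))
    print(unique_cols(by_col))
    print(unique_values([[1, 1], [2, 2], [3, 3]]))  # example from item 3
```

      Under these assumptions, unique_rows corresponds to dir="r", unique_cols to dir="c" (using the same transpose trick as the R transcript), and unique_values to the call without a direction.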

       

            People

              Assignee: Unassigned
              Reporter: Badrul Chowdhury (badrul_c)