Spark / SPARK-23899

Built-in SQL Function Improvement

Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      This umbrella JIRA tracks improvements to the built-in SQL functions that bring Spark closer to other data processing systems, including Hive, Teradata, Presto, Postgres, MySQL, DB2, Oracle, and MS SQL Server.

        Sub-Tasks

          1. Add support for date extract (Sub-task, Resolved, Yuming Wang)
          2. format_number udf should take user specified format as argument (Sub-task, Resolved, Yuming Wang)
          3. Data Masking Functions (Sub-task, Resolved, Marco Gaido)
          4. Provide an option in months_between UDF to disable rounding-off (Sub-task, Resolved, Marco Gaido)
          5. Add UDF trunc(numeric) (Sub-task, Resolved, Yuming Wang)
          6. Add UDF weekday (Sub-task, Resolved, yucai)
          7. Support regr_* functions (Sub-task, Resolved, Marco Gaido)
          8. High-order function: transform(array<T>, function<T, U>) → array<U> (Sub-task, Resolved, Takuya Ueshin)
          9. High-order function: filter(array<T>, function<T, boolean>) → array<T> (Sub-task, Resolved, Takuya Ueshin)
          10. High-order function: aggregate(array<T>, initialState S, inputFunction<S, T, S>, outputFunction<S, R>) → R (Sub-task, Resolved, Takuya Ueshin)
          11. High-order function: array_distinct(x) → array (Sub-task, Resolved, Huaxin Gao)
          12. High-order function: array_intersect(x, y) → array (Sub-task, Resolved, Kazuaki Ishizaki)
          13. High-order function: array_union(x, y) → array (Sub-task, Resolved, Kazuaki Ishizaki)
          14. High-order function: array_except(x, y) → array (Sub-task, Resolved, Kazuaki Ishizaki)
          15. High-order function: array_join(x, delimiter, null_replacement) → varchar (Sub-task, Resolved, Marco Gaido)
          16. High-order function: array_max(x) → x (Sub-task, Resolved, Marco Gaido)
          17. High-order function: array_min(x) → x (Sub-task, Resolved, Marco Gaido)
          18. High-order function: array_position(x, element) → bigint (Sub-task, Resolved, Kazuaki Ishizaki)
          19. High-order function: array_remove(x, element) → array (Sub-task, Resolved, Huaxin Gao)
          20. High-order function: arrays_overlap(x, y) → boolean (Sub-task, Resolved, Marco Gaido)
          21. High-order function: array_sort(x) → array (Sub-task, Resolved, Kazuaki Ishizaki)
          22. High-order function: element_at (Sub-task, Resolved, Kazuaki Ishizaki)
          23. High-order function: concat(array1, array2, ..., arrayN) → array (Sub-task, Resolved, Marek Novotny)
          24. High-order function: flatten(x) → array (Sub-task, Resolved, Marek Novotny)
          25. High-order function: repeat(element, count) → array (Sub-task, Resolved, Florent Pepin)
          26. High-order function: reverse(x) → array (Sub-task, Resolved, Marek Novotny)
          27. High-order function: sequence (Sub-task, Resolved, Alex Vayda)
          28. High-order function: shuffle(x) → array (Sub-task, Resolved, Huizhi Lu)
          29. High-order function: slice(x, start, length) → array (Sub-task, Resolved, Marco Gaido)
          30. High-order function: cardinality(x) → bigint (Sub-task, Resolved, Kazuaki Ishizaki)
          31. High-order function: array_zip(array1, array2[, ...]) → array<row> (Sub-task, Resolved, Dylan Guedes)
          32. High-order function: zip_with(array<T>, array<U>, function<T, U, R>) → array<R> (Sub-task, Resolved, Sandeep Singh)
          33. High-order function: map(array<K>, array<V>) → map<K,V> (Sub-task, Resolved, Kazuaki Ishizaki)
          34. High-order function: map_from_entries(array<row<K, V>>) → map<K,V> (Sub-task, Resolved, Marek Novotny)
          35. High-order function: map_entries(map<K, V>) → array<row<K,V>> (Sub-task, Resolved, Marek Novotny)
          36. High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V> (Sub-task, Resolved, Bruce Robbins)
          37. High-order function: map_filter(map<K, V>, function<K, V, boolean>) → MAP<K,V> (Sub-task, Resolved, Marco Gaido)
          38. High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, V3>) → map<K, V3> (Sub-task, Resolved, Marek Novotny)
          39. High-order function: transform_keys(map<K1, V>, function<K1, V, K2>) → map<K2,V> (Sub-task, Resolved, Neha Patil)
          40. High-order function: transform_values(map<K, V1>, function<K, V1, V2>) → map<K, V2> (Sub-task, Resolved, Neha Patil)
          41. High-order function: zip_with_index (Sub-task, Resolved, Unassigned)
          42. High-order function: exists(array<T>, function<T, boolean>) → boolean (Sub-task, Resolved, Takuya Ueshin)
          43. High-order function: filter(array<T>, function<T, Int, boolean>) → array<T> (Sub-task, Resolved, Henry Davidge)
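The lambda-taking signatures in the list above can be read more easily with a small model. What follows is a plain-Python sketch of what those signatures suggest, not Spark's implementation: the helper names mirror the sub-task titles (`filter_` has a trailing underscore only to avoid shadowing Python's built-in), and null handling, type coercion, and duplicate-key behaviour are deliberately left out. In Spark SQL the equivalent calls would be written with lambda syntax such as `transform(array(1, 2, 3), x -> x + 1)`.

```python
from functools import reduce


def transform(xs, f):
    """transform(array<T>, function<T, U>) -> array<U>"""
    return [f(x) for x in xs]


def filter_(xs, pred):
    """filter(array<T>, function<T, boolean>) -> array<T>"""
    return [x for x in xs if pred(x)]


def aggregate(xs, initial_state, input_function, output_function=lambda s: s):
    """aggregate(array<T>, S, function<S, T, S>, function<S, R>) -> R

    Folds the array into a state, then maps the final state to the result.
    """
    return output_function(reduce(input_function, xs, initial_state))


def exists(xs, pred):
    """exists(array<T>, function<T, boolean>) -> boolean"""
    return any(pred(x) for x in xs)


def map_filter(m, pred):
    """map_filter(map<K, V>, function<K, V, boolean>) -> map<K, V>"""
    return {k: v for k, v in m.items() if pred(k, v)}


def transform_keys(m, f):
    """transform_keys(map<K1, V>, function<K1, V, K2>) -> map<K2, V>"""
    return {f(k, v): v for k, v in m.items()}


def transform_values(m, f):
    """transform_values(map<K, V1>, function<K, V1, V2>) -> map<K, V2>"""
    return {k: f(k, v) for k, v in m.items()}


if __name__ == "__main__":
    print(transform([1, 2, 3], lambda x: x + 1))
    print(aggregate([1, 2, 3], 0, lambda acc, x: acc + x, lambda acc: acc * 10))
    print(transform_values({"a": 1, "b": 2}, lambda k, v: v * 2))
```

Note that the map lambdas receive both the key and the value, even when only one of them is rewritten, which matches the `function<K, V, ...>` shape in the signatures.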

          Activity

            wajda Alex Vayda added a comment (edited)

            What do you guys think about adding another set of convenience functions for working with multi-dimensional arrays, e.g. matrix operations such as transpose and multiply?
            Something similar to ml.linalg.Matrix.

            cloud_fan Wenchen Fan added a comment -

            I'm resolving it, since there is only one subtask unfinished, which is minor to this entire story.

            georg.kf.heiler@gmail.com Georg Heiler added a comment -

            What about repartitioning by a property of a complex type, e.g. the size of an array? See https://stackoverflow.com/questions/46240688/how-to-equally-partition-array-data-in-spark-dataframe

            Assume the number of records n in the data frame is almost constant, but the number of observations m in each record's array determines the real computational cost. A regular repartition then only ensures roughly equal numbers of records per partition and does not consider the size of the arrays.

            Ideally, I would want to make sure that arrays with many elements do not end up in the same partition, in order to prevent data skew.
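To make the request above concrete, here is a minimal plain-Python sketch of size-aware assignment. It is not a Spark API and not something the comment proposes verbatim: `assign_partitions` and `partition_loads` are hypothetical helpers, and the greedy "largest array first, into the currently lightest partition" rule is swapped in as one simple way to balance total element counts. The resulting index could then serve as a partitioning key.

```python
import heapq


def assign_partitions(array_sizes, num_partitions):
    """Return one partition index per record, balancing total array elements.

    Greedy heuristic (an assumption of this sketch): visit records from the
    largest array to the smallest and place each into the partition that
    currently holds the fewest elements.
    """
    # Min-heap of (elements assigned so far, partition index).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)

    assignment = [None] * len(array_sizes)
    largest_first = sorted(
        range(len(array_sizes)), key=lambda i: array_sizes[i], reverse=True
    )
    for i in largest_first:
        load, p = heapq.heappop(heap)
        assignment[i] = p
        heapq.heappush(heap, (load + array_sizes[i], p))
    return assignment


def partition_loads(array_sizes, assignment, num_partitions):
    """Total number of array elements that ended up in each partition."""
    loads = [0] * num_partitions
    for size, p in zip(array_sizes, assignment):
        loads[p] += size
    return loads


if __name__ == "__main__":
    sizes = [100, 90, 5, 4, 3, 2]  # hypothetical array lengths, one per record
    parts = assign_partitions(sizes, 2)
    print(parts, partition_loads(sizes, parts, 2))
```

With these sample sizes the two large arrays land in different partitions, whereas splitting by record count alone could place both in the same one.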


            tashoyan Arseniy Tashoyan added a comment -

            What do you think about this one: SPARK-23693?

            People

              Assignee: Unassigned
              Reporter: smilegator Xiao Li
              Votes: 3
              Watchers: 26
