Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: SQL
    • Labels: None

        Activity

          rxin Reynold Xin added a comment -

          thunterdb can we use your implementation for percentile_approx?

          thunterdb Tim Hunter added a comment -

          We should; the algorithm picked is optimized for this use case.

          proflin Liwei Lin(Inactive) added a comment - edited

          Hive's percentile_approx implementation computes approximate percentile values from a histogram (please refer to Hive/GenericUDAFPercentileApprox.java and Hive/NumericHistogram.java for details):

          • Hive's percentile_approx's signature is: _FUNC_(expr, pc, [nb])
          • parameter [nb] – the number of histogram bins to use – is optionally specified by users
          • if the number of unique values in the actual dataset is less than or equal to this [nb], we can expect an exact result; otherwise there are no approximation guarantees

          Our Dataset's approxQuantile() implementation is not really histogram-based (and thus differs from Hive's implementation):

          • our Dataset's approxQuantile()'s signature is something like: _FUNC_(expr, pc, relativeError)
          • parameter relativeError is specified by users and should be in [0, 1]; our approximation is deterministically bounded by this relativeError (see the sketch after this list) – please refer to Spark/DataFrameStatFunctions.scala for details
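
          For concreteness, a minimal Scala sketch of the existing API (the DataFrame df and the column name "x" are placeholders; approxQuantile is the method in DataFrameStatFunctions):

          import org.apache.spark.sql.DataFrame

          // Deterministic guarantee: for each requested quantile p, the rank of the
          // returned value is within (p - relativeError) * N .. (p + relativeError) * N
          // of the exact rank, where N is the number of rows.
          def quartiles(df: DataFrame): Array[Double] =
            df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.01)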

          Since there's no direct deterministic relationship between [nb] and relativeError, it seems hard to build Hive's percentile_approx on top of our Dataset's approxQuantile(). So should we: (a) port Hive's implementation into Spark and provide _FUNC_(expr, pc, [nb]) on top of it, or (b) provide _FUNC_(expr, pc, relativeError) directly on top of our Dataset's approxQuantile() implementation, even though this might be incompatible with Hive? rxin, thunterdb could you share some thoughts? Thanks!

          vectorijk Kai Jiang added a comment -

          I also noticed that there is an inconsistency between Hive's approach and the Dataset's approach. Which one should we go with? Since it's a function passed over to Hive, I vote to port Hive's implementation to Spark. rxin, thunterdb could you share some ideas on this? Thanks! I would also love to try this one once we decide which way to go.

          thunterdb Tim Hunter added a comment -

          Are we trying to reproduce Hive's results here? In that case, there is no choice but to port Hive's code. If we just want an equivalent result, then we can use the following pseudo-Python code:

          def percentile_approx(df, x, num_hist):
            # map the histogram bin count to a relative-error target, floored at
            # 1e-3; 1.0/num_hist avoids integer division under Python 2
            return quantile_approx(df, x, max(1.0 / num_hist, 1e-3))
          

          The final result has the advantage over Hive's of coming with theoretical bounds on the error. The only issue is that the runtime in this case is O(num_hist^2) (instead of linear), if I remember correctly.
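
          In Scala terms, that mapping would look roughly like the sketch below (the shim name percentileApprox is hypothetical, and the 1e-3 floor comes from the pseudocode above, not from any shipped API):

          import org.apache.spark.sql.DataFrame

          // Hypothetical shim: emulate Hive's percentile_approx(expr, pc, [nb]) on top
          // of the existing approxQuantile by translating the histogram bin count nb
          // into a relative-error target, floored at 1e-3 as suggested above.
          def percentileApprox(df: DataFrame, col: String,
                               pc: Array[Double], numHist: Int): Array[Double] =
            df.stat.approxQuantile(col, pc, math.max(1.0 / numHist, 1e-3))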

          Also, if we want to spend more time on improving the algorithms, I would prefer something that has some known guarantees rather than something completely novel.

          rxin Reynold Xin added a comment -

          We just need a function; it doesn't need to be identical to Hive's result.


          proflin Liwei Lin(Inactive) added a comment -

          Thanks for the clarification. I'm working on this one, thanks!

          apachespark Apache Spark added a comment -

          User 'lw-lin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14237

          apachespark Apache Spark added a comment -

          User 'lw-lin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14237

          apachespark Apache Spark added a comment -

          User 'lw-lin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14298

          apachespark Apache Spark added a comment -

          User 'lw-lin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14237

          clockfly Sean Zhong added a comment - edited

          Created a sub-task SPARK-17188 to move QuantileSummaries to the package org.apache.spark.sql.util of the catalyst project.

          apachespark Apache Spark added a comment -

          User 'clockfly' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14868

          cloud_fan Wenchen Fan added a comment -

          Issue resolved by pull request 14868
          https://github.com/apache/spark/pull/14868

          apachespark Apache Spark added a comment -

          User 'lw-lin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14237

          erlu chenerlu added a comment - edited

          Hi, I am a little confused about percentile_approx. Is it different from Hive's now? Will we get a different result when the input is the same?

          For example, I ran select percentile_approx(c4_double, array(0.1,0.2,0.3,0.4)) from test; and got a different result.

          c4_double is shown below:
          1.00000001
          2.00000001
          3.00000001
          4.00000001
          5.00000001
          6.00000001
          7.00000001
          8.00000001
          9.00000001
          NULL
          -8.952
          -96.0

          Hive:
          [-87.2952,-6.961599997999999,1.3000000099999998,2.4000000100000003]

          Spark 2.x:
          [-8.952,1.00000001,2.00000001,3.00000001]

          So which result is right? Could you please reply when you are free.

          rxin lwlin

          ZenWzh Zhenhua Wang added a comment - edited

          erlu I think it's been made clear in the discussion above: Spark's result doesn't have to be the same as Hive's.
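
          For anyone who wants to check the Spark side of this comparison, here is a minimal, self-contained Scala sketch using the public approxQuantile API (a local SparkSession and the 0.001 relative error are arbitrary choices; the NULL row from the example is omitted):

          import org.apache.spark.sql.SparkSession

          object ApproxQuantileRepro {
            def main(args: Array[String]): Unit = {
              val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
              import spark.implicits._
              // the non-null values from the example above
              val df = Seq(1.00000001, 2.00000001, 3.00000001, 4.00000001, 5.00000001,
                6.00000001, 7.00000001, 8.00000001, 9.00000001, -8.952, -96.0).toDF("c4_double")
              // Spark returns actual values observed in the column (within the requested
              // rank error), while Hive interpolates inside histogram bins, which is why
              // the two systems can legitimately return different numbers on the same input.
              println(df.stat.approxQuantile("c4_double", Array(0.1, 0.2, 0.3, 0.4), 0.001)
                .mkString("[", ", ", "]"))
              spark.stop()
            }
          }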


          People

            Assignee: clockfly Sean Zhong
            Reporter: rxin Reynold Xin
            Votes: 0
            Watchers: 10
