Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: SQL
    • Labels: None

        Activity

          rxin Reynold Xin added a comment -

          thunterdb can we use your implementation for percentile_approx?

          thunterdb Tim Hunter added a comment -

          We should; the algorithm picked is optimized for this use case.

          proflin Liwei Lin(Inactive) added a comment - edited

          Hive's percentile_approx implementation computes approximate percentile values from a histogram (please refer to Hive/GenericUDAFPercentileApprox.java and Hive/NumericHistogram.java for details):

          • Hive's percentile_approx's signature is: _FUNC_(expr, pc, [nb])
          • parameter [nb] – the number of histogram bins to use – is optionally specified by users
          • if the number of unique values in the actual dataset is less than or equal to this [nb], we can expect an exact result; otherwise there are no approximation guarantees

          Our Dataset's approxQuantile() implementation is not really histogram-based (and thus differs from Hive's implementation):

          • our Dataset's approxQuantile()'s signature is something like: _FUNC_(expr, pc, relativeError)
          • parameter relativeError is specified by users and should be in [0, 1]; our approximation is deterministically bounded by this relativeError (see the sketch after this list) – please refer to Spark/DataFrameStatFunctions.scala for details
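
          For concreteness, a minimal Scala sketch of the existing API (the DataFrame df and the column name "x" are placeholders; approxQuantile is the method in DataFrameStatFunctions):

          import org.apache.spark.sql.DataFrame

          // Deterministic guarantee: for each requested quantile p, the rank of the
          // returned value is within (p - relativeError) * N .. (p + relativeError) * N
          // of the exact rank, where N is the number of rows.
          def quartiles(df: DataFrame): Array[Double] =
            df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.01)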

          Since there's no direct deterministic relationship between [nb] and relativeError, it seems hard to build Hive's percentile_approx on top of our Dataset's approxQuantile(). So should we: (a) port Hive's implementation into Spark and provide _FUNC_(expr, pc, [nb]) on top of it, or (b) provide _FUNC_(expr, pc, relativeError) directly on top of our Dataset's approxQuantile() implementation, even though this might be incompatible with Hive? rxin, thunterdb could you share some thoughts? Thanks!

          vectorijk Kai Jiang added a comment -

          I also noticed that there is an inconsistency between Hive's approach and the Dataset's approach. Which one should we go with? Since it's a function passed over to Hive, I vote to port Hive's implementation to Spark. rxin, thunterdb could you share some ideas on this? Thanks! I would also love to try this one once we decide which way to go.

          thunterdb Tim Hunter added a comment -

          Are we trying to reproduce Hive's results here? In that case, there is no choice but to port Hive's code. If we just want an equivalent result, then we can use the following pseudo-Python code:

          def percentile_approx(df, x, num_hist):
            # map the histogram bin count to a relative-error target, floored at
            # 1e-3; 1.0/num_hist avoids integer division under Python 2
            return quantile_approx(df, x, max(1.0 / num_hist, 1e-3))
          

          The final result has the advantage over Hive's of coming with theoretical bounds on the error. The only issue is that the runtime in this case is O(num_hist^2) (instead of linear), if I remember correctly.
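
          In Scala terms, that mapping would look roughly like the sketch below (the shim name percentileApprox is hypothetical, and the 1e-3 floor comes from the pseudocode above, not from any shipped API):

          import org.apache.spark.sql.DataFrame

          // Hypothetical shim: emulate Hive's percentile_approx(expr, pc, [nb]) on top
          // of the existing approxQuantile by translating the histogram bin count nb
          // into a relative-error target, floored at 1e-3 as suggested above.
          def percentileApprox(df: DataFrame, col: String,
                               pc: Array[Double], numHist: Int): Array[Double] =
            df.stat.approxQuantile(col, pc, math.max(1.0 / numHist, 1e-3))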

          Also, if we want to spend more time on improving the algorithms, I would prefer something that has some known guarantees rather than something completely novel.

          rxin Reynold Xin added a comment -

          We just need a function; it doesn't need to be identical to Hive's result.


          proflin Liwei Lin(Inactive) added a comment -

          Thanks for the clarification. I'm working on this one, thanks!

          apachespark Apache Spark added a comment -

          User 'lw-lin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14237

          apachespark Apache Spark added a comment -

          User 'lw-lin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14237

          apachespark Apache Spark added a comment -

          User 'lw-lin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14298

          apachespark Apache Spark added a comment -

          User 'lw-lin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14237

          clockfly Sean Zhong added a comment - edited

          Created a sub-task SPARK-17188 to move QuantileSummaries to the package org.apache.spark.sql.util of the catalyst project.

          apachespark Apache Spark added a comment -

          User 'clockfly' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14868

          cloud_fan Wenchen Fan added a comment -

          Issue resolved by pull request 14868
          https://github.com/apache/spark/pull/14868

          apachespark Apache Spark added a comment -

          User 'lw-lin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14237

          erlu chenerlu added a comment - edited

          Hi, I am a little confused about percentile_approx. Is it different from Hive's now? Will we get a different result when the input is the same?

          For example, I ran select percentile_approx(c4_double, array(0.1,0.2,0.3,0.4)) from test; and got a different result.

          c4_double is shown below:
          1.00000001
          2.00000001
          3.00000001
          4.00000001
          5.00000001
          6.00000001
          7.00000001
          8.00000001
          9.00000001
          NULL
          -8.952
          -96.0

          Hive:
          [-87.2952,-6.961599997999999,1.3000000099999998,2.4000000100000003]

          Spark 2.x:
          [-8.952,1.00000001,2.00000001,3.00000001]

          So which result is right? Could you please reply when you are free.

          rxin lwlin

          ZenWzh Zhenhua Wang added a comment - edited

          erlu I think it's been made clear in the discussion above: Spark's result doesn't have to be the same as Hive's.
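
          For anyone who wants to check the Spark side of this comparison, here is a minimal, self-contained Scala sketch using the public approxQuantile API (a local SparkSession and the 0.001 relative error are arbitrary choices; the NULL row from the example is omitted):

          import org.apache.spark.sql.SparkSession

          object ApproxQuantileRepro {
            def main(args: Array[String]): Unit = {
              val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
              import spark.implicits._
              // the non-null values from the example above
              val df = Seq(1.00000001, 2.00000001, 3.00000001, 4.00000001, 5.00000001,
                6.00000001, 7.00000001, 8.00000001, 9.00000001, -8.952, -96.0).toDF("c4_double")
              // Spark returns actual values observed in the column (within the requested
              // rank error), while Hive interpolates inside histogram bins, which is why
              // the two systems can legitimately return different numbers on the same input.
              println(df.stat.approxQuantile("c4_double", Array(0.1, 0.2, 0.3, 0.4), 0.001)
                .mkString("[", ", ", "]"))
              spark.stop()
            }
          }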


          People

            Assignee: clockfly Sean Zhong
            Reporter: rxin Reynold Xin
            Votes: 0
            Watchers: 10
