Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
Description
Currently parquet-format specifies the sort order for floating point numbers as follows:
* FLOAT - signed comparison of the represented value
* DOUBLE - signed comparison of the represented value
The problem is that the comparison of floating point numbers is only a partial ordering, with strange behaviour in specific corner cases. For example, according to IEEE 754, -0 is neither less than nor greater than +0, and comparing NaN to anything always returns false. This ordering is not suitable for statistics. Additionally, the Java implementation already uses a different (total) ordering that handles these cases correctly but differently than the C++ implementations, which leads to interoperability problems.
TypeDefinedOrder for doubles and floats should be deprecated and a new TotalFloatingPointOrder should be introduced. The default for writing doubles and floats would be the new TotalFloatingPointOrder. This ordering should be effective and easy to implement in all programming languages.
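As an illustration only (not the normative definition of the proposed TotalFloatingPointOrder), a total order with the properties described above, where -0.0 sorts strictly before +0.0 and NaN sorts after every other value, could look like the following sketch; the class and method names are made up. This mirrors the behaviour of `java.lang.Double.compare`:

```java
// Illustrative sketch only, not the normative definition: a total order
// where -0.0 sorts strictly before +0.0 and NaN sorts after all other
// values, including +Infinity.
class TotalOrderSketch {
    public static int compareTotal(double a, double b) {
        if (a < b) return -1;  // ordinary IEEE comparison handles most pairs
        if (a > b) return 1;
        // Here a and b are IEEE-equal (e.g. -0.0 vs +0.0) or at least one
        // is NaN; comparing the canonical bit patterns separates the signed
        // zeros and places NaN above everything else.
        return Long.compare(Double.doubleToLongBits(a),
                            Double.doubleToLongBits(b));
    }
}
```

Such an order is cheap to implement in any language, which is one of the stated requirements.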
Attachments
Issue Links
- Blocked
  - IMPALA-7304 Impala shouldn't write column indexes for float columns until PARQUET-1222 is resolved (Resolved)
- blocks
  - IMPALA-6539 Implement specification-compliant floating point comparison (Open)
  - PARQUET-1223 [parquet-mr] Implement specification-compliant floating point comparison (Open)
  - PARQUET-1224 [C++] Implement specification-compliant floating point comparison (Resolved)
- is related to
  - ARROW-12264 [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down (In Progress)
  - PARQUET-1246 Ignore float/double statistics in case of NaN (Resolved)
  - PARQUET-1251 Clarify ambiguous min/max stats for FLOAT/DOUBLE (Closed)
Activity
I do not think the order proposed matches the IEEE 754 totalOrder. It would be good to match that order so as to avoid creating yet one more thing for users to track. Also, it is possible that the IEEE totalOrder is also faster in some hardware.
I haven't checked, but it is possible that the IEEE-754 ordering is actually a lexicographical compare of the bit representations in some endianness or another.
jbapple The proposed order was intentionally different from IEEE 754 totalOrder, because of the following rules of totalOrder:
3) If x and y represent the same floating-point datum:
   i) If x and y have negative sign, totalOrder(x, y) is true if and only if the exponent of x ≥ the exponent of y;
   ii) otherwise totalOrder(x, y) is true if and only if the exponent of x ≤ the exponent of y.
This has led me to believe that different bit patterns may correspond to the same numeric value, in which case considering them different would be problematic, as it could lead to row groups being dropped or not based on what bit pattern was used to save the data and what bit pattern is used for looking them up. However, thinking more about it, it seems to me that apart from +0/-0 and the various NaNs, all other bit patterns correspond to different numbers. Since the opposite assumption was the reason for the proposed order, I am removing it for now.
I think Jim is right. IEEE-754 numbers are ordered correctly if you flip the sign bit and use unsigned, byte-wise comparison. I wrote a spec for encoding HBase keys that used this a while ago.
The reason why rule 3 works is that for normal floating point numbers, the significand must start with a 1. Conceptually, this means that 0.0001 and 0.00001 cannot be represented with the same exponent. Because the exponent basically encodes where the first set bit is in the number, it can be used for sorting. There is also support for very small numbers where the significand doesn't start with 1, but those must use the smallest-possible exponent so sorting still works.
My understanding is that positive numbers are ordered like integers with the same bits, but negative numbers are ordered in the reverse order than integers with the same bits. This is similar to how decimal numbers are ordered, due to consisting of a sign and an absolute value.
Having said that, I also agree that we could use this ordering. However, I think we should define it in terms of how it should behave (probably with lots of examples), instead of just pointing to the IEEE 754 specification, which is not available openly.
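The sign-flip trick described above can be sketched as follows; the class and method names are illustrative, and raw bits are used, so distinct NaN payloads map to distinct keys, matching the IEEE 754 totalOrder behaviour discussed earlier:

```java
// Sketch of the sign-flip trick: transform the raw bits of a float so that
// an unsigned (byte-wise, big-endian) comparison of the keys yields a
// total order over the values.
class SortableFloatKey {
    public static int sortableBits(float f) {
        int bits = Float.floatToRawIntBits(f);
        // Non-negative floats: set the sign bit, moving them above all
        // negatives. Negative floats: flip every bit, which both clears
        // the sign bit and reverses their magnitude order.
        return bits >= 0 ? bits ^ Integer.MIN_VALUE : ~bits;
    }

    public static int compareTotal(float a, float b) {
        return Integer.compareUnsigned(sortableBits(a), sortableBits(b));
    }
}
```

Under this mapping -0.0 sorts just below +0.0, and NaNs with a positive sign bit sort above +Infinity.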
I updated this JIRA to distinguish it from PARQUET-1251. To summarize:
- PARQUET-1251 is a "hotfix" that describes a workaround for handling statistics written using the ambiguous specification.
- This JIRA is about specifying a well-defined sort order.
I'll note that Parquet C++ now has the following behaviour:
- signed zeros are properly ordered (ARROW-5562)
- NaNs are ignored when computing min/max (PARQUET-1225); if a page or column chunk only has NaNs, the statistics are unset
pitrou, I guess what you've described is the write path of the statistics. Because you cannot control other writers, I would suggest following the spec for the read path.
Meanwhile, I've done some investigation in the parquet-mr code and the format and there are issues related to this topic.
- We have created the ColumnOrder object and the related field in the format to specify the ordering of the columns and to prepare for the potential solution of this (and similar) issues. We reference this field in the Statistics object used for row-group level stats. Meanwhile, we do not reference it in the column indexes, so for column indexes it is not clear which sort orders we want to use and how to handle cases like this. How is this implemented in parquet-cpp?
- Based on the referenced workaround we handle the special floating point values at row-group level in parquet-mr but only for the read path. For the write path we still write these values.
- For column indexes we handle these values but only for the write path and not for the read path.
So, we have a couple of issues around this topic and it would be great to have a final and well-defined solution for it.
Some answers after looking through the code:
- parquet-cpp does not read nor write ColumnIndex
- our handling of min_value and max_value on the read path is naive. We use the same comparisons regardless of whether ColumnOrder is present or not. In particular, we use the native type-specific greater-or-equal comparison (e.g. floating-point comparison), which is bound to fail with NaNs (but will succeed with signed zeros).
I'd propose the following "fix":
- Add a new optional bool field "contains_nan" to the Statistics struct. When unset, we specify that the semantics of comparisons involving -0.0/+0.0, NaN, etc. are not well defined and that implementations have taken different routes.
- When set to true, it means the column contains at least one NaN; when set to false, it means no NaNs are present. Furthermore, when set, it implies the following ordering:
  - NaNs are never included in min/max statistics in the struct.
  - -0.0 and +0.0 are considered two distinct values and are ordered according to sign.
Thoughts? Should I bring this up on the mailing list?
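For illustration, the reader-side semantics of the proposed optional contains_nan flag could be modeled like this; the names are hypothetical, and a boxed Boolean stands in for a Thrift optional field, with null meaning unset:

```java
// Hypothetical sketch of the proposed optional "contains_nan" statistics
// field. A null Boolean models an unset Thrift optional field.
class NanFlagSketch {
    // Can this column chunk possibly contain NaN values?
    public static boolean mayContainNaN(Boolean containsNan) {
        if (containsNan == null) {
            // Unset: the file predates the proposal and its semantics are
            // undefined, so a reader must conservatively assume NaNs.
            return true;
        }
        return containsNan;
    }
}
```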
emkornfield, I think we do not need to handle NaN values with a boolean to fix this issue. NaN is kind of similar to null values, so we may even count them instead of having a boolean, but this question is not tightly related to this topic.
What do you think about elevating the current suggestion in the thrift file to specification level for writing/reading FP min/max values?
Because the sort order is not specified properly for floating point values (partial ordering vs. total ordering), the following compatibility rules should be applied when reading statistics:
- If the min is a NaN, it should be ignored.
- If the max is a NaN, it should be ignored.
- If the min is +0, the row group may contain -0 values as well.
- If the max is -0, the row group may contain +0 values as well.
- When looking for NaN values, min and max should be ignored.
For writing, we shall skip NaN values and use -0 for min and +0 for max whenever a zero is to be taken into account.
With this solution we cannot do anything clever when searching for a NaN, but that can be fixed separately. We also need to double-check whether we really ignore the min/max stats when searching for a NaN.
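A minimal sketch of these write-side rules (skip NaNs; widen zeros so min is written as -0 and max as +0); the class is illustrative and not parquet-mr API:

```java
// Illustrative accumulator for FLOAT/DOUBLE min/max statistics following
// the write-side rules above; not actual parquet-mr API.
class DoubleStatsSketch {
    public double min = Double.POSITIVE_INFINITY;
    public double max = Double.NEGATIVE_INFINITY;
    public boolean hasValues = false;  // if false, statistics stay unset

    public void update(double v) {
        if (Double.isNaN(v)) return;  // rule: NaNs never enter min/max
        hasValues = true;
        if (Double.compare(v, min) < 0) min = v;
        if (Double.compare(v, max) > 0) max = v;
        // rule: a zero min is written as -0.0 and a zero max as +0.0, so
        // a reader probing for either signed zero never skips this data
        if (min == 0.0) min = -0.0;
        if (max == 0.0) max = +0.0;
    }
}
```

If a page or column chunk contains only NaNs, hasValues stays false and the statistics are left unset, matching the parquet-cpp behaviour mentioned earlier.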
I think it is a good idea to discuss such topics on the mailing list. However, we should also time-box the discussion and go forward with a proposed solution if there are no interests on the mailing list. (Personally, I do not follow the dev list anymore.)
(side note: the ML is mostly a firehose of notifications nowadays, which doesn't make it easy to follow...)
I agree with gszadovszky for elevating these rules at the specification level.
Elevating this to the specification level seems fine. I was under the impression the thrift file was the specification? Where do we need to open the PR to elevate them?
There are a couple of docs in the parquet-format repo. The related ones are [about logical types|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] and the main one that contains the description of the primitive types. Unfortunately, the latter one does not contain anything about sorting order.
So, I think, we need to do the following:
- Define the sorting order for the primitive types or reference the logical types description for it. (In most cases it would be referencing since the ordering depends on the related logical types e.g. signed/unsigned sorting of integral types)
- After defining the sorting order of the primitive floating point numbers based on what we've discussed above reference it from the new half-precision FP logical type.
(Another unfortunate thing is that we have some specification-like docs at the parquet site as well. I think we should propagate the parquet-format docs to there automatically or simply link them from the site. But it is clearly a different topic.)
emkornfield opened a new pull request, #185:
URL: https://github.com/apache/parquet-format/pull/185
emkornfield commented on PR #185:
URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1307760898
@pitrou @gszadovszky
emkornfield commented on PR #185:
URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1338880938
@gszadovszky thanks for the feedback I tried to address it in the latest commit.
pitrou commented on code in PR #185:
URL: https://github.com/apache/parquet-format/pull/185#discussion_r1041067303
##########
README.md:
##########
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
[logical-types]: LogicalTypes.md
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:
Review Comment:
```suggestion
Parquet stores min/max statistics at several levels (such as Column Chunk,
Column Index and Data Page). Comparison for values of a type obey the
following rules:
```
##########
README.md:
##########
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
[logical-types]: LogicalTypes.md
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:
+
+1. Each logical type has a specified comparison order. If a column is
+ annotated with an unknown logical type, statistics may not be used
+ for pruning data. The sort order for logical types is documented in
+ the [LogicalTypes.md][logical-types] page.
+2. For primitives the following sort orders apply:
+
+ * BOOLEAN - false, true
+ * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are
+ not totally ordered due to special case like NaN. They require special
+ handling when reading statistics. The details are documented in parquet.thrift in the
+ `ColumnOrder` union. They are summarized
Review Comment:
Need to synchronize this with the final wording from `parquet.thrift`.
##########
README.md:
##########
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
[logical-types]: LogicalTypes.md
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:
+
+1. Each logical type has a specified comparison order. If a column is
+ annotated with an unknown logical type, statistics may not be used
+ for pruning data. The sort order for logical types is documented in
+ the [LogicalTypes.md][logical-types] page.
+2. For primitives the following sort orders apply:
Review Comment:
```suggestion
2. For primitive types, the following rules apply:
```
##########
src/main/thrift/parquet.thrift:
##########
@@ -902,6 +902,13 @@ union ColumnOrder {
- - If the min is +0, the row group may contain -0 values as well.
- - If the max is -0, the row group may contain +0 values as well.
- - When looking for NaN values, min and max should be ignored.
+ *
+ * When writing statistics the following rules should be followed:
+ * - NaNs should not be written to min or max statistics fields.
+ * - Only -0 should be written into min statistics fields (if only
+ * +0 is present in the column it should be converted to -0.0).
+ * - Only +0 should be written into a max statistics fields (if
+ * only -0 is present it must be convereted to +0).
Review Comment:
Suggestion to make wording clearer.
```suggestion
- - If the computed max value is zero (whether negative or positive),
- `+0.0` should be written into the max statistics field.
- - If the computed min value is zero (whether negative or positive),
- `-0.0` should be written into the min statistics field.
```
##########
README.md:
##########
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
[logical-types]: LogicalTypes.md
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:
+
+1. Each logical type has a specified comparison order. If a column is
+ annotated with an unknown logical type, statistics may not be used
+ for pruning data. The sort order for logical types is documented in
+ the [LogicalTypes.md][logical-types] page.
+2. For primitives the following sort orders apply:
+
+ * BOOLEAN - false, true
+ * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are
Review Comment:
I would suggest making a separate list item for floating-point:
```suggestion
- INT32, INT64 - Signed comparison.
- FLOAT, DOUBLE - Signed comparison with special handling of NaNs
and signed zeros. The details are documented in...
```
emkornfield commented on code in PR #185:
URL: https://github.com/apache/parquet-format/pull/185#discussion_r1041775669
##########
README.md:
##########
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
[logical-types]: LogicalTypes.md
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:
+
+1. Each logical type has a specified comparison order. If a column is
+ annotated with an unknown logical type, statistics may not be used
+ for pruning data. The sort order for logical types is documented in
+ the [LogicalTypes.md][logical-types] page.
+2. For primitives the following sort orders apply:
+
+ * BOOLEAN - false, true
+ * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are
+ not totally ordered due to special case like NaN. They require special
+ handling when reading statistics. The details are documented in parquet.thrift in the
+ `ColumnOrder` union. They are summarized
Review Comment:
done.
emkornfield commented on PR #185:
URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1340395605
@gszadovszky @pitrou thanks for the review. I believe I incorporated the remaining feedback.
pitrou merged PR #185:
URL: https://github.com/apache/parquet-format/pull/185
pitrou commented on PR #185:
URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1340540614
Thanks @emkornfield !
pitrou commented on PR #185:
URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1341327375
> @pitrou thanks for merging, I'm not sure if we needed an official vote on this?
Oops, that's a good question. I might have merged too quickly. @gszadovszky What is your take on this?
gszadovszky commented on PR #185:
URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1341331959
I don't feel it would require a formal vote. In my view it was more a missing part of the spec that is made clear.
It might be worth a heads-up on the dev list, though.
emkornfield commented on PR #185:
URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1341349709
> I don't feel it would require a formal vote. In my view it was more a missing part of the spec that is made clear.
> It might be worth a heads-up on the dev list, though.
Replied to my original thread that this has been merged.
This issue has been migrated to issue #342 on GitHub. Please see the migration documentation for further details.
Moving the design alternatives to this comment so that the JIRA description focuses on the proposed solution:
We should explicitly require implementations to follow a specific comparison logic for these types. The candidates are:
An additional problem is how to deal with existing data: