Uploaded image for project: 'Parquet'
  1. Parquet
  2. PARQUET-1222

Specify a well-defined sorting order for float and double types

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • None
    • format-2.10.0
    • parquet-format
    • None

    Description

      Currently parquet-format specifies the sort order for floating point numbers as follows:

         *   FLOAT - signed comparison of the represented value
         *   DOUBLE - signed comparison of the represented value
      

      The problem is that the comparison of floating point numbers is only a partial ordering with strange behaviour in specific corner cases. For example, according to IEEE 754, -0 is neither less nor more than +0 and comparing NaN to anything always returns false. This ordering is not suitable for statistics. Additionally, the Java implementation already uses a different (total) ordering that handles these cases correctly but differently than the C++ implementations, which leads to interoperability problems.

      TypeDefinedOrder for doubles and floats should be deprecated and a new TotalFloatingPointOrder should be introduced. The default for writing doubles and floats would be the new TotalFloatingPointOrder. This ordering should be effective and easy to implement in all programming languages.

      Attachments

        Issue Links

          Activity

            zi Zoltan Ivanfi added a comment - - edited

            Moving the design alternatives to this comment so that the JIRA description focuses on the proposed solution:

            We should explicitly require implementations to follow a specific comparison logic for these types. The candidates are:

            • The Java implementation which looks easy and efficient to implement in any language.
            • The IEEE 754 totalOrder predicate which is rather complicated.
            • The IEEE 754-2008 min and max operations which may be hard to use for comparison, so components could not use the same sorting order to achieve the smallest possible min/max ranges (although a regular sort would probably result in an almost optimal value order).
            • We could simply require NaNs to be ignored for calculating min/max. However, we should also explicitly address -0/+0 values in this case, which probably leads to the option above.

            An additional problem is how to deal with existing data:

            • One possibility is to specify legacy rules, like "if the min or max is NaN, it should be ignored" or that "-0 and +0 should be considered equal for min/max purposes".
            • Another alternative is to deprecate `min_value` and `max_value` and introduce `yet_another_min` and `yet_another_max` fields instead (with nicer names, naturally). This could be combined with some legacy rules for the old field.
            • Probably the best solution would be to deprecate TypeDefinedOrder for doubles and floats and introduce a new TotalOrder. The legacy rule "if the min or max is NaN, it should be ignored" should apply to TypeDefinedOrder while the new TotalOrder would not have such restrictions. The default for writing doubles and floats would be the new TotalOrder.
            zi Zoltan Ivanfi added a comment - - edited Moving the design alternatives to this comment so that the JIRA description focuses on the proposed solution: We should explicitly require implementations to follow a specific comparison logic for these types. The candidates are: The Java implementation which looks easy and efficient to implement in any language. The IEEE 754 totalOrder predicate which is rather complicated. The IEEE 754-2008 min and max operations which may be hard to use for comparison, so components could not use the same sorting order to achieve the smallest possible min/max ranges (although a regular sort would probably result in an almost optimal value order). We could simply require NaNs to be ignored for calculating min/max. However, we should also explicitly address -0/+0 values in this case, which probably leads to the option above. An additional problem is how to deal with existing data: One possibility is to specify legacy rules, like "if the min or max is NaN, it should be ignored" or that "-0 and +0 should be considered equal for min/max purposes". Another alternative is to deprecate `min_value` and `max_value` and introduce `yet_another_min` and `yet_another_max` fields instead (with nicer names, naturally). This could be combined with some legacy rules for the old field. Probably the best solution would be to deprecate TypeDefinedOrder for doubles and floats and introduce a new TotalOrder. The legacy rule "if the min or max is NaN, it should be ignored" should apply to TypeDefinedOrder while the new TotalOrder would not have such restrictions. The default for writing doubles and floats would be the new TotalOrder.
            jbapple Jim Apple added a comment -

            I do not think the order proposed matches the IEEE 754 totalOrder. It would be good to match that order so as to avoid creating yet one more thing for users to track. Also, it is possible that the IEEE totalOrder is also faster in some hardware.

            I haven't checked, but it is possible that IEEE-754 is actually a lexicographical compare of the bit representations in some endianess or another.

            jbapple Jim Apple added a comment - I do not think the order proposed matches the IEEE 754 totalOrder. It would be good to match that order so as to avoid creating yet one more thing for users to track. Also, it is possible that the IEEE totalOrder is also faster in some hardware. I haven't checked, but it is possible that IEEE-754 is actually a lexicographical compare of the bit representations in some endianess or another.
            zi Zoltan Ivanfi added a comment -

            jbapple The proposed order was intentionally different from IEEE 754 totalOrder, because of the following rules of totalOrder:

                    3) If x and y represent the same floating-point datum:
                        i) If x and y have negative sign,
                        totalOrder(x, y) is true if and only if the exponent of x ≥ the exponent of y
                        ii) otherwise
                        totalOrder(x, y) is true if and only if the exponent of x ≤ the exponent of y.
            

            This has led me to believe that different bit patterns may correspond to the same numeric value, in which case considering them different would be problematic, as it could lead to row groups being dropped or not based on what bit pattern was used to save the data and what bit pattern is used for looking them up. However, thinking more about it, it seems to me that apart from +0/-0 and the various NaNs, all other bit patterns correspond to different numbers. Since the opposite assumption was the reason for the proposed order, I am removing it for now.

            zi Zoltan Ivanfi added a comment - jbapple The proposed order was intentionally different from IEEE 754 totalOrder, because of the following rules of totalOrder: 3) If x and y represent the same floating-point datum: i) If x and y have negative sign, totalOrder(x, y) is true if and only if the exponent of x ≥ the exponent of y ii) otherwise totalOrder(x, y) is true if and only if the exponent of x ≤ the exponent of y. This has led me to believe that different bit patterns may correspond to the same numeric value, in which case considering them different would be problematic, as it could lead to row groups being dropped or not based on what bit pattern was used to save the data and what bit pattern is used for looking them up. However, thinking more about it, it seems to me that apart from +0/-0 and the various NaNs, all other bit patterns correspond to different numbers. Since the opposite assumption was the reason for the proposed order, I am removing it for now.
            rdblue Ryan Blue added a comment -

            I think Jim is right. IEEE-754 numbers are ordered correctly if you flip the sign bit and use unsigned, byte-wise comparison. I wrote a spec for encoding HBase keys that used this a while ago.

            The reason why rule 3 works is that for normal floating point numbers, the significand must start with a 1. Conceptually, this means that 0.0001 and 0.00001 cannot be represented with the same exponent. Because the exponent basically encodes where the first set bit is in the number, it can be used for sorting. There is also support for very small numbers where the significand doesn't start with 1, but those must use the smallest-possible exponent so sorting still works.

            rdblue Ryan Blue added a comment - I think Jim is right. IEEE-754 numbers are ordered correctly if you flip the sign bit and use unsigned, byte-wise comparison. I wrote a spec for encoding HBase keys that used this a while ago. The reason why rule 3 works is that for normal floating point numbers, the significand must start with a 1. Conceptually, this means that 0.0001 and 0.00001 cannot be represented with the same exponent. Because the exponent basically encodes where the first set bit is in the number, it can be used for sorting. There is also support for very small numbers where the significand doesn't start with 1, but those must use the smallest-possible exponent so sorting still works.
            zi Zoltan Ivanfi added a comment -

            My understanding is that positive numbers are ordered like integers with the same bits, but negative numbers are ordered in the reverse order than integers with the same bits. This is similar to how decimal numbers are ordered, due to consisting of a sign and an absolute value.

            zi Zoltan Ivanfi added a comment - My understanding is that positive numbers are ordered like integers with the same bits, but negative numbers are ordered in the reverse order than integers with the same bits. This is similar to how decimal numbers are ordered, due to consisting of a sign and an absolute value.
            zi Zoltan Ivanfi added a comment -

            Having said that, I also agree that we could use this ordering. However, I think we should define it in terms of how it should behave (probably with lots of examples), instead of just pointing to the IEEE 754 specification, which is not available openly.

            zi Zoltan Ivanfi added a comment - Having said that, I also agree that we could use this ordering. However, I think we should define it in terms of how it should behave (probably with lots of examples), instead of just pointing to the IEEE 754 specification, which is not available openly.
            zi Zoltan Ivanfi added a comment -

            I updated this JIRA to distuingish it from PARQUET-1251. To summarize:

            • PARQUET-1251 is a "hotfix" that describes a workaround for handling statistics written using the ambiguous specification.
            • This JIRA is about specifying a well-defined sort order.
            zi Zoltan Ivanfi added a comment - I updated this JIRA to distuingish it from PARQUET-1251 . To summarize: PARQUET-1251 is a "hotfix" that describes a workaround for handling statistics written using the ambiguous specification. This JIRA is about specifying a well-defined sort order.
            apitrou Antoine Pitrou added a comment -

            I'll note that Parquet C++ now has the following behaviour:

            • signed zeros are properly ordered (ARROW-5562)
            • NaNs are ignored when computing min/max (PARQUET-1225); if a page or column chunk only has NaNs, the statistics are unset
            apitrou Antoine Pitrou added a comment - I'll note that Parquet C++ now has the following behaviour: signed zeros are properly ordered ( ARROW-5562 ) NaNs are ignored when computing min/max ( PARQUET-1225 ); if a page or column chunk only has NaNs, the statistics are unset

            apitrou, I guess what you've described is the write path of the statistics. Because you cannot control other writers I would suggest following the spec for the read path.
            Meanwhile, I've done some investigation in the parquet-mr code and the format and there are issues related to this topic.

            • We have created the ColumnOrder object and the related field in the format to specify the ordering of the columns and to prepare for the potential solution of this (and similar) issues. We are referencing this field in the Statistics object used for row-group level stats. Meanwhile, we do not reference this in the column indexes. So, in column indexes it is not clear what sorting orders do we want to use and how to handle cases like this. How it is implemented in parquet-cpp?
            • Based on the referenced workaround we handle the special floating point values at row-group level in parquet-mr but only for the read path. For the write path we still write these values.
            • For column indexes we handle these values but only for the write path and not for the read path.

            So, we have a couple of issues around this topic and it would be great if we would have a final and well defined solution for it.

            gszadovszky Gabor Szadovszky added a comment - apitrou , I guess what you've described is the write path of the statistics. Because you cannot control other writers I would suggest following the spec for the read path . Meanwhile, I've done some investigation in the parquet-mr code and the format and there are issues related to this topic. We have created the ColumnOrder object and the related field in the format to specify the ordering of the columns and to prepare for the potential solution of this (and similar) issues. We are referencing this field in the Statistics object used for row-group level stats. Meanwhile, we do not reference this in the column indexes. So, in column indexes it is not clear what sorting orders do we want to use and how to handle cases like this. How it is implemented in parquet-cpp? Based on the referenced workaround we handle the special floating point values at row-group level in parquet-mr but only for the read path. For the write path we still write these values. For column indexes we handle these values but only for the write path and not for the read path. So, we have a couple of issues around this topic and it would be great if we would have a final and well defined solution for it.
            apitrou Antoine Pitrou added a comment -

            Some answers after looking through the code:

            • parquet-cpp does not read nor write ColumnIndex
            • our handling of min_value and max_value on the read path is naive. We use the same comparisons regardless of whether ColumnOrder is present or not. In particular, we use native type-specific greater-or-equal comparison (e.g. floating-point comparison), which is due to fail with NaNs (but will succeed with signed zeros).
            apitrou Antoine Pitrou added a comment - Some answers after looking through the code: parquet-cpp does not read nor write ColumnIndex our handling of min_value and max_value on the read path is naive. We use the same comparisons regardless of whether ColumnOrder is present or not. In particular, we use native type-specific greater-or-equal comparison (e.g. floating-point comparison), which is due to fail with NaNs (but will succeed with signed zeros).

            I'd propose the following "fix":

            • Add a new optional bool value to the statistics struct "contains_nan". When unset, I think we specify the semantics for comparisons relative to -0.0/0.0 and NaN, etc are not well defined and implementations have taken different routes.
            • When set, if true, it means the column contains at least one NaN, when set to false it means no NaNs are present. Further when set, it implies the following ordering:
              NaNs are never included in Min/Max statistics in the struct. -0.0, +0.0, are considered two distinct values and are ordered according to sign.

            Thoughts? Should I bring this up on the mailing list?

            emkornfield Micah Kornfield added a comment - I'd propose the following "fix": Add a new optional bool value to the statistics struct "contains_nan". When unset, I think we specify the semantics for comparisons relative to -0.0/0.0 and NaN, etc are not well defined and implementations have taken different routes. When set, if true, it means the column contains at least one NaN, when set to false it means no NaNs are present. Further when set, it implies the following ordering: NaNs are never included in Min/Max statistics in the struct. -0.0, +0.0, are considered two distinct values and are ordered according to sign. Thoughts? Should I bring this up on the mailing list?

            emkornfield, I think we do not need to handle NaN values with a boolean to fix this issue. NaN is kind of similar than null values so we may even count them instead of having a boolean but this question is not tightly related to this topic.
            What do you think about elevating the current suggestion in the thrift file to specification level for writing/reading FP min/max values?

            Because the sorting order is not specified properly for floating point values (relations vs. total ordering) the following compatibility rules should be applied when reading statistics:

            • If the min is a NaN, it should be ignored.
            • If the max is a NaN, it should be ignored.
            • If the min is +0, the row group may contain -0 values as well.
            • If the max is -0, the row group may contain +0 values as well.
            • When looking for NaN values, min and max should be ignored.

            For writing we shall skip NaN values and use -0 for min and +0 for max any time when a 0 is to be taken into account.

            With this solution we cannot do anything clever in case of searching for a NaN but it can be fixed separately. And we also need to double-check whether we really ignore the min/max stats in case of searching for a NaN.

            I think it is a good idea to discuss such topics on the mailing list. However, we should also time-box the discussion and go forward with a proposed solution if there are no interests on the mailing list. (Personally, I do not follow the dev list anymore.)

            gszadovszky Gabor Szadovszky added a comment - emkornfield , I think we do not need to handle NaN values with a boolean to fix this issue. NaN is kind of similar than null values so we may even count them instead of having a boolean but this question is not tightly related to this topic. What do you think about elevating the current suggestion in the thrift file to specification level for writing/reading FP min/max values? Because the sorting order is not specified properly for floating point values (relations vs. total ordering) the following compatibility rules should be applied when reading statistics: If the min is a NaN, it should be ignored. If the max is a NaN, it should be ignored. If the min is +0, the row group may contain -0 values as well. If the max is -0, the row group may contain +0 values as well. When looking for NaN values, min and max should be ignored. For writing we shall skip NaN values and use -0 for min and +0 for max any time when a 0 is to be taken into account. With this solution we cannot do anything clever in case of searching for a NaN but it can be fixed separately. And we also need to double-check whether we really ignore the min/max stats in case of searching for a NaN. I think it is a good idea to discuss such topics on the mailing list. However, we should also time-box the discussion and go forward with a proposed solution if there are no interests on the mailing list. (Personally, I do not follow the dev list anymore.)
            apitrou Antoine Pitrou added a comment -

            (side note: the ML is mostly a firehose of notifications nowadays, which doesn't make it easy to follow...)

            apitrou Antoine Pitrou added a comment - (side note: the ML is mostly a firehose of notifications nowadays, which doesn't make it easy to follow...)
            apitrou Antoine Pitrou added a comment -

            I agree with gszadovszky for elevating these rules at the specification level.

            apitrou Antoine Pitrou added a comment - I agree with gszadovszky for elevating these rules at the specification level.

            Elevating the specification level seems fine. I was under the impression the thrift file was the specification? Where do we need to do the PR to elevate them?

            emkornfield Micah Kornfield added a comment - Elevating the specification level seems fine. I was under the impression the thrift file was the specification? Where do we need to do the PR to elevate them?

            emkornfield,

            There are a couple of docs in the parquet-format repo. The related ones are [about logical types|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] and the main one that contains the description of the primitive types. Unfortunately, the latter one does not contain anything about sorting order.
            So, I think, we need to do the following:

            • Define the sorting order for the primitive types or reference the logical types description for it. (In most cases it would be referencing since the ordering depends on the related logical types e.g. signed/unsigned sorting of integral types)
            • After defining the sorting order of the primitive floating point numbers based on what we've discussed above reference it from the new half-precision FP logical type.

            (Another unfortunate thing is that we have some specification-like docs at the parquet site as well. I think we should propagate the parquet-format docs to there automatically or simply link them from the site. But it is clearly a different topic.)

            gszadovszky Gabor Szadovszky added a comment - emkornfield , There are a couple of docs in the parquet-format repo. The related ones are [about logical types| https://github.com/apache/parquet-format/blob/master/LogicalTypes.md ] and the main one that contains the description of the primitive types . Unfortunately, the latter one does not contain anything about sorting order. So, I think, we need to do the following: Define the sorting order for the primitive types or reference the logical types description for it. (In most cases it would be referencing since the ordering depends on the related logical types e.g. signed/unsigned sorting of integral types) After defining the sorting order of the primitive floating point numbers based on what we've discussed above reference it from the new half-precision FP logical type. (Another unfortunate thing is that we have some specification-like docs at the parquet site as well. I think we should propagate the parquet-format docs to there automatically or simply link them from the site. But it is clearly a different topic.)
            githubbot ASF GitHub Bot added a comment -

            emkornfield opened a new pull request, #185:
            URL: https://github.com/apache/parquet-format/pull/185

            Make sure you have checked all steps below.

                1. Jira
                1. Commits
            • [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
              1. Subject is separated from body by a blank line
              1. Subject is limited to 50 characters (not including Jira issue reference)
              1. Subject does not end with a period
              1. Subject uses the imperative mood ("add", not "adding")
              1. Body wraps at 72 characters
              1. Body explains "what" and "why", not "how"
            githubbot ASF GitHub Bot added a comment - emkornfield opened a new pull request, #185: URL: https://github.com/apache/parquet-format/pull/185 Make sure you have checked all steps below. Jira [x ] My PR addresses the following [Parquet Jira] ( https://issues.apache.org/jira/browse/PARQUET/ ) issues and references them in the PR title. For example, " PARQUET-1234 : My Parquet PR" https://issues.apache.org/jira/browse/PARQUET-XXX In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy] ( https://www.apache.org/legal/resolved.html#category-x ). Commits [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from " [How to write a good git commit message] ( http://chris.beams.io/posts/git-commit/ )": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how"
            githubbot ASF GitHub Bot added a comment -

            emkornfield commented on PR #185:
            URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1307760898

            @pitrou @gszadovszky

            githubbot ASF GitHub Bot added a comment - emkornfield commented on PR #185: URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1307760898 @pitrou @gszadovszky
            githubbot ASF GitHub Bot added a comment -

            emkornfield commented on PR #185:
            URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1338880938

            @gszadovszky thanks for the feedback I tried to address it in the latest commit.

            githubbot ASF GitHub Bot added a comment - emkornfield commented on PR #185: URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1338880938 @gszadovszky thanks for the feedback I tried to address it in the latest commit.
            githubbot ASF GitHub Bot added a comment -

            pitrou commented on code in PR #185:
            URL: https://github.com/apache/parquet-format/pull/185#discussion_r1041067303

            ##########
            README.md:
            ##########
            @@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].

            [logical-types]: LogicalTypes.md

            +### Sort Order
            +
            +Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
            +etc). Comparison for values of a type follow the following logic:

            Review Comment:
            ```suggestion
            Parquet stores min/max statistics at several levels (such as Column Chunk,
            Column Index and Data Page). Comparison for values of a type obey the
            following rules:
            ```

            ##########
            README.md:
            ##########
            @@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].

            [logical-types]: LogicalTypes.md

            +### Sort Order
            +
            +Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
            +etc). Comparison for values of a type follow the following logic:
            +
            +1. Each logical type has a specified comparison order. If a column is
            + annotated with an unknown logical type, statistics may not be used
            + for pruning data. The sort order for logical types is documented in
            + the [LogicalTypes.md][logical-types] page.
            +2. For primitives the following sort orders apply:
            +
            + * BOOLEAN - false, true
            + * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are
            + not totally ordered due to special case like NaN. They require special
            + handling when reading statistics. The details are documented in parquet.thrift in the
            + `ColumnOrder` union. They are summarized

            Review Comment:
            Need to synchronize this with the final wording from `parquet.thrift`.

            ##########
            README.md:
            ##########
            @@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].

            [logical-types]: LogicalTypes.md

            +### Sort Order
            +
            +Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
            +etc). Comparison for values of a type follow the following logic:
            +
            +1. Each logical type has a specified comparison order. If a column is
            + annotated with an unknown logical type, statistics may not be used
            + for pruning data. The sort order for logical types is documented in
            + the [LogicalTypes.md][logical-types] page.
            +2. For primitives the following sort orders apply:

            Review Comment:
            ```suggestion
            2. For primitive types, the following rules apply:
            ```

            ##########
            src/main/thrift/parquet.thrift:
            ##########
            @@ -902,6 +902,13 @@ union ColumnOrder {

            • - If the min is +0, the row group may contain -0 values as well.
            • - If the max is -0, the row group may contain +0 values as well.
            • - When looking for NaN values, min and max should be ignored.
              + *
              + * When writing statistics the following rules should be followed:
              + * - NaNs should not be written to min or max statistics fields.
              + * - Only -0 should be written into min statistics fields (if only
              + * +0 is present in the column it should be converted to -0.0).
              + * - Only +0 should be written into a max statistics fields (if
              + * only -0 is present it must be convereted to +0).

            Review Comment:
            Suggestion to make wording clearer.
            ```suggestion

            • - If the computed max value is zero (whether negative or positive),
            • `+0.0` should be written into the max statistics field.
            • - If the computed min value is zero (whether negative or positive),
            • `-0.0` should be written into the min statistics field.
              ```

            ##########
            README.md:
            ##########
            @@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].

            [logical-types]: LogicalTypes.md

            +### Sort Order
            +
            +Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
            +etc). Comparison for values of a type follow the following logic:
            +
            +1. Each logical type has a specified comparison order. If a column is
            + annotated with an unknown logical type, statistics may not be used
            + for pruning data. The sort order for logical types is documented in
            + the [LogicalTypes.md][logical-types] page.
            +2. For primitives the following sort orders apply:
            +
            + * BOOLEAN - false, true
            + * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are

            Review Comment:
            I would suggest making a separate list item for floating-point:
            ```suggestion

            • INT32, INT64 - Signed comparison.
            • FLOAT, DOUBLE - Signed comparison with special handling of NaNs
              and signed zeros. The details are documented in...
              ```
            githubbot ASF GitHub Bot added a comment - pitrou commented on code in PR #185: URL: https://github.com/apache/parquet-format/pull/185#discussion_r1041067303 ########## README.md: ########## @@ -144,6 +144,38 @@ documented in [LogicalTypes.md] [logical-types] . [logical-types] : LogicalTypes.md +### Sort Order + +Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index, +etc). Comparison for values of a type follow the following logic: Review Comment: ```suggestion Parquet stores min/max statistics at several levels (such as Column Chunk, Column Index and Data Page). Comparison for values of a type obey the following rules: ``` ########## README.md: ########## @@ -144,6 +144,38 @@ documented in [LogicalTypes.md] [logical-types] . [logical-types] : LogicalTypes.md +### Sort Order + +Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index, +etc). Comparison for values of a type follow the following logic: + +1. Each logical type has a specified comparison order. If a column is + annotated with an unknown logical type, statistics may not be used + for pruning data. The sort order for logical types is documented in + the [LogicalTypes.md] [logical-types] page. +2. For primitives the following sort orders apply: + + * BOOLEAN - false, true + * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are + not totally ordered due to special case like NaN. They require special + handling when reading statistics. The details are documented in parquet.thrift in the + `ColumnOrder` union. They are summarized Review Comment: Need to synchronize this with the final wording from `parquet.thrift`. ########## README.md: ########## @@ -144,6 +144,38 @@ documented in [LogicalTypes.md] [logical-types] . [logical-types] : LogicalTypes.md +### Sort Order + +Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index, +etc). Comparison for values of a type follow the following logic: + +1. Each logical type has a specified comparison order. If a column is + annotated with an unknown logical type, statistics may not be used + for pruning data. The sort order for logical types is documented in + the [LogicalTypes.md] [logical-types] page. +2. For primitives the following sort orders apply: Review Comment: ```suggestion 2. For primitive types, the following rules apply: ``` ########## src/main/thrift/parquet.thrift: ########## @@ -902,6 +902,13 @@ union ColumnOrder { - If the min is +0, the row group may contain -0 values as well. - If the max is -0, the row group may contain +0 values as well. - When looking for NaN values, min and max should be ignored. + * + * When writing statistics the following rules should be followed: + * - NaNs should not be written to min or max statistics fields. + * - Only -0 should be written into min statistics fields (if only + * +0 is present in the column it should be converted to -0.0). + * - Only +0 should be written into a max statistics fields (if + * only -0 is present it must be convereted to +0). Review Comment: Suggestion to make wording clearer. ```suggestion - If the computed max value is zero (whether negative or positive), `+0.0` should be written into the max statistics field. - If the computed min value is zero (whether negative or positive), `-0.0` should be written into the min statistics field. ``` ########## README.md: ########## @@ -144,6 +144,38 @@ documented in [LogicalTypes.md] [logical-types] . [logical-types] : LogicalTypes.md +### Sort Order + +Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index, +etc). Comparison for values of a type follow the following logic: + +1. Each logical type has a specified comparison order. If a column is + annotated with an unknown logical type, statistics may not be used + for pruning data. The sort order for logical types is documented in + the [LogicalTypes.md] [logical-types] page. +2. For primitives the following sort orders apply: + + * BOOLEAN - false, true + * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are Review Comment: I would suggest making a separate list item for floating-point: ```suggestion INT32, INT64 - Signed comparison. FLOAT, DOUBLE - Signed comparison with special handling of NaNs and signed zeros. The details are documented in... ```
            githubbot ASF GitHub Bot added a comment -

            emkornfield commented on code in PR #185:
            URL: https://github.com/apache/parquet-format/pull/185#discussion_r1041775669

            ##########
            README.md:
            ##########
            @@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].

            [logical-types]: LogicalTypes.md

            +### Sort Order
            +
            +Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
            +etc). Comparison for values of a type follow the following logic:
            +
            +1. Each logical type has a specified comparison order. If a column is
            + annotated with an unknown logical type, statistics may not be used
            + for pruning data. The sort order for logical types is documented in
            + the [LogicalTypes.md][logical-types] page.
            +2. For primitives the following sort orders apply:
            +
            + * BOOLEAN - false, true
            + * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are
            + not totally ordered due to special case like NaN. They require special
            + handling when reading statistics. The details are documented in parquet.thrift in the
            + `ColumnOrder` union. They are summarized

            Review Comment:
            done.

            githubbot ASF GitHub Bot added a comment - emkornfield commented on code in PR #185: URL: https://github.com/apache/parquet-format/pull/185#discussion_r1041775669 ########## README.md: ########## @@ -144,6 +144,38 @@ documented in [LogicalTypes.md] [logical-types] . [logical-types] : LogicalTypes.md +### Sort Order + +Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index, +etc). Comparison for values of a type follow the following logic: + +1. Each logical type has a specified comparison order. If a column is + annotated with an unknown logical type, statistics may not be used + for pruning data. The sort order for logical types is documented in + the [LogicalTypes.md] [logical-types] page. +2. For primitives the following sort orders apply: + + * BOOLEAN - false, true + * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are + not totally ordered due to special case like NaN. They require special + handling when reading statistics. The details are documented in parquet.thrift in the + `ColumnOrder` union. They are summarized Review Comment: done.
            githubbot ASF GitHub Bot added a comment -

            emkornfield commented on PR #185:
            URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1340395605

            @gszadovszky @pitrou thanks for the review. I believe I incorporated the remaining feedback.

            githubbot ASF GitHub Bot added a comment - emkornfield commented on PR #185: URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1340395605 @gszadovszky @pitrou thanks for the review. I believe I incorporated the remaining feedback.
            githubbot ASF GitHub Bot added a comment - pitrou merged PR #185: URL: https://github.com/apache/parquet-format/pull/185
            githubbot ASF GitHub Bot added a comment -

            pitrou commented on PR #185:
            URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1340540614

            Thanks @emkornfield !

            githubbot ASF GitHub Bot added a comment - pitrou commented on PR #185: URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1340540614 Thanks @emkornfield !
            githubbot ASF GitHub Bot added a comment -

            pitrou commented on PR #185:
            URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1341327375

            > @pitrou thanks for merging, I'm not sure if we needed an official vote on this?

            Oops, that's a good question. I might have merged too quickly. @gszadovszky What is your take on this?

            githubbot ASF GitHub Bot added a comment - pitrou commented on PR #185: URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1341327375 > @pitrou thanks for merging, I'm not sure if we needed an official vote on this? Oops, that's a good question. I might have merged too quickly. @gszadovszky What is your take on this?
            githubbot ASF GitHub Bot added a comment -

            gszadovszky commented on PR #185:
            URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1341331959

            I don't feel it would require a formal vote. In my view it was more a missing part of the spec that is made clear.
            It might worth a heads up on the dev list, though.

            githubbot ASF GitHub Bot added a comment - gszadovszky commented on PR #185: URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1341331959 I don't feel it would require a formal vote. In my view it was more a missing part of the spec that is made clear. It might worth a heads up on the dev list, though.
            githubbot ASF GitHub Bot added a comment -

            emkornfield commented on PR #185:
            URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1341349709

            > I don't feel it would require a formal vote. In my view it was more a missing part of the spec that is made clear.
            It might worth a heads up on the dev list, though.

            Replied to my original thread that this has been merged.

            githubbot ASF GitHub Bot added a comment - emkornfield commented on PR #185: URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1341349709 > I don't feel it would require a formal vote. In my view it was more a missing part of the spec that is made clear. It might worth a heads up on the dev list, though. Replied to my original thread that this has been merged.
            rok Rok Mihevc added a comment -

            This issue has been migrated to issue #342 on GitHub. Please see the migration documentation for further details.

            rok Rok Mihevc added a comment - This issue has been migrated to issue #342 on GitHub. Please see the migration documentation for further details.

            People

              emkornfield Micah Kornfield
              zi Zoltan Ivanfi
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: