[PARQUET-1539] Clarify CRC checksum in page header - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: format-2.7.0
Component/s: parquet-format
Labels:
- pull-request-available

Description

Although a page-level CRC field is defined in the Thrift specification, neither parquet-cpp nor parquet-mr currently uses it. Moreover, the comment in the Thrift specification reads ‘32bit crc for the data below’, which is ambiguous as to what exactly constitutes the ‘data’ that the checksum should be calculated on. To ensure backward and cross-implementation compatibility of Parquet readers and writers that do want to use the CRC checksums, the format should specify exactly how and on what data the checksum is calculated.

Alternatives

There are three main choices to be made here:

Which variant of CRC32 to use
Whether to include the page header itself in the checksum calculation
Whether to calculate the checksum on uncompressed or compressed data

Algorithm

The CRC field holds a 32-bit value. There are many different variants of the original CRC32 algorithm, each producing different values for the same input. For ease of implementation we propose to use the standard CRC32 algorithm.
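To illustrate why the variant has to be pinned down, here is a minimal sketch comparing two common 32-bit CRCs on the same input. It assumes that "standard CRC32" means the variant implemented by zlib and gzip (Python's `zlib.crc32`); the `crc32c` helper is a hypothetical, unoptimized bitwise implementation of the Castagnoli variant, written only for this comparison.

```python
import zlib


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.

    Illustrative helper only; not part of any Parquet implementation.
    """
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; fold in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


payload = b"123456789"  # conventional CRC check string

standard = zlib.crc32(payload) & 0xFFFFFFFF  # zlib/gzip variant
castagnoli = crc32c(payload)

print(f"CRC-32  : {standard:#010x}")    # 0xcbf43926
print(f"CRC-32C : {castagnoli:#010x}")  # 0xe3069283
```

A reader that assumes one variant would reject every page written with the other, which is why the format needs to name the algorithm explicitly.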

Including page header

The page header itself could be included in the checksum calculation using an approach similar to what TCP does, whereby the checksum field itself is zeroed out before calculating the checksum that will be inserted there. Evidently, including the page header is better in the sense that it increases the data covered by the checksum. However, from an implementation perspective, not including it is likely easier. Furthermore, given the relatively small size of the page header compared to the page itself, simply not including it will likely be good enough.
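The zero-then-fill approach described above could look roughly like the following sketch. The fixed-width header layout (`HEADER`) and the `seal`/`verify` helpers are assumptions made purely for illustration; real Parquet page headers are Thrift-encoded and variable-length, which is part of what makes this option harder to implement.

```python
import struct
import zlib

# Hypothetical fixed-width header: uncompressed size, compressed size, crc.
# Real Parquet headers are Thrift-encoded; this layout is only for illustration.
HEADER = struct.Struct("<III")


def seal(uncompressed_size: int, page: bytes) -> bytes:
    """Checksum header + page with the crc field zeroed, then fill the field in."""
    zeroed = HEADER.pack(uncompressed_size, len(page), 0)
    crc = zlib.crc32(page, zlib.crc32(zeroed))  # running CRC over header, then page
    return HEADER.pack(uncompressed_size, len(page), crc) + page


def verify(blob: bytes) -> bool:
    """Recompute the checksum with the crc field zeroed and compare."""
    uncompressed_size, compressed_size, stored = HEADER.unpack_from(blob)
    page = blob[HEADER.size:HEADER.size + compressed_size]
    zeroed = HEADER.pack(uncompressed_size, compressed_size, 0)
    return zlib.crc32(page, zlib.crc32(zeroed)) == stored


blob = seal(100, b"page bytes")
print(verify(blob))
```

With this scheme a flipped bit in either the header fields or the page body changes the recomputed value, at the cost of having to re-serialize the header with a zeroed field on both the write and the read path.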

Compressed vs uncompressed

Compressed
Pros

Inherently faster, less data to operate on
Potentially better triaging when determining where a corruption may have been introduced, as the checksum is calculated at a later stage

Cons

We have to trust both the encoding stage and the compression stage

Uncompressed
Pros

We only have to trust the encoding stage
Possibly able to detect more corruptions: because the data is checksummed at the earliest possible moment, the checksum is sensitive to corruption introduced in any later stage

Cons

Inherently slower, more data to operate on, always need to decompress first
Potentially harder triaging, more stages in which corruption could have been introduced
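The difference between the two read paths can be sketched as follows. This is an illustration only: zlib (DEFLATE) stands in for whatever codec a file actually uses, and both reader functions are hypothetical names.

```python
import zlib

raw = b"parquet page values " * 64
compressed = zlib.compress(raw)  # zlib as a stand-in codec (assumption)

crc_compressed = zlib.crc32(compressed)  # checksum over fewer bytes
crc_uncompressed = zlib.crc32(raw)       # checksum over the full page


def read_with_compressed_crc(stored: bytes, expected: int) -> bytes:
    # Verification happens before decompression, so a mismatch points at
    # storage or transport rather than at the codec.
    if zlib.crc32(stored) != expected:
        raise ValueError("page corrupted after compression")
    return zlib.decompress(stored)


def read_with_uncompressed_crc(stored: bytes, expected: int) -> bytes:
    # The page must always be decompressed first; decompression itself
    # may fail on corrupt input before the checksum is even consulted.
    data = zlib.decompress(stored)
    if zlib.crc32(data) != expected:
        raise ValueError("page corrupted somewhere after encoding")
    return data


print(len(raw), len(compressed))
```

In the compressed variant a corrupt page is rejected without spending any time in the decompressor; in the uncompressed variant a failure can originate in compression, storage, or decompression, which is the harder triaging case listed above.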

Proposal

The checksum will be calculated using the standard CRC32 algorithm. It will be calculated on the page data only, not including the page header itself (simple implementation), and on the compressed data (inherently faster, likely better triaging).
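A minimal sketch of what the proposal implies for a writer and a reader is shown below. The `PageHeader` dataclass is a hypothetical stand-in for the Thrift-generated struct, zlib stands in for the page codec, and `write_page`/`read_page` are illustrative names rather than any library's API. The crc field is treated as optional so that pages written without a checksum remain readable.

```python
import zlib
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PageHeader:
    """Hypothetical stand-in for the Thrift page header."""
    uncompressed_page_size: int
    compressed_page_size: int
    crc: Optional[int] = None  # absent in files written without checksums


def write_page(raw: bytes) -> Tuple[PageHeader, bytes]:
    compressed = zlib.compress(raw)  # zlib as a stand-in codec (assumption)
    # Checksum covers the compressed page bytes only, not the header.
    header = PageHeader(len(raw), len(compressed), zlib.crc32(compressed))
    return header, compressed


def read_page(header: PageHeader, compressed: bytes) -> bytes:
    # Verify before decompressing; skip verification when no crc was written.
    if header.crc is not None and zlib.crc32(compressed) != header.crc:
        raise ValueError("page checksum mismatch")
    return zlib.decompress(compressed)


header, body = write_page(b"some encoded page values")
print(read_page(header, body))
```

Because the header is excluded, the writer can compute the checksum once the compressed bytes exist and then serialize the header in a single pass.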

Issue Links

causes: PARQUET-2218 [Format] Clarify CRC computation (Resolved)
links to: GitHub Pull Request #126

People

Assignee: Boudewijn Braams
Reporter: Boudewijn Braams
Votes: 0
Watchers: 5

Dates

Created: 24/Feb/19 20:49
Updated: 23/Jun/24 03:30
Resolved: 05/Mar/19 13:41