Details
- Type: New Feature
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
I often need to create tens of millions of small dataframes and save them into parquet files. All of these dataframes share the same column and index information, and they normally have the same number of rows (around 300).
Because each dataframe is so small, the parquet metadata is relatively large compared to the data itself, and repeating the same metadata tens of millions of times wastes a lot of disk space.
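To make the pattern concrete, here is a minimal sketch of what I am doing today with pandas and pyarrow (the column names, dtypes, and sizes are made up for illustration); pq.read_metadata is only used to show how big the footer is relative to each small file:

import os

import numpy as np
import pandas as pd
import pyarrow.parquet as pq

# One of the many small dataframes: identical columns, dtypes, and index layout,
# roughly 300 rows each (the names and values here are illustrative only).
def make_small_frame(seed):
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {"price": rng.random(300), "volume": rng.integers(0, 1000, 300)},
        index=pd.RangeIndex(300, name="tick"),
    )

# Today every frame gets its own parquet file, so the identical schema and
# footer metadata are repeated once per file.
for i in range(10):  # in practice: tens of millions of files
    make_small_frame(i).to_parquet(f"frame_{i}.parquet")

# The footer alone is a noticeable fraction of each small file.
meta = pq.read_metadata("frame_0.parquet")
print(os.path.getsize("frame_0.parquet"), meta.serialized_size)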
Concatenating them into one big parquet file would save the disk space, but it is not friendly to parallel processing of each small dataframe (see the sketch below).
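A rough sketch of that alternative, reusing make_small_frame from above and assuming pyarrow: each frame is appended to a single file as its own row group, so the schema and footer are stored only once, but every worker then has to open the same big file and pull out "its" row group instead of just processing its own small file:

import pyarrow as pa
import pyarrow.parquet as pq

# Append every small frame to one file, one row group per frame, so the
# schema/footer is written only once.
first = pa.Table.from_pandas(make_small_frame(0))
with pq.ParquetWriter("all_frames.parquet", first.schema) as writer:
    writer.write_table(first)
    for i in range(1, 10):
        writer.write_table(pa.Table.from_pandas(make_small_frame(i)))

# Parallel processing of frame i now means sharing one big file: each worker
# opens it and reads the row group it is responsible for.
pf = pq.ParquetFile("all_frames.parquet")
df_3 = pf.read_row_group(3).to_pandas()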
If I could save one copy of the metadata into a single file, and have the rest of the parquet files contain only the data, then the disk space would be saved while each small dataframe stays easy to process in parallel.
This seems possible by design, but I couldn't find any API supporting it.
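The closest existing thing I am aware of is the _common_metadata sidecar that pyarrow can write, but as far as I can tell it only adds a shared schema file next to the data files; each data file still embeds its own full footer, so the per-file overhead described above stays the same. A minimal sketch, reusing make_small_frame from above (what I am asking for would go further and let the data files drop that repeated footer entirely):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(make_small_frame(0))

# Existing mechanism (as far as I can tell): write the shared schema once into
# a metadata-only sidecar file that readers can pick up.
pq.write_metadata(table.schema, "_common_metadata")

# But each data file written next to it still carries its own complete footer,
# so nothing is saved on the tens of millions of small files themselves.
pq.write_table(table, "frame_0.parquet")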