Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
At the moment, from Python you can write a dataset with ds.write_dataset, for example starting from a list of record batches.
But this currently needs to be an actual list (or gets converted to one), so an iterator or generator gets fully consumed (potentially bringing all record batches into memory) before writing starts.
We should also be able to use the Python iterator itself to back a RecordBatchIterator-like object that can be consumed while writing the batches.
We already have an arrow::py::PyRecordBatchReader that might be useful here.
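To illustrate the desired behavior, here is a minimal sketch of streaming batches from a Python generator into write_dataset by wrapping it in a RecordBatchReader. The schema, the generator, and the assumption that write_dataset accepts a reader directly (rather than only a list or table) are illustrative; exact names and support depend on the pyarrow version.

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical schema and generator, used only for illustration.
schema = pa.schema([("x", pa.int64()), ("y", pa.float64())])

def batch_generator(n_batches=10, batch_size=100):
    # Each batch is produced lazily; nothing is materialized up front.
    for i in range(n_batches):
        yield pa.record_batch(
            [
                pa.array(range(i * batch_size, (i + 1) * batch_size)),
                pa.array([float(v) for v in range(batch_size)]),
            ],
            schema=schema,
        )

# Wrap the generator in a RecordBatchReader so it can be consumed
# one batch at a time instead of being converted to a list.
reader = pa.RecordBatchReader.from_batches(schema, batch_generator())

# Desired usage: write_dataset pulls batches from the reader while
# writing, keeping only one batch in memory at a time.
ds.write_dataset(reader, "streamed_dataset", format="parquet")
```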
Issue Links
- is related to: ARROW-12231 [C++][Dataset] Separate datasets backed by readers from InMemoryDataset (Resolved)