Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
At the moment, from Python you can write a dataset with ds.write_dataset, for example starting from a list of record batches.
But this currently needs to be an actual list (or gets converted to one), so an iterator or generator gets fully consumed (potentially bringing all record batches into memory) before writing starts.
We should also be able to use the Python iterator itself to back a RecordBatchIterator-like object that can be consumed while writing the batches.
We already have an arrow::py::PyRecordBatchReader that might be useful here.
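To illustrate the desired behavior, here is a minimal sketch of streaming batches from a Python generator into write_dataset by wrapping it in a RecordBatchReader. The schema, the generator, and the assumption that write_dataset accepts a reader directly (rather than only a list or table) are illustrative; exact names and support depend on the pyarrow version.

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical schema and generator, used only for illustration.
schema = pa.schema([("x", pa.int64()), ("y", pa.float64())])

def batch_generator(n_batches=10, batch_size=100):
    # Each batch is produced lazily; nothing is materialized up front.
    for i in range(n_batches):
        yield pa.record_batch(
            [
                pa.array(range(i * batch_size, (i + 1) * batch_size)),
                pa.array([float(v) for v in range(batch_size)]),
            ],
            schema=schema,
        )

# Wrap the generator in a RecordBatchReader so it can be consumed
# one batch at a time instead of being converted to a list.
reader = pa.RecordBatchReader.from_batches(schema, batch_generator())

# Desired usage: write_dataset pulls batches from the reader while
# writing, keeping only one batch in memory at a time.
ds.write_dataset(reader, "streamed_dataset", format="parquet")
```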
Issue Links
- is related to: ARROW-12231 [C++][Dataset] Separate datasets backed by readers from InMemoryDataset (Resolved)