
Weston Pace authored 543f33ad6e5
ARROW-13650: [C++] Create dataset writer to encapsulate dataset writer logic

This ended up being a fairly comprehensive overhaul of the existing dataset writer mechanism. ~~This PR relies on #10968 and will remain in draft until that is completed.~~

Breaking Changes:

* The dataset writer no longer works with the synchronous scanner. I don't think supporting it would be a huge change, but the current plan is to deprecate the synchronous scanner (ARROW-13338). This required changes in the Python/R/Ruby bindings which will presumably be reverted when ARROW-13338 is done.
* The default behavior is now to error if the output directory has any existing data. This can be controlled with `existing_data_behavior` (see below).
* Previously a single global counter was used for naming files. This PR changes to a counter per directory. So the following...

  ```
  /a1/b1/part-0.parquet
  /a1/b1/part-2.parquet
  /a1/b2/part-1.parquet
  ```

  ...would be impossible. Instead you would receive...

  ```
  /a1/b1/part-0.parquet
  /a1/b1/part-1.parquet
  /a1/b2/part-0.parquet
  ```

  ...This does not, however, mean that the resulting data files will be deterministic. If the data in `/a1/b1/part-0.parquet` and `/a1/b1/part-1.parquet` originated from two different files being scanned in an unordered fashion, then either part could represent either file. A number of test cases in all implementations had to change because the expected paths for dataset writes changed.

New features:

* The dataset writer now works with the async scanner (ARROW-12803).
* The dataset writer now respects backpressure (closes ARROW-2628?, related to but does not fully solve ARROW-13590 and ARROW-13611) and will stop pulling from the scanner when `max_rows_queued` (provided as an argument to `DatasetWriter`) is exceeded. By default `max_rows_queued` is 64M. This is not an "option" as I don't think it should be exposed to the user; that would be offering too many knobs. I think eventually we may want to wrap up all backpressure into a single configurable setting.
* `FileSystemDatasetWriteOptions` now has a `max_rows_per_file` setting (ARROW-10439).
* `FileSystemDatasetWriteOptions` now has a `max_open_files` setting (ARROW-12321) which prevents opening too many files. Instead the writer will apply backpressure on the scanner while also closing the open file with the greatest number of rows already written (then resume writing once the file is closed).
* `FileSystemDatasetWriteOptions` now has an `existing_data_behavior` setting (ARROW-12358, ARROW-7706) which controls what to do if there is data in the destination.

Deferred for future work:

* Add the new options to the Python/R APIs (ARROW-13703)
* Limiting based on file size (ARROW-10439)
* More fine-grained error control (ARROW-14175)

Closes #10955 from westonpace/feature/ARROW-13542--c-compute-dataset-add-dataset-writenode-for

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
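The per-directory counter described above can be sketched as follows. This is a minimal illustration of the naming scheme, not the actual dataset writer code; the function and variable names are invented for this example.

```cpp
#include <map>
#include <string>

// Hypothetical sketch: each output directory gets its own counter, so the
// "part-{i}" index restarts at 0 in every partition directory instead of
// being drawn from one global counter.
std::map<std::string, int> counters;

std::string NextFilename(const std::string& dir) {
  // operator[] value-initializes a new entry to 0 on first use
  int index = counters[dir]++;
  return dir + "/part-" + std::to_string(index) + ".parquet";
}
```

Calling `NextFilename("/a1/b1")` twice and then `NextFilename("/a1/b2")` yields `part-0`, `part-1`, and `part-0` respectively, matching the second listing above.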
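The `max_rows_queued` backpressure check reduces to a simple threshold test; a minimal sketch is below. The constant interprets "64M" as 64 * 1024 * 1024 rows, which is an assumption, and `ShouldPause` is an illustrative name, not the real API.

```cpp
#include <cstdint>

// Assumed reading of the "64M" default mentioned in the description.
constexpr int64_t kDefaultMaxRowsQueued = int64_t{64} * 1024 * 1024;

// Hypothetical sketch: the writer stops pulling batches from the scanner
// once the rows queued for writing exceed the threshold, and resumes once
// enough queued rows have been flushed to disk.
bool ShouldPause(int64_t rows_queued,
                 int64_t max_rows_queued = kDefaultMaxRowsQueued) {
  return rows_queued > max_rows_queued;
}
```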
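The `max_open_files` eviction rule ("close the open file with the greatest number of rows already written") can be sketched as a selection over the currently open files. The bookkeeping structure and function name here are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch: rows_written maps each currently-open file path to
// the number of rows written to it so far. When opening another file would
// exceed max_open_files, pick the file with the most rows to close first,
// since it is the one most likely to be nearly full.
std::string FileToClose(const std::map<std::string, int64_t>& rows_written) {
  // Precondition: rows_written is non-empty.
  auto it = std::max_element(
      rows_written.begin(), rows_written.end(),
      [](const auto& a, const auto& b) { return a.second < b.second; });
  return it->first;
}
```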
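The `existing_data_behavior` setting can be modeled as a small enum gating whether a write into a non-empty destination may proceed. The enumerator and function names below are assumptions for illustration and may not match the exact C++ API.

```cpp
// Hypothetical sketch of the existing_data_behavior choices: fail on any
// existing data (the new default), write alongside/over existing files, or
// delete matching partition directories before writing into them.
enum class ExistingDataBehavior {
  kError,
  kOverwriteOrIgnore,
  kDeleteMatchingPartitions
};

bool MayProceed(ExistingDataBehavior behavior, bool destination_has_data) {
  // Only kError refuses outright; the other modes handle existing data
  // during the write itself.
  return !destination_has_data || behavior != ExistingDataBehavior::kError;
}
```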