
Weston Pace authored 543f33ad6e5
ARROW-13650: [C++] Create dataset writer to encapsulate dataset writer logic

This ended up being a fairly comprehensive overhaul of the existing dataset writer mechanism. ~~This PR relies on #10968 and will remain in draft until that is completed.~~

Breaking Changes:

* The dataset writer no longer works with the synchronous scanner. I don't think supporting it would be a huge change, but the current plan is to deprecate the synchronous scanner (ARROW-13338). This required changes in the Python/R/Ruby bindings which will presumably be reverted when ARROW-13338 is done.
* The default behavior is now to error if the output directory has any existing data. This can be controlled with `existing_data_behavior` (see below).
* Previously a single global counter was used for naming files. This PR changes to a counter per directory. So the following...

  ```
  /a1/b1/part-0.parquet
  /a1/b1/part-2.parquet
  /a1/b2/part-1.parquet
  ```

  ...would be impossible. Instead you would receive...

  ```
  /a1/b1/part-0.parquet
  /a1/b1/part-1.parquet
  /a1/b2/part-0.parquet
  ```

  ...This does not, however, mean that the resulting data files will be deterministic. If the data in `/a1/b1/part-0.parquet` and `/a1/b1/part-1.parquet` originated from two different files being scanned in an unordered fashion, then either part could represent either file. A number of test cases in all implementations had to change because the expected paths for dataset writes changed.

New features:

* The dataset writer now works with the async scanner (ARROW-12803).
* The dataset writer now respects backpressure (closes ARROW-2628?, related to but does not fully solve ARROW-13590 and ARROW-13611) and will stop pulling from the scanner when `max_rows_queued` (provided as an argument to `DatasetWriter`) is exceeded. By default `max_rows_queued` is 64M. This is not an "option" as I don't think it should be exposed to the user; that would be offering too many knobs. I think eventually we may want to wrap up all backpressure into a single configurable setting.
* `FileSystemDatasetWriteOptions` now has a `max_rows_per_file` setting (ARROW-10439).
* `FileSystemDatasetWriteOptions` now has a `max_open_files` setting (ARROW-12321) which prevents opening too many files. Instead the writer will apply backpressure on the scanner while also closing the open file with the greatest number of rows already written (then resume writing once the file is closed).
* `FileSystemDatasetWriteOptions` now has an `existing_data_behavior` setting (ARROW-12358, ARROW-7706) which controls what to do if there is data in the destination.

Deferred for future work:

* Add the new options to the Python/R APIs (ARROW-13703)
* Limiting based on file size (ARROW-10439)
* More fine-grained error control (ARROW-14175)

Closes #10955 from westonpace/feature/ARROW-13542--c-compute-dataset-add-dataset-writenode-for

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
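The per-directory counter described above can be sketched as follows. This is a minimal illustration of the naming scheme, not the actual dataset writer code; the function and variable names are invented for this example.

```cpp
#include <map>
#include <string>

// Hypothetical sketch: each output directory gets its own counter, so the
// "part-{i}" index restarts at 0 in every partition directory instead of
// being drawn from one global counter.
std::map<std::string, int> counters;

std::string NextFilename(const std::string& dir) {
  // operator[] value-initializes a new entry to 0 on first use
  int index = counters[dir]++;
  return dir + "/part-" + std::to_string(index) + ".parquet";
}
```

Calling `NextFilename("/a1/b1")` twice and then `NextFilename("/a1/b2")` yields `part-0`, `part-1`, and `part-0` respectively, matching the second listing above.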
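The `max_rows_queued` backpressure check reduces to a simple threshold test; a minimal sketch is below. The constant interprets "64M" as 64 * 1024 * 1024 rows, which is an assumption, and `ShouldPause` is an illustrative name, not the real API.

```cpp
#include <cstdint>

// Assumed reading of the "64M" default mentioned in the description.
constexpr int64_t kDefaultMaxRowsQueued = int64_t{64} * 1024 * 1024;

// Hypothetical sketch: the writer stops pulling batches from the scanner
// once the rows queued for writing exceed the threshold, and resumes once
// enough queued rows have been flushed to disk.
bool ShouldPause(int64_t rows_queued,
                 int64_t max_rows_queued = kDefaultMaxRowsQueued) {
  return rows_queued > max_rows_queued;
}
```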
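The `max_open_files` eviction rule ("close the open file with the greatest number of rows already written") can be sketched as a selection over the currently open files. The bookkeeping structure and function name here are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch: rows_written maps each currently-open file path to
// the number of rows written to it so far. When opening another file would
// exceed max_open_files, pick the file with the most rows to close first,
// since it is the one most likely to be nearly full.
std::string FileToClose(const std::map<std::string, int64_t>& rows_written) {
  // Precondition: rows_written is non-empty.
  auto it = std::max_element(
      rows_written.begin(), rows_written.end(),
      [](const auto& a, const auto& b) { return a.second < b.second; });
  return it->first;
}
```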
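The `existing_data_behavior` setting can be modeled as a small enum gating whether a write into a non-empty destination may proceed. The enumerator and function names below are assumptions for illustration and may not match the exact C++ API.

```cpp
// Hypothetical sketch of the existing_data_behavior choices: fail on any
// existing data (the new default), write alongside/over existing files, or
// delete matching partition directories before writing into them.
enum class ExistingDataBehavior {
  kError,
  kOverwriteOrIgnore,
  kDeleteMatchingPartitions
};

bool MayProceed(ExistingDataBehavior behavior, bool destination_has_data) {
  // Only kError refuses outright; the other modes handle existing data
  // during the write itself.
  return !destination_has_data || behavior != ExistingDataBehavior::kError;
}
```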