Commits

Benjamin Kietzman authored ae396b9d4c2
ARROW-9782: [C++][Dataset] More configurable Dataset writing

Python:
- ParquetFileFormat.write_options has been removed
- Added classes {,Parquet,Ipc}FileWriteOptions
- FileWriteOptions are constructed using FileFormat.make_write_options(...)
- FileWriteOptions are passed as a parameter to _filesystemdataset_write()

R:
- FileWriteOptions$create(...) to make write options; no subclasses exposed in R
- A filter() on the dataset is applied to restrict written rows.

C++:
- FileSystemDataset::Write's parameters have been consolidated into
  - A Scanner, from which the batches to be written are pulled
  - A FileSystemDatasetWriteOptions, which is an options struct specifying
    - destination filesystem
    - base directory
    - partitioning
    - basenames (via a string template, ex "dat_{i}.feather")
    - format specific write options
- Format specific write options are represented using the FileWriteOptions hierarchy. An instance of these can be constructed from a format using FileFormat::DefaultWriteOptions(), after which the instance can be modified.
- ParquetFileFormat::{writer_properties, arrow_writer_properties} have been moved to ParquetFileWriteOptions, an implementation of FileWriteOptions.

Internal C++:
- Individual files can now be incrementally written using a FileWriter, constructible from a format using FileFormat::MakeWriter
- FileSystemDataset::Write now parallelizes across scan tasks rather than fragments, so there will be no difference in performance for different arrangements of tables/batches/lists of tables and batches when writing from memory
- FileSystemDataset::Write::WriteQueue provides a threadsafe channel for batches awaiting write, allowing threads to produce batches as another thread flushes the queue to disk.

Closes #8305 from bkietz/9782-more-configurable-writing

Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
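
The WriteQueue idea mentioned under "Internal C++" (threads produce batches while another thread flushes them to disk) can be illustrated with a minimal sketch. This is plain Python, not the C++ from the commit. The class name is borrowed from the message, but the methods `push` and `try_flush`, the two-lock layout, and the list standing in for a file on disk are all assumptions made for illustration, not Arrow's actual implementation or API.

```python
import threading
from collections import deque


class WriteQueue:
    """Hypothetical sketch: a thread-safe channel of batches awaiting write."""

    def __init__(self, sink):
        self._sink = sink                      # a list standing in for one output file
        self._pending = deque()                # batches pushed but not yet written
        self._push_lock = threading.Lock()     # guards _pending
        self._flush_lock = threading.Lock()    # held by the single active flusher

    def push(self, batch):
        # Any producer thread may enqueue a batch.
        with self._push_lock:
            self._pending.append(batch)

    def try_flush(self):
        # Flush only if no other thread is already flushing.
        # Returns True if this call did the flushing, False otherwise.
        if not self._flush_lock.acquire(blocking=False):
            return False
        try:
            while True:
                with self._push_lock:
                    if not self._pending:
                        return True
                    batch = self._pending.popleft()
                # Only the flush-lock holder reaches this line, so the
                # sink has a single writer at any time.
                self._sink.append(batch)
        finally:
            self._flush_lock.release()


def produce(queue, start, count):
    # Each producer pushes its batches and opportunistically flushes.
    for i in range(start, start + count):
        queue.push(i)
        queue.try_flush()


sink = []
queue = WriteQueue(sink)
threads = [
    threading.Thread(target=produce, args=(queue, k * 100, 100))
    for k in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A batch pushed while another thread was finishing its flush may still be
# pending, so one final flush is needed after all producers are done.
queue.try_flush()
written = sorted(sink)
```

In this sketch producers never block on I/O held by another thread: a producer whose `try_flush` finds the flush lock taken simply moves on to its next batch, which is why the final flush after joining the producers is required.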