Commits

François Saint-Jacques authored 6716bbd25ea
ARROW-8062: [C++][Dataset] Implement ParquetDatasetFactory

This patch adds the option to create a dataset of Parquet files via `ParquetDatasetFactory`. It reads a single `_metadata` Parquet file created by systems like Dask and Spark, extracts the metadata of all fragments from that file, and populates each fragment with statistics for each of its columns. The `_metadata` file can be created via `pyarrow.parquet.write_metadata` (see the usage sketch below).

When the Scan operation is materialised, the row groups of each ParquetFileFragment are pruned using these statistics _before_ the original file metadata is read. If no row group in a file matches the Scan's predicate, the file is not read at all (not even its metadata footer), avoiding expensive IO calls. The benefit of the optimisation is inversely proportional to the predicate's selectivity.

```python
# With the plain FileSystemDataset
%timeit t = nyc_tlc_fs_dataset.to_table(filter=da.field('total_amount') > 1000.0, ...)
1.55 s ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# With ParquetDatasetFactory
%timeit t = nyc_tlc_parquet_dataset.to_table(filter=da.field('total_amount') > 1000.0, ...)
336 ms ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

- Implement ParquetDatasetFactory.
- Replace ParquetFileFormat::GetRowGroupFragments with ParquetFileFragment::SplitByRowGroup (and the corresponding bindings).
- Add various optimizations, notably in ColumnChunkStatisticsAsExpression.
- Consolidate RowGroupSkipper logic in ParquetFileFragment::ScanFile.
- Ensure FileMetaData::AppendRowGroups checks for schema equality.
- Implement dataset._parquet_dataset.

Closes #7180 from fsaintjacques/ARROW-8062-parquet-dataset-metadata

Lead-authored-by: François Saint-Jacques <fsaintjacques@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: François Saint-Jacques <fsaintjacques@gmail.com>
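For reference, a minimal end-to-end sketch of the workflow described above. The table contents and the `nyc_tlc` directory name are illustrative only, and the sketch uses the public `pyarrow.parquet.write_to_dataset` / `pyarrow.parquet.write_metadata` helpers plus the `pyarrow.dataset.parquet_dataset` factory as exposed in released pyarrow (this commit itself adds the underlying `dataset._parquet_dataset` binding):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Illustrative table; any Parquet dataset works the same way.
table = pa.table({"total_amount": [12.5, 2500.0, 8.0],
                  "passenger_count": [1, 2, 1]})

# Write the dataset, collecting the footer metadata of every file written.
metadata_collector = []
pq.write_to_dataset(table, root_path="nyc_tlc",
                    metadata_collector=metadata_collector)

# Write the summary `_metadata` file from the collected footers.
pq.write_metadata(table.schema, "nyc_tlc/_metadata",
                  metadata_collector=metadata_collector)

# Build the dataset from `_metadata` alone (ParquetDatasetFactory under the
# hood); fragments already carry per-row-group statistics.
dataset = ds.parquet_dataset("nyc_tlc/_metadata")

# Files whose row-group statistics rule out the predicate are never opened.
print(dataset.to_table(filter=ds.field("total_amount") > 1000.0))
```

Because the factory reads only `_metadata`, constructing the dataset and pruning row groups never touches the data files' own footers, which is where the speedup in the timings above comes from.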