Commits


Benjamin Kietzman authored and Joris Van den Bossche committed 868777d0417
ARROW-10131: [C++][Dataset][Python] Lazily parse parquet metadata

ParquetFileFragment now constructs `Expression`s from its statistics lazily; the min/max expression for a column is materialized only when a predicate references that column.

Additional changes:
- ParquetFileFragment now simply stores a parquet::FileMetaData, which is loaded opportunistically whenever IO becomes unavoidable.
- In Python, accessing any RowGroup or file-level property of ParquetFileFragment forces the metadata to load (so, for example, `ParquetFileFragment.num_row_groups` no longer yields -1 to indicate that metadata has not been loaded).
- RowGroupInfo has been removed from C++. A Python-only replacement remains for compatibility, but we might want to remove that as well.
- Reduced path manipulation in ParquetDatasetFactory, including skipping validation of ColumnChunk paths by default.
- ParquetScanTaskIterator has been removed.
- Added FileMetaData::Subset, which returns a FileMetaData wrapping only a subset of row groups.
- Added native equality comparison between FileMetaData, RowGroupMetaData, ColumnChunkMetaData, and Statistics (ARROW-4970).

Closes #8507 from bkietz/10131-Lazily-parse-parquet-meta

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
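
The lazy behaviour described in this commit can be illustrated with a minimal, self-contained sketch. This is not the Arrow implementation and uses none of its API: `LazyFragment`, `load_metadata`, `bounds`, and `may_contain` are hypothetical names chosen for illustration, and statistics are assumed to be simple integer (min, max) pairs per column. The sketch shows only the pattern: metadata is read on first need, and a column's bounds are materialized only when a predicate refers to that column.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

# Per-column (min, max) statistics; an assumed, simplified shape.
Stats = Dict[str, Tuple[int, int]]


@dataclass
class LazyFragment:
    """Hypothetical stand-in for a file fragment; not the Arrow API."""

    # Callable performing the (expensive) metadata read.
    load_metadata: Callable[[], Stats]
    _metadata: Optional[Stats] = field(default=None, init=False)
    _bounds: Stats = field(default_factory=dict, init=False)
    loads: int = field(default=0, init=False)  # counts metadata reads

    def ensure_metadata(self) -> Stats:
        # Read metadata only the first time it is actually needed.
        if self._metadata is None:
            self._metadata = self.load_metadata()
            self.loads += 1
        return self._metadata

    def bounds(self, column: str) -> Tuple[int, int]:
        # Materialize a column's min/max only when it is referenced.
        if column not in self._bounds:
            self._bounds[column] = self.ensure_metadata()[column]
        return self._bounds[column]

    def may_contain(self, column: str, value: int) -> bool:
        # A fragment can be skipped when the value lies outside [min, max].
        lo, hi = self.bounds(column)
        return lo <= value <= hi
```

In this sketch, constructing a `LazyFragment` performs no read; the first call to `may_contain("x", ...)` triggers one metadata load and materializes bounds for `x` only, and later calls reuse the cached values.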