Commits

Wes McKinney authored commit 9d532c49d56
ARROW-539: [Python] Add support for reading partitioned Parquet files with Hive-like directory schemes

I probably didn't get all the use cases, but this should be a good start. First, the directory structure is walked to determine the distinct partition keys. These keys are later used as the dictionaries for the `arrow::DictionaryArray` objects that are constructed. I also created the `ParquetDatasetPiece` class to enable distributed processing of file components in frameworks like Dask. We may need to address pickling of the `ParquetPartitions` object (which must be passed to `ParquetDatasetPiece.read` so the right array metadata can be constructed).

Author: Wes McKinney <wes.mckinney@twosigma.com>
Author: Miki Tebeka <miki.tebeka@gmail.com>

Closes #529 from wesm/ARROW-539 and squashes the following commits:

a0451fa [Wes McKinney] Code review comments
deb6d82 [Wes McKinney] Don't make file-like Python object on LocalFilesystem
04dc691 [Wes McKinney] Complete initial partitioned reads, supporting unit tests. Expose arrow::Table::AddColumn
7d33755 [Wes McKinney] Untested draft of ParquetManifest for partitioned directory structures. Get test suite passing again
ba8825f [Wes McKinney] Prototyping
18fe639 [Miki Tebeka] [ARROW-539] [Python] Support reading Parquet datasets with standard partition directory schemes
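For context, here is a minimal sketch of the reading path this commit introduces, assuming the `ParquetDataset`/`ParquetDatasetPiece` API described above; the directory layout, path, and partition column names are hypothetical and not taken from the commit:

```python
import pyarrow.parquet as pq

# Hypothetical Hive-style layout created elsewhere:
#   /data/events/year=2016/month=01/part-0.parquet
#   /data/events/year=2016/month=02/part-0.parquet
#
# The directory tree is walked to find the distinct partition keys
# (year, month); each key set becomes the dictionary backing a
# DictionaryArray column in the assembled table.
dataset = pq.ParquetDataset('/data/events')
table = dataset.read()

# Pieces map to individual files and can be handed out to workers
# (e.g. in Dask). Each piece needs the shared ParquetPartitions
# metadata to reconstruct its partition-key columns.
for piece in dataset.pieces:
    part = piece.read(partitions=dataset.partitions)
```

Shipping pieces to remote workers in this way is exactly why pickling of `ParquetPartitions` may need to be addressed, as the message above notes.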