Commits

Wes McKinney authored commit 9d532c49d56
ARROW-539: [Python] Add support for reading partitioned Parquet files with Hive-like directory schemes

I probably didn't get all the use cases, but this should be a good start. First, the directory structure is walked to determine the distinct partition keys. These keys are later used as the dictionaries for the `arrow::DictionaryArray` objects that are constructed. I also created the `ParquetDatasetPiece` class to enable distributed processing of file components in frameworks like Dask. We may need to address pickling of the `ParquetPartitions` object (which must be passed to `ParquetDatasetPiece.read` so the right array metadata can be constructed).

Author: Wes McKinney <wes.mckinney@twosigma.com>
Author: Miki Tebeka <miki.tebeka@gmail.com>

Closes #529 from wesm/ARROW-539 and squashes the following commits:

a0451fa [Wes McKinney] Code review comments
deb6d82 [Wes McKinney] Don't make file-like Python object on LocalFilesystem
04dc691 [Wes McKinney] Complete initial partitioned reads, supporting unit tests. Expose arrow::Table::AddColumn
7d33755 [Wes McKinney] Untested draft of ParquetManifest for partitioned directory structures. Get test suite passing again
ba8825f [Wes McKinney] Prototyping
18fe639 [Miki Tebeka] [ARROW-539] [Python] Support reading Parquet datasets with standard partition directory schemes
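For context, here is a minimal sketch of the reading path this commit introduces, assuming the `ParquetDataset`/`ParquetDatasetPiece` API described above; the directory layout, path, and partition column names are hypothetical and not taken from the commit:

```python
import pyarrow.parquet as pq

# Hypothetical Hive-style layout created elsewhere:
#   /data/events/year=2016/month=01/part-0.parquet
#   /data/events/year=2016/month=02/part-0.parquet
#
# The directory tree is walked to find the distinct partition keys
# (year, month); each key set becomes the dictionary backing a
# DictionaryArray column in the assembled table.
dataset = pq.ParquetDataset('/data/events')
table = dataset.read()

# Pieces map to individual files and can be handed out to workers
# (e.g. in Dask). Each piece needs the shared ParquetPartitions
# metadata to reconstruct its partition-key columns.
for piece in dataset.pieces:
    part = piece.read(partitions=dataset.partitions)
```

Shipping pieces to remote workers in this way is exactly why pickling of `ParquetPartitions` may need to be addressed, as the message above notes.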