Wes McKinney authored 7aefa50a440 on 06 Aug 2019
ARROW-3325: [Python][Parquet] Add "read_dictionary" argument to parquet.read_table, ParquetDataset to enable direct-to-DictionaryArray reads

I also added support to `pyarrow.table` to invoke `Table.from_arrays` if a list or tuple of arrays is passed. This makes for more natural code IMHO.

Using this option with heavily compressed data results in far less memory use and much better performance. See the example benchmarks:

https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7

Closes #4999 from wesm/ARROW-3325 and squashes the following commits:

2ca388149 <Wes McKinney> Improve docstring for read_dictionary parameter, add to ParquetDataset
ee73d7b41 <Wes McKinney> Add missing PARQUET_EXPORT
0f450d53e <Wes McKinney> Clean up FileReaderBuilder. Add simple Python docs
8e2b70b1a <Wes McKinney> Expand read_dictionary with ParquetDataset test for multiple files
7237e6958 <Wes McKinney> Fix C++ and Python unit tests
9d503516f <Wes McKinney> Read Parquet fields directly as DictionaryArray in parquet.read_table and ParquetDataset
85f9b7206 <Wes McKinney> Initial threading of read_dictionary parameter, not terribly satisfying

Authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>