
Wes McKinney authored 2ba0566b293
ARROW-3246: [C++][Python][Parquet] Direct writing of DictionaryArray to Parquet columns, automatic decoding to Arrow

There are a lot of interconnected pieces in this patch, so let me try to unpack them:

* Refactor `TypedColumnWriterImpl::WriteBatch/WriteBatchSpaced` to utilize more common code and be more readable
* Add `TypedEncoder<T>::Put(const arrow::Array&)` and implement it for BYTE_ARRAY, to avoid having to first create `ByteArray*` as before. This should improve write performance for regular binary data -- I will run some benchmarks to measure by how much
* Add `TypedStatistics<T>::Update(const arrow::Array&)` and implement it for BYTE_ARRAY. This is necessary to update the statistics from directly-inserted Arrow data, without serialization
* Implement `PutDictionary` and `PutIndices` methods on `DictEncoder` (see the encoder sketch below). `PutDictionary` is only implemented for BYTE_ARRAY but can be easily generalized to more types (we should open a follow-up JIRA for this)
* Implement the internal `TypedColumnWriterImpl::WriteArrowDictionary`, which writes dictionary values and indices directly into a `DictEncoder`. This circumvents the dictionary page size checks, so we keep calling `PutIndices` until a new dictionary is encountered or some non-dictionary data is written. Note that in master, dictionary encoding is turned off as soon as a threshold on dictionary size is reached (1MB by default). So if you want to preserve the original dictionary values exactly (e.g. when roundtripping a DictionaryArray, an R factor, or a pandas.Categorical), we have to step around this threshold check in this narrow case
* Add the `ArrowWriterProperties::store_schema()` option, which stores the Arrow schema used to create a Parquet file in a special `ARROW:schema` key in the file metadata, so that we can detect that a column was originally a DictionaryArray (see the roundtrip sketch below). This option is off by default, but enabled in the Python bindings. We can always make it the default in the future

I think that's most things. One end result of this is that `arrow::DictionaryArray` types from C++ and `pandas.Categorical` types coerced from pandas, with string dictionary values, will be accurately preserved end-to-end. With a little more work (which I think can be done in a follow-up PR) we can support the rest of the Parquet physical types.

This was a fairly ugly project and I've doubtless left some messes around that we should clean up, perhaps in follow-up patches. I'll post some benchmarks later to assess the improvements in read and write performance. In the case of writing dictionary-encoded strings, the boost should be significant.
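To make the end-to-end behavior concrete, here is a minimal C++ sketch of the roundtrip (written against a recent Arrow/Parquet C++ API, so exact signatures may differ from the tree this patch targets; the file name `dict.parquet` and column name `f` are arbitrary):

```cpp
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

arrow::Status RoundtripDictionary() {
  // Build dictionary<values=string, indices=int32> data by hand.
  arrow::StringBuilder dict_builder;
  ARROW_RETURN_NOT_OK(dict_builder.AppendValues({"apple", "banana", "cherry"}));
  std::shared_ptr<arrow::Array> dict_values;
  ARROW_RETURN_NOT_OK(dict_builder.Finish(&dict_values));

  arrow::Int32Builder index_builder;
  ARROW_RETURN_NOT_OK(index_builder.AppendValues({0, 1, 0, 2, 1}));
  std::shared_ptr<arrow::Array> indices;
  ARROW_RETURN_NOT_OK(index_builder.Finish(&indices));

  ARROW_ASSIGN_OR_RAISE(
      std::shared_ptr<arrow::Array> dict_array,
      arrow::DictionaryArray::FromArrays(
          arrow::dictionary(arrow::int32(), arrow::utf8()), indices, dict_values));

  auto table = arrow::Table::Make(
      arrow::schema({arrow::field("f", dict_array->type())}), {dict_array});

  // store_schema() embeds the Arrow schema under the ARROW:schema metadata key.
  auto arrow_props =
      parquet::ArrowWriterProperties::Builder().store_schema()->build();
  ARROW_ASSIGN_OR_RAISE(auto sink,
                        arrow::io::FileOutputStream::Open("dict.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), sink, /*chunk_size=*/1024,
      parquet::default_writer_properties(), arrow_props));
  ARROW_RETURN_NOT_OK(sink->Close());

  // Because the schema was stored, the reader restores the dictionary type
  // automatically instead of returning a dense string column.
  ARROW_ASSIGN_OR_RAISE(auto infile,
                        arrow::io::ReadableFile::Open("dict.parquet"));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> roundtripped;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&roundtripped));
  // roundtripped->schema()->field(0)->type() is dictionary<string, int32> again.
  return arrow::Status::OK();
}
```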
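A quick way to verify that the schema was actually stored is to inspect the file-level key/value metadata with the low-level reader (again a sketch; `Contains` is on `arrow::KeyValueMetadata` as I understand the current API):

```cpp
#include <memory>
#include <string>
#include <parquet/file_reader.h>

// Returns true if the file carries the serialized Arrow schema written by
// ArrowWriterProperties::store_schema().
bool HasStoredArrowSchema(const std::string& path) {
  std::unique_ptr<parquet::ParquetFileReader> raw =
      parquet::ParquetFileReader::OpenFile(path);
  auto kv = raw->metadata()->key_value_metadata();
  return kv != nullptr && kv->Contains("ARROW:schema");
}
```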
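For the curious, the new `DictEncoder` entry points can also be driven directly. A minimal sketch, assuming a `ColumnDescriptor*` for a BYTE_ARRAY column is available from elsewhere and that `MakeTypedEncoder` keeps its current shape in `parquet/encoding.h` -- this is internal API, so treat the details as illustrative:

```cpp
#include <arrow/array.h>
#include <parquet/encoding.h>

// Feed a dictionary and its indices straight into a dictionary encoder,
// mirroring what TypedColumnWriterImpl::WriteArrowDictionary does internally.
void DirectDictionaryPut(const parquet::ColumnDescriptor* descr,
                         const arrow::Array& dictionary,  // e.g. a StringArray
                         const arrow::Array& indices) {   // e.g. an Int32Array
  auto base = parquet::MakeTypedEncoder<parquet::ByteArrayType>(
      parquet::Encoding::PLAIN, /*use_dictionary=*/true, descr);
  auto* encoder =
      dynamic_cast<parquet::DictEncoder<parquet::ByteArrayType>*>(base.get());
  encoder->PutDictionary(dictionary);  // write the dictionary values once
  encoder->PutIndices(indices);        // append RLE-encoded indices
}
```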
Closes #5077 from wesm/ARROW-3246 and squashes the following commits:

6b1769cb1 <Wes McKinney> Restore statistics aliases
ad3bad34a <Wes McKinney> Address code review comments
7f3a2a89f <Wes McKinney> Fix another KeyValueMetadata factory
494b954ac <Wes McKinney> Use other KeyValueMetadata factory function to hopefully appease MinGW
92cf4e063 <Wes McKinney> Code review feedback
91555b645 <Wes McKinney> Fix another new MSVC warning
3f45fef0a <Wes McKinney> Check more random seeds, fix warnings
8f4cd4463 <Wes McKinney> Fix DecodeArrow bug which only occurred when there are nulls at the end of the data
5dc00b1ff <Wes McKinney> Fix MSVC compilation warnings
3425da4f2 <Wes McKinney> Revert change causing ASAN failure
f26d7da80 <Wes McKinney> Fix up Python unit tests given schema serialization
7d663d5ac <Wes McKinney> Store schema when writing from Python, add unit test to exhibit direct dictionary reads
7705fdbc8 <Wes McKinney> Automatically read dictionary fields by serializing the Arrow schema with the store_schema option
580a0ca9c <Wes McKinney> Add failing unit test for Arrow store schema option
28268d624 <Wes McKinney> Add unit test for writing changing dictionaries
de9d0a5ae <Wes McKinney> Fix null dictionary test, unit tests passing again
138053158 <Wes McKinney> Closer to full dictionary write, NA test failing
0a293ee27 <Wes McKinney> More scaffolding
8cc1bcfa9 <Wes McKinney> Unit test for PutDictionary, PutIndices
5aaf2817b <Wes McKinney> Temp
d21ebd852 <Wes McKinney> Get all direct put unit tests passing
edc9f8473 <Wes McKinney> Fix unit tests
57a45e0ce <Wes McKinney> Direct binary put works
1264e10cc <Wes McKinney> More direct encoding implementation stubs
882e4341f <Wes McKinney> TypedComparator/TypedStatistics augmentations for arrow::BinaryArray
245f44579 <Wes McKinney> ByteArray statistics specializations
c4d7dc279 <Wes McKinney> Refactor and add Arrow encoder stubs
c871ea971 <Wes McKinney> Refactor WriteBatch/WriteBatchSpaced to utilize helper functions

Authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>