
Wes McKinney authored 2ba0566b293
ARROW-3246: [C++][Python][Parquet] Direct writing of DictionaryArray to Parquet columns, automatic decoding to Arrow

There are a lot of interconnected pieces in this patch, so let me try to unpack them:

* Refactor `TypedColumnWriterImpl::WriteBatch/WriteBatchSpaced` to utilize more common code and be more readable
* Add `TypedEncoder<T>::Put(const arrow::Array&)` and implement it for BYTE_ARRAY, to avoid having to first create `ByteArray*` as before. This should improve write performance for regular binary data -- I will run some benchmarks to measure by how much
* Add `TypedStatistics<T>::Update(const arrow::Array&)` and implement it for BYTE_ARRAY. This is necessary to update the statistics from directly-inserted Arrow data, without serialization
* Implement `PutDictionary` and `PutIndices` methods on `DictEncoder` (see the encoder sketch below). `PutDictionary` is only implemented for BYTE_ARRAY but can be easily generalized to more types (we should open a follow-up JIRA for this)
* Implement the internal `TypedColumnWriterImpl::WriteArrowDictionary`, which writes dictionary values and indices directly into a `DictEncoder`. This circumvents the dictionary page size checks, so we keep calling `PutIndices` until a new dictionary is encountered or some non-dictionary data is written. Note that in master, dictionary encoding is turned off as soon as a threshold on dictionary size is reached (1MB by default). So if you want to preserve the original dictionary values exactly (e.g. when roundtripping a DictionaryArray, an R factor, or a pandas.Categorical), we have to step around this threshold check in this narrow case
* Add the `ArrowWriterProperties::store_schema()` option, which stores the Arrow schema used to create a Parquet file in a special `ARROW:schema` key in the file metadata, so that we can detect that a column was originally a DictionaryArray (see the roundtrip sketch below). This option is off by default, but enabled in the Python bindings. We can always make it the default in the future

I think that's most things. One end result of this is that `arrow::DictionaryArray` types from C++ and `pandas.Categorical` types coerced from pandas, with string dictionary values, will be accurately preserved end-to-end. With a little more work (which I think can be done in a follow-up PR) we can support the rest of the Parquet physical types.

This was a fairly ugly project and I've doubtless left some messes around that we should clean up, perhaps in follow-up patches. I'll post some benchmarks later to assess the improvements in read and write performance. In the case of writing dictionary-encoded strings, the boost should be significant.
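To make the end-to-end behavior concrete, here is a minimal C++ sketch of the roundtrip (written against a recent Arrow/Parquet C++ API, so exact signatures may differ from the tree this patch targets; the file name `dict.parquet` and column name `f` are arbitrary):

```cpp
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

arrow::Status RoundtripDictionary() {
  // Build dictionary<values=string, indices=int32> data by hand.
  arrow::StringBuilder dict_builder;
  ARROW_RETURN_NOT_OK(dict_builder.AppendValues({"apple", "banana", "cherry"}));
  std::shared_ptr<arrow::Array> dict_values;
  ARROW_RETURN_NOT_OK(dict_builder.Finish(&dict_values));

  arrow::Int32Builder index_builder;
  ARROW_RETURN_NOT_OK(index_builder.AppendValues({0, 1, 0, 2, 1}));
  std::shared_ptr<arrow::Array> indices;
  ARROW_RETURN_NOT_OK(index_builder.Finish(&indices));

  ARROW_ASSIGN_OR_RAISE(
      std::shared_ptr<arrow::Array> dict_array,
      arrow::DictionaryArray::FromArrays(
          arrow::dictionary(arrow::int32(), arrow::utf8()), indices, dict_values));

  auto table = arrow::Table::Make(
      arrow::schema({arrow::field("f", dict_array->type())}), {dict_array});

  // store_schema() embeds the Arrow schema under the ARROW:schema metadata key.
  auto arrow_props =
      parquet::ArrowWriterProperties::Builder().store_schema()->build();
  ARROW_ASSIGN_OR_RAISE(auto sink,
                        arrow::io::FileOutputStream::Open("dict.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), sink, /*chunk_size=*/1024,
      parquet::default_writer_properties(), arrow_props));
  ARROW_RETURN_NOT_OK(sink->Close());

  // Because the schema was stored, the reader restores the dictionary type
  // automatically instead of returning a dense string column.
  ARROW_ASSIGN_OR_RAISE(auto infile,
                        arrow::io::ReadableFile::Open("dict.parquet"));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> roundtripped;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&roundtripped));
  // roundtripped->schema()->field(0)->type() is dictionary<string, int32> again.
  return arrow::Status::OK();
}
```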
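A quick way to verify that the schema was actually stored is to inspect the file-level key/value metadata with the low-level reader (again a sketch; `Contains` is on `arrow::KeyValueMetadata` as I understand the current API):

```cpp
#include <memory>
#include <string>
#include <parquet/file_reader.h>

// Returns true if the file carries the serialized Arrow schema written by
// ArrowWriterProperties::store_schema().
bool HasStoredArrowSchema(const std::string& path) {
  std::unique_ptr<parquet::ParquetFileReader> raw =
      parquet::ParquetFileReader::OpenFile(path);
  auto kv = raw->metadata()->key_value_metadata();
  return kv != nullptr && kv->Contains("ARROW:schema");
}
```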
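For the curious, the new `DictEncoder` entry points can also be driven directly. A minimal sketch, assuming a `ColumnDescriptor*` for a BYTE_ARRAY column is available from elsewhere and that `MakeTypedEncoder` keeps its current shape in `parquet/encoding.h` -- this is internal API, so treat the details as illustrative:

```cpp
#include <arrow/array.h>
#include <parquet/encoding.h>

// Feed a dictionary and its indices straight into a dictionary encoder,
// mirroring what TypedColumnWriterImpl::WriteArrowDictionary does internally.
void DirectDictionaryPut(const parquet::ColumnDescriptor* descr,
                         const arrow::Array& dictionary,  // e.g. a StringArray
                         const arrow::Array& indices) {   // e.g. an Int32Array
  auto base = parquet::MakeTypedEncoder<parquet::ByteArrayType>(
      parquet::Encoding::PLAIN, /*use_dictionary=*/true, descr);
  auto* encoder =
      dynamic_cast<parquet::DictEncoder<parquet::ByteArrayType>*>(base.get());
  encoder->PutDictionary(dictionary);  // write the dictionary values once
  encoder->PutIndices(indices);        // append RLE-encoded indices
}
```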
Closes #5077 from wesm/ARROW-3246 and squashes the following commits:

6b1769cb1 <Wes McKinney> Restore statistics aliases
ad3bad34a <Wes McKinney> Address code review comments
7f3a2a89f <Wes McKinney> Fix another KeyValueMetadata factory
494b954ac <Wes McKinney> Use other KeyValueMetadata factory function to hopefully appease MinGW
92cf4e063 <Wes McKinney> Code review feedback
91555b645 <Wes McKinney> Fix another new MSVC warning
3f45fef0a <Wes McKinney> Check more random seeds, fix warnings
8f4cd4463 <Wes McKinney> Fix DecodeArrow bug which only occurred when there are nulls at the end of the data
5dc00b1ff <Wes McKinney> Fix MSVC compilation warnings
3425da4f2 <Wes McKinney> Revert change causing ASAN failure
f26d7da80 <Wes McKinney> Fix up Python unit tests given schema serialization
7d663d5ac <Wes McKinney> Store schema when writing from Python, add unit test to exhibit direct dictionary reads
7705fdbc8 <Wes McKinney> Automatically read dictionary fields by serializing the Arrow schema with the store_schema option
580a0ca9c <Wes McKinney> Add failing unit test for Arrow store schema option
28268d624 <Wes McKinney> Add unit test for writing changing dictionaries
de9d0a5ae <Wes McKinney> Fix null dictionary test, unit tests passing again
138053158 <Wes McKinney> Closer to full dictionary write, NA test failing
0a293ee27 <Wes McKinney> More scaffolding
8cc1bcfa9 <Wes McKinney> Unit test for PutDictionary, PutIndices
5aaf2817b <Wes McKinney> Temp
d21ebd852 <Wes McKinney> Get all direct put unit tests passing
edc9f8473 <Wes McKinney> Fix unit tests
57a45e0ce <Wes McKinney> Direct binary put works
1264e10cc <Wes McKinney> More direct encoding implementation stubs
882e4341f <Wes McKinney> TypedComparator/TypedStatistics augmentations for arrow::BinaryArray
245f44579 <Wes McKinney> ByteArray statistics specializations
c4d7dc279 <Wes McKinney> Refactor and add Arrow encoder stubs
c871ea971 <Wes McKinney> Refactor WriteBatch/WriteBatchSpaced to utilize helper functions

Authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>