Commits


Hatem Helal authored and Wes McKinney committed fd0b90a7f7e
ARROW-3769: [C++] Add support for reading non-dictionary encoded binary Parquet columns directly as DictionaryArray

This patch addresses the following JIRAs:

* [ARROW-3769](https://issues.apache.org/jira/browse/ARROW-3769): refactored the record reader logic to toggle between different builders depending on the column type (String or Binary) and the requested array type (chunked "dense" or dictionary). These changes are covered by unit tests and benchmarks.
* [PARQUET-1537](https://issues.apache.org/jira/browse/PARQUET-1537): fixed a loop increment, covered by unit tests.

Also included is an experimental class `ArrowReaderProperties` that can be used to select which columns are read directly as an `arrow::DictionaryArray` (see the usage sketch after the list below).

I think some more work is needed to fully address the requests in [ARROW-3772](https://issues.apache.org/jira/browse/ARROW-3772), namely the ability to automatically infer which columns in a Parquet file should be read as `DictionaryArray`. My current thinking is that this could be solved by introducing optional Arrow type metadata into files written with the `parquet::arrow::FileWriter`. There are some limitations to this approach, but it would seem to satisfy the requests of users working with Parquet files within the supported Arrow ecosystem.

Note that with this patch, incremental reading of a Parquet file will not resolve the global dictionary across all of the row groups. There are a few possible solutions for this:

* Introduce a concept of an "unknown" dictionary. This would enable concatenating multiple row groups, so long as unknown dictionaries are defined as equal (assuming the indices have the same data type).
* Add an API for merging the schemas of multiple tables. This could be used after reading multiple row groups to enable concatenating the tables into one.
* Add an API for inferring the global dictionary for the entire file. This could be an expensive operation, so it should ideally be optional.
* Allow a user-specified dictionary. This could be useful in the limited case where a caller already knows the global dictionary (computed through some other mechanism).
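As a usage sketch, requesting that a column be read directly as a dictionary might look like the following. This is a minimal example assuming the `parquet::arrow` reader APIs as they exist after this change; the file name `strings.parquet` and the choice of column index 0 are hypothetical:

```cpp
#include <memory>

#include "arrow/io/file.h"
#include "arrow/table.h"
#include "parquet/arrow/reader.h"
#include "parquet/exception.h"
#include "parquet/file_reader.h"
#include "parquet/properties.h"

int main() {
  // Open the Parquet file (hypothetical path).
  std::shared_ptr<arrow::io::ReadableFile> infile =
      arrow::io::ReadableFile::Open("strings.parquet").ValueOrDie();

  // Request that column 0 be decoded directly into an arrow::DictionaryArray
  // instead of a dense chunked binary/string array.
  parquet::ArrowReaderProperties props;
  props.set_read_dictionary(/*column_index=*/0, /*read_dict=*/true);

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(parquet::arrow::FileReader::Make(
      arrow::default_memory_pool(), parquet::ParquetFileReader::Open(infile),
      props, &reader));

  std::shared_ptr<arrow::Table> table;
  PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
  // The schema now reports column 0 with a dictionary type, e.g.
  // dictionary<values=utf8, indices=int32>. As noted above, dictionaries
  // are resolved per row group, not globally for the whole file.
  return 0;
}
```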
Author: Hatem Helal <hhelal@mathworks.com>
Author: Hatem Helal <hatem.helal@gmail.com>
Author: Hatem Helal <Hatem.Helal@mathworks.co.uk>

Closes #3721 from hatemhelal/arrow-3769 and squashes the following commits:

f644fff9c <Hatem Helal> Move schema fix logic to post-processing step
023c022c3 <Hatem Helal> Add virtual destructor to WrappedBuilderInterface
99e9dee12 <Hatem Helal> Removed dependencies on arrow builder in parquet/encoding
2026b513c <Hatem Helal> Rework ByteArrayDecoder interface to reduce code duplication
5bc933b97 <Hatem Helal> use PutSpaced in test setup to correctly initialize encoded data
2c8fa7efd <Hatem Helal> revert incorrect changes to PlainByteArrayDecoder::DecodeArrow method
7719b944f <Hatem Helal> Use random string generator instead of poor JSON
e6ca0db43 <Hatem Helal> Fix DictEncoding test: need to use PutSpaced instead of Put in setup
9da133142 <Hatem Helal> Temporarily disable tests for arrow builder decoding from dictionary encoded col
7347cfa26 <Hatem Helal> Fix DecodeArrow from plain encoded columns
5fb9e860a <Hatem Helal> Rework parquet encoding tests
4d7bb30de <Hatem Helal> Refactor dictionary data generation into RandomArrayGenerator
6e65fdbdf <Hatem Helal> simplify ArrowReaderProperties and mark as experimental
babe52e38 <Hatem Helal> replace deprecated ReadableFileInterface with RandomAccessFile
a267a27d4 <Hatem Helal> remove unnecessary inlines
7aac84c45 <Hatem Helal> Reworked encoding benchmark to reduce code duplication
077a8f1ae <Hatem Helal> Move function definition to (hopefully) resolve appveyor build failure due to C2491
a35754456 <Hatem Helal> Basic unittests for reading DictionaryArray directly from parquet
a6740f31e <Hatem Helal> Make sure to update the schema when reading a column as a DictionaryArray
a8c15354e <Hatem Helal> Add support for requesting a parquet column be read as a DictionaryArray
28d76b7b2 <Hatem Helal> Add benchmark for dictionary decoding using arrow builder
8f59198e8 <Hatem Helal> Add overloads for decoding using a StringDictionaryBuilder
b16eaa978 <Hatem Helal> prefer default_random_engine to avoid potential slowdown with Mersenne Twister prng
ff380211c <Hatem Helal> prefer mersenne twister prng over default one which is implemenation defined
78eddb8af <Hatem Helal> Use value parameterization in decoding tests
84df23bfa <Hatem Helal> prefer range-based for loop to KeepRunning while loop pattern
f234ca2a2 <Hatem Helal> respond to code review feedback - many readability fixes in benchmark and tests
4fbcf1fab <Hatem Helal> fix loop increment in templated PlainByteArrayDecoder::DecodeArrow method
39a5f1994 <Hatem Helal> fix appveyor windows failure
89de5d5be <Hatem Helal> rework data generation so that decoding benchmark runs using a more realistic dataset
ef55081a0 <Hatem Helal> added benchmarks for decoding plain encoded data using arrow builders
31667ffbb <Hatem Helal> added tests for DictByteArrayDecoder and reworked previous tests
4a26f7405 <Hatem Helal> remove todo message
fa504158f <Hatem Helal> Implement DecodeArrowNonNull and unit tests
21fc45083 <Hatem Helal> Add some basic unittests that exercise the DecodeArrow methods