Commits

Wes McKinney authored e68ca7f9aed
ARROW-3144: [C++/Python] Move "dictionary" member from DictionaryType to ArrayData to allow for variable dictionaries

This patch moves the dictionary member out of DictionaryType to a new member on the internal ArrayData structure. As a result, serializing and deserializing schemas requires only a single IPC message, and schemas have no knowledge of what the dictionary values are.

The objective of this change is to correct a long-standing Arrow C++ design problem with dictionary-encoded arrays where the dictionary values must be known at schema construction time. This has plagued us all over the codebase:

* In reading Parquet files, reading directly to DictionaryArray is not simple because each row group may have a different dictionary
* In IPC streams, delta dictionaries (not yet implemented) would invalidate the pre-existing schema, causing subsequent RecordBatch objects to be incompatible
* In Arrow Flight, schema negotiation requires the dictionaries to be sent, having possibly unbounded size.
* Not possible to have different dictionaries in a ChunkedArray
* In CSV files, converting columns to dictionary in parallel would require an expensive type unification

The summary of what can be learned from this is: do not put data in type objects, only metadata. Dictionaries are data, not metadata.

There are a number of unavoidable API changes (straightforward for library users to fix) but otherwise no functional difference in the library.

As you can see the change is quite complex as significant parts of IPC read/write, JSON integration testing, and Flight needed to be reworked to alter the control flow around schema resolution and handling the first record batch.

Key APIs changed

* `DictionaryType` constructor requires a `DataType` for the dictionary value type instead of the dictionary itself. The `dictionary` factory method is correspondingly changed. The `dictionary` accessor method on `DictionaryType` is replaced with `value_type`.
* `DictionaryArray` constructor and `DictionaryArray::FromArrays` must be passed the dictionary values as an additional argument.
* `DictionaryMemo` is exposed in the public API as it is now required for granular interactions with IPC messages with such functions as `ipc::ReadSchema` and `ipc::ReadRecordBatch`
* A `DictionaryMemo*` argument is added to several low-level public functions in `ipc/writer.h` and `ipc/reader.h`

Some other incidental changes:

* Because DictionaryType objects could be reused previously in Schemas, such dictionaries would be "deduplicated" in IPC messages in passing. This is no longer possible by the same trick, so dictionary reuse will have to be handled in a different way (I opened ARROW-5340 to investigate)
* As a result of this, an integration test that featured dictionary reuse has been changed to not reuse dictionaries. Technically this is a regression, but I didn't want to block the patch over it
* R is added to allow_failures in Travis CI for now

Author: Wes McKinney <wesm+git@apache.org>
Author: Kouhei Sutou <kou@clear-code.com>
Author: Antoine Pitrou <antoine@python.org>

Closes #4316 from wesm/ARROW-3144 and squashes the following commits:

9f1ccfbf4 <Kouhei Sutou> Follow DictionaryArray changes
89e274da5 <Wes McKinney> Do not reuse dictionaries in integration tests for now until more follow on work around this can be done
f62819f5b <Wes McKinney> Support many fields referencing the same dictionary, fix integration tests
37e82b4da <Antoine Pitrou> Fix CUDA and Duration issues
037075083 <Wes McKinney> Add R to allow_failures for now
bd04774e2 <Wes McKinney> Code review comments
b1cc52e62 <Wes McKinney> Fix rest of Python unit tests, fix some incorrect code comments
f1178b2a6 <Wes McKinney> Fix all but 3 Python unit tests
ab7fc1741 <Wes McKinney> Fix up Cython compilation, haven't fixed unit tests yet though
6ce51ef79 <Wes McKinney> Get everything compiling again
e23c578fd <Wes McKinney> Fix Parquet tests
c73b2162f <Wes McKinney> arrow-tests all passing again, huzzah!
04d40e8e6 <Wes McKinney> Flat dictionary IPC test passing now
481f316dc <Wes McKinney> Get JSON integration tests passing again
77a43dc9f <Wes McKinney> Fix pretty_print-test
f4ada6685 <Wes McKinney> array-tests compilers again
8276dce6c <Wes McKinney> libarrow compiles again
8ea0e260a <Wes McKinney> Refactor IPC read path for new paradigm
a1afe879a <Wes McKinney> More refactoring to have correct logic in IPC paths, not yet done
aed04304e <Wes McKinney> More refactoring, regularize some type names
6bd72f946 <Wes McKinney> Start porting changes
24f99f16b <Wes McKinney> Initial boilerplate
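
The design rule stated in the message above (type objects carry only metadata; dictionaries are data) can be illustrated with a small, library-free Python sketch. It is not the Arrow implementation: the `DictionaryType` and `ArrayData` classes and the `encode` helper below are hypothetical stand-ins that only borrow the names used in the commit message. The sketch shows that once the dictionary lives on the array data, two chunks can share one type while each holds its own dictionary, which is the ChunkedArray and Parquet row-group case the message describes.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class DictionaryType:
    """Type object holding metadata only (hypothetical stand-in, not Arrow's class)."""
    index_type: str
    value_type: str
    ordered: bool = False


@dataclass
class ArrayData:
    """Array payload: the indices plus the dictionary values they point into."""
    type: DictionaryType
    indices: List[int]
    dictionary: List[str]

    def decode(self) -> List[str]:
        # Look each index up in this array's own dictionary.
        return [self.dictionary[i] for i in self.indices]


def encode(values: Sequence[str], dtype: DictionaryType) -> ArrayData:
    """Dictionary-encode one chunk, building a dictionary local to that chunk."""
    dictionary: List[str] = []
    positions: Dict[str, int] = {}
    indices: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return ArrayData(dtype, indices, dictionary)


dtype = DictionaryType(index_type="int8", value_type="string")
chunks = [encode(["a", "b", "a"], dtype), encode(["c", "a"], dtype)]

# The chunks share one type even though their dictionaries differ.
same_type = chunks[0].type == chunks[1].type
decoded = [value for chunk in chunks for value in chunk.decode()]
```

Because the type compares equal across chunks, a schema built from it never has to change when a new dictionary shows up. That is the property the patch is after for IPC streams and Flight schema negotiation.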