Commits

Wes McKinney authored a79cc809883
ARROW-585: [C++] Experimental public API for user-defined extension types and arrays

This patch proposes a public API for user-defined C++ types that can be sent and received faithfully in Arrow's IPC protocol. Summary of approach:

* User implements a subclass of `arrow::ExtensionType`, which wraps an underlying "storage type" (how the data is represented in memory). This includes serialization, deserialization, and array wrapper APIs
* User implements a subclass of `arrow::ExtensionArray`, their custom container for the user-defined type. This wraps data matching the "storage" type
* The extension type is registered globally with `arrow::RegisterExtensionType`
* Extension type metadata is embedded in the `Field::custom_metadata` Flatbuffers field in two keys, `arrow_extension_name` and `arrow_extension_data`. These represent the name of the type and the serialized internals of the type, if any
* If a receiver does not have any special handling for the extension type, they can still handle the data as though it were an instance of the storage type

I implemented an example `UUIDType` in the unit tests. It is implemented like this:

* The extension type name is `"uuid"`
* The storage type is `fixed_size_binary(16)`

One issue I uncovered while working on this is that `DataType::Equals` does not compare Field metadata for nested fields. I implemented this and will open a JIRA about doing some follow-on testing to harden this.

I also implemented ARROW-572 in this patch, which modifies the IPC metadata serialization to use the visitor pattern, removing a long-standing TODO.

Per ARROW-1587, I would like to have extension types as a formal construct in the protocol, so I will propose additions to the Flatbuffers files in a separate patch, and then we can easily change the implementation here to conform to whatever decision is reached in the protocol.

Some next steps would be to provide a way for UDTs to be implemented in pure Python.
Author: Wes McKinney <wesm+git@apache.org>

Closes #3694 from wesm/ARROW-585 and squashes the following commits:

696e4f29 <Wes McKinney> Add missing switch case
1f3595b5 <Wes McKinney> Add example parametric types, one where extension name is constant, another where not
33601714 <Wes McKinney> Test schema serialization, test nested array, and test what happens when we do an IPC read of a type we don't know
147b2b70 <Wes McKinney> Add NotImplemented cases for visitors that don't yet handle extension types
af413857 <Wes McKinney> Refactor to have only ExtensionType as the class to implement, no multiple inheritance
a870b7e4 <Wes McKinney> Proposed public API for user-defined C++ extension types that can round-trip the Arrow IPC protocol
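
The mechanism summarized in the commit message (register a type by name, carry the name and serialized data in field metadata, fall back to the storage type when the name is unknown) can be illustrated with a small standard-library-only model. This is a rough sketch and not Arrow's implementation: `WireField`, `ResolvedType`, `Registry`, `MakeUuidField`, and `DeserializeUuid` are hypothetical names invented for illustration, and types are represented as plain strings. Only the two metadata keys, the `"uuid"` name, and the `fixed_size_binary(16)` storage type come from the message itself.

```cpp
// Toy model only: every type and function here is a hypothetical stand-in,
// not part of the Arrow C++ API.
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Metadata keys named in the commit message.
const char kExtensionNameKey[] = "arrow_extension_name";
const char kExtensionDataKey[] = "arrow_extension_data";

// A field as a receiver might see it: a storage type (as a string here)
// plus string key/value metadata.
struct WireField {
  std::string storage_type;
  std::map<std::string, std::string> metadata;
};

// Result of resolving a field: an extension type if the name is known,
// otherwise the plain storage type.
struct ResolvedType {
  std::string name;          // extension name, or storage type on fallback
  std::string storage_type;  // underlying in-memory representation
  bool is_extension;
};

// Rebuilds a type from its storage type and serialized internals.
using Deserializer = std::function<ResolvedType(const std::string& storage,
                                                const std::string& data)>;

// Global-registry analogue: maps extension names to deserializers.
class Registry {
 public:
  // Returns false if the name was already registered.
  bool Register(const std::string& name, Deserializer fn) {
    return deserializers_.emplace(name, std::move(fn)).second;
  }

  // Looks up the extension name in the field metadata. If the name is
  // missing or unregistered, the field is treated as its storage type.
  ResolvedType Resolve(const WireField& field) const {
    auto name_it = field.metadata.find(kExtensionNameKey);
    if (name_it != field.metadata.end()) {
      auto reg_it = deserializers_.find(name_it->second);
      if (reg_it != deserializers_.end()) {
        auto data_it = field.metadata.find(kExtensionDataKey);
        const std::string data =
            data_it == field.metadata.end() ? std::string() : data_it->second;
        return reg_it->second(field.storage_type, data);
      }
    }
    return ResolvedType{field.storage_type, field.storage_type, false};
  }

 private:
  std::map<std::string, Deserializer> deserializers_;
};

// The UUID example from the commit message: name "uuid", storage
// fixed_size_binary(16), no serialized internals.
WireField MakeUuidField() {
  return WireField{"fixed_size_binary(16)",
                   {{kExtensionNameKey, "uuid"}, {kExtensionDataKey, ""}}};
}

ResolvedType DeserializeUuid(const std::string& storage, const std::string&) {
  return ResolvedType{"uuid", storage, true};
}
```

In this model, calling `Resolve` on the field from `MakeUuidField()` before `Register("uuid", DeserializeUuid)` yields the storage type, and calling it afterwards yields the extension type, which mirrors the receiver-side fallback described in the last bullet of the summary.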