Commits


TP Boudreau authored and Wes McKinney committed 38b1ddfb7f5
PARQUET-1411: [C++] Add parameterized logical annotations to Parquet metadata

This PR updates the parquet-cpp implementation to use parameterized logical type annotations as requested in [PARQUET-1411](https://issues.apache.org/jira/browse/PARQUET-1411). The primary contributions are:

1. Amend the parquet.thrift file, consistent with the Parquet format repository, to allow Thrift to recognize and serialize a new LogicalType
2. Introduce and integrate a LogicalAnnotation class (and subclasses) in the parquet-cpp library to handle functionality related to this new attribute
3. Expand the public API to include construction functions for GroupNode and PrimitiveNode schema classes that accept the new LogicalAnnotation parameter
4. Add basic unit and integration tests for the new code.

Some (hopefully time-saving) notes for reviewers:

- The center of gravity for the PR is the LogicalAnnotation class, so it might be best to start by having a look at its interface in types.h
- LogicalType would have been a more natural name for this class, but unfortunately that was already in use (in the public API) for the concept Thrift calls "converted type". Here's a refresher chart of some relevant concepts and their names in the code:

  | Concept | Thrift realization | Parquet-CPP realization |
  |---------|--------------------|-------------------------|
  | Physical storage type | enum Type | enum parquet::Type::type |
  | Converted type | enum ConvertedType; struct parquet::format::ConvertedType | enum parquet::LogicalType::type; struct parquet::schema::DecimalMetadata |
  | Logical annotation | union LogicalType; struct parquet::format::LogicalType | class parquet::LogicalAnnotation (enum parquet::LogicalAnnotation::Type) |

- After this change, the LogicalAnnotation member (stored in Node.logical_annotation_) is the primary controller of logical typing in the library. To simplify backward compatibility tasks, the existing converted type/decimal metadata pair is retained and populated at construction with compatible values (where possible).
- While new Make() functions are introduced to accommodate schema node construction with the new annotations, the existing corresponding public API functions accepting converted types are retained. The maintainers might wish to consider deprecating those functions.
- As required by the parquet format specification, after this change, both the converted and logical types are serialized; the new logical type (if present) takes precedence on deserialization, but the legacy converted type is still recognized in the absence of a logical type.

  > The existing converted type INTERVAL has no corresponding logical type in the parquet.thrift definition, so it can only be serialized as a converted type. As with any other element in which only converted type information is present, upon deserialization an equivalent logical annotation is instantiated and it is treated internally as any other type.

- There are several ancillary functions, such as printing and equivalence checking, in which the converted types are still used; in most cases these fallbacks and double-checks are probably redundant. Note, however, that in the JSON output *both* converted and logical types are intentionally emitted, on the assumption that consumers might be relying on the current format as part of a quasi-public API. The maintainers may wish to consider deprecating the converted type fields there as well.

I'm far from an expert in the Parquet data format or its implementing libraries, so I may have misunderstood or omitted things; all corrections or recommendations are welcome.
Author: TP Boudreau <tpboudreau@gmail.com>

Closes #4185 from tpboudreau/PARQUET-1411 and squashes the following commits:

00ef84d2b <TP Boudreau> Widen access to implementation classes
fdfbf87fc <TP Boudreau> Include logical annotation in method ColumnDescriptor::ToString()
a5e2386b2 <TP Boudreau> Add and integrate logical annotations for schema nodes
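
The serialization rule described in the reviewer notes (write both fields, prefer the logical type when reading, fall back to the converted type otherwise, and treat INTERVAL as converted-only) can be illustrated with a small self-contained model. This is only a sketch: `ConvertedKind`, `AnnotationKind`, `SchemaElement`, `Serialize`, and `ResolveAnnotation` are hypothetical names invented for the illustration and are not the parquet-cpp or Thrift-generated types touched by this commit.

```cpp
#include <optional>

// Hypothetical, simplified stand-ins; NOT the real parquet-cpp types.
enum class ConvertedKind { None, Utf8, Date, Interval };
enum class AnnotationKind { None, String, Date, Interval };

// Models a serialized schema element carrying both fields.
struct SchemaElement {
  std::optional<AnnotationKind> logical_type;  // absent in legacy files
  ConvertedKind converted_type = ConvertedKind::None;
};

// Maps an annotation to its compatible legacy converted type.
ConvertedKind ToConverted(AnnotationKind a) {
  switch (a) {
    case AnnotationKind::String:   return ConvertedKind::Utf8;
    case AnnotationKind::Date:     return ConvertedKind::Date;
    case AnnotationKind::Interval: return ConvertedKind::Interval;
    default:                       return ConvertedKind::None;
  }
}

// Maps a legacy converted type to an equivalent annotation.
AnnotationKind FromConverted(ConvertedKind c) {
  switch (c) {
    case ConvertedKind::Utf8:     return AnnotationKind::String;
    case ConvertedKind::Date:     return AnnotationKind::Date;
    case ConvertedKind::Interval: return AnnotationKind::Interval;
    default:                      return AnnotationKind::None;
  }
}

// Writes both fields; INTERVAL (and None) get no logical type because,
// per the note above, INTERVAL has no counterpart in the thrift union.
SchemaElement Serialize(AnnotationKind a) {
  SchemaElement e;
  e.converted_type = ToConverted(a);
  if (a != AnnotationKind::Interval && a != AnnotationKind::None) {
    e.logical_type = a;
  }
  return e;
}

// The logical type, when present, takes precedence; otherwise an
// equivalent annotation is derived from the converted type.
AnnotationKind ResolveAnnotation(const SchemaElement& e) {
  if (e.logical_type.has_value()) return *e.logical_type;
  return FromConverted(e.converted_type);
}
```

In this model, an element written for an INTERVAL annotation carries only the converted type, yet `ResolveAnnotation` still yields an INTERVAL annotation when it is read back, mirroring the round trip the commit message describes.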