Commits


Rylan Dmello authored and Wes McKinney committed 1a0e976a520
PARQUET-1482: [C++] Add branch to TypedRecordReader::ReadNewPage for … …PageType::DATA_PAGE_V2 to address incompatibility with parquetjs. **Tests** This commit doesn't include tests right now; I am working on adding tests and was hoping for some initial feedback on the code changes. I may need to use an actual file generated by `parquetjs` to test this issue, so I wonder if adding `feeds1kMicros.parquet` from the JIRA task to the parquet-testing repository is an option for this. **Description** `parquetjs` seems to be writing Parquet V2 files with [`DataPageV2`](https://github.com/apache/parquet-format/blob/e93dd628d90aa076745558998f0bf5d9c262bf22/src/main/thrift/parquet.thrift#L529) pages, while `parquet-cpp` writes Parquet V2 files with [`DataPage`](https://github.com/apache/parquet-format/blob/e93dd628d90aa076745558998f0bf5d9c262bf22/src/main/thrift/parquet.thrift#L492) pages. Since `TypedRecordReader::ReadNewPage()` only had a branch for `PageType::DATA_PAGE`, the reader would return without reading any data for records that have `DATA_PAGE_V2` pages. This explains the behavior observed in [PARQUET-1482](https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-1482?filter=allopenissues). This commit adds a new if-else branch for the `DataPageV2` case in `TypedRecordReader::ReadNewPage()`. Since the `DataPageV2` branch needed to reuse the code from the `DataPage` case, I refactored the repetition/definition level decoder initialization and the data decoder initialization to two new methods in the `TypedRecordReader` class. These new methods are now called by the `DataPage` and `DataPageV2` initialization branches in `TypedRecordReader::ReadNewPage()`. There is an alternate implementation possible (with a smaller diff) by sharing the same else-if branch between `DataPage` and `DataPageV2` using a pointer-to-derived `shared_ptr<Page>`. However, since the Page superclass doesn't have the necessary `encoding()` or `num_values()` methods, I would need to add a common superclass to both `DataPage` and `DataPageV2` that defined these methods. I didn't do this because I was hesitant to modify the `Page` class hierarchy for this commit. Author: Wes McKinney <wesm+git@apache.org> Author: rdmello <rylan.dmello@mathworks.com> Author: Rylan Dmello <rdmello@users.noreply.github.com> Closes #3312 from rdmello/parquet_1482 and squashes the following commits: c5cb0f37 <Wes McKinney> Add DataPage base class for DataPageV1 and DataPageV2 8df8328e <rdmello> PARQUET-1482: Adding basic unit test for DataPageV2 serialization and deserialization. 9df32222 <Rylan Dmello> PARQUET-1482: Add branch to TypedRecordReader::ReadNewPage for PageType::DATA_PAGE_V2 to address incompatibility with parquetjs.