David Li authored and Wes McKinney committed bcd2e946e03, 02 May 2020
PARQUET-1820: [C++] pre-buffer specified columns of row group

This hooks up Antoine's read coalescing implementation to the Parquet reader. It takes row groups and column indices into account, so it should work well both for scans of the entire file and for selecting a few columns out of many. It also exposes some options on the Parquet ReaderProperties to control this. (Is exposing Arrow types like that OK, or should I wrap things?)
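
For readers unfamiliar with read coalescing, here is a minimal standalone sketch of the general idea. It is not the code in this change and not Arrow's API: `ByteRange`, `CoalesceRanges`, `hole_size_limit`, and `range_size_limit` are hypothetical names chosen for illustration. The assumption is that each selected column chunk of each selected row group maps to one byte range in the file, and that nearby ranges are merged into fewer, larger reads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration only; not an Arrow or Parquet type.
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Merge byte ranges that are separated by at most `hole_size_limit` bytes,
// as long as the merged range does not exceed `range_size_limit` bytes.
// Both parameter names are assumptions made for this sketch.
std::vector<ByteRange> CoalesceRanges(std::vector<ByteRange> ranges,
                                      int64_t hole_size_limit,
                                      int64_t range_size_limit) {
  assert(hole_size_limit >= 0 && range_size_limit > 0);

  // Process ranges in file order so that only neighbours need comparing.
  std::sort(ranges.begin(), ranges.end(),
            [](const ByteRange& a, const ByteRange& b) {
              return a.offset < b.offset;
            });

  std::vector<ByteRange> merged;
  for (const ByteRange& r : ranges) {
    if (r.length <= 0) continue;  // skip empty ranges
    if (!merged.empty()) {
      ByteRange& last = merged.back();
      const int64_t last_end = last.offset + last.length;
      const int64_t new_end = std::max(last_end, r.offset + r.length);
      const bool close_enough = r.offset - last_end <= hole_size_limit;
      const bool small_enough = new_end - last.offset <= range_size_limit;
      if (close_enough && small_enough) {
        // Extend the previous read to cover this range and the gap before it.
        last.length = new_end - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }
  return merged;
}
```

With a 64-byte hole limit, the ranges `{0, 100}` and `{110, 50}` would become a single read of 160 bytes, while a range at offset 10000 would stay a separate read. The trade-off is fewer round trips at the cost of reading some unneeded bytes in the gaps, which matters most on high-latency storage.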

I'll have benchmarks later. It seems a clear win locally and against remote S3, but when I initially tried it from EC2 to S3, it was worse. I believe that's because the "naive" read happened to be near optimal on the particular dataset tested.

Marking this WIP as I'd like to get feedback on the approach.

I believe this subsumes PARQUET-1698/#6138. This is not exposed yet to Python or Datasets.

Closes #6744 from lidavidm/parquet-1698-coalesce-reads

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>