Commits

Wes McKinney authored 45e41cad3ed
ARROW-6417: [C++][Parquet] Miscellaneous optimizations yielding slightly better Parquet binary read performance

A handful of things here:

* Using preallocation and `UnsafeAppend` on the primary binary read path
* Changed Parquet decode APIs to decode into a helper data structure to avoid the extra machinery of ChunkedBinaryBuilder. These APIs are pseudopublic (for testing purposes) and not exposed to the user, so this doesn't affect any public APIs

This produces about 10% net benefit in a holistic benchmark script from Python. The microbenchmarks in parquet-encoding-benchmark are much more clear

before

```
BM_ArrowBinaryPlain/DecodeArrow_Dense/1024              20204 ns      20205 ns      34568   307.562MB/s
BM_ArrowBinaryPlain/DecodeArrow_Dense/4096              81111 ns      81111 ns       8581   295.352MB/s
BM_ArrowBinaryPlain/DecodeArrow_Dense/32768            622801 ns     622797 ns       1102   300.966MB/s
BM_ArrowBinaryPlain/DecodeArrow_Dense/65536           1245664 ns    1245663 ns        561   301.988MB/s
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/1024       17865 ns      17865 ns      39028   347.839MB/s
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/4096       71837 ns      71835 ns       9604   333.492MB/s
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/32768     564087 ns     564075 ns       1248   332.298MB/s
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/65536    1123738 ns    1123722 ns        611   334.758MB/s
```

after

```
BM_ArrowBinaryPlain/DecodeArrow_Dense/1024               5922 ns       5923 ns     115651   1049.24MB/s
BM_ArrowBinaryPlain/DecodeArrow_Dense/4096              35340 ns      35340 ns      19920   677.887MB/s
BM_ArrowBinaryPlain/DecodeArrow_Dense/32768            319888 ns     319882 ns       2194   585.968MB/s
BM_ArrowBinaryPlain/DecodeArrow_Dense/65536            642640 ns     642640 ns       1100   585.358MB/s
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/1024        6568 ns       6568 ns     104715   946.191MB/s
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/4096       30890 ns      30890 ns      22661    775.53MB/s
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/32768     257427 ns     257426 ns       2711   728.135MB/s
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/65536     516614 ns     516600 ns       1350   728.174MB/s
```

The dictionary decoding case is unchanged; this should be optimized separately.

Closes #5268 from wesm/ARROW-6417 and squashes the following commits:

a4eab7da3 <Wes McKinney> Fix up parquet-encoding-benchmark
6f3335b20 <Wes McKinney> Tune performance of dense Parquet BYTE_ARRAY reads to Arrow BinaryArray

Authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
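
Note: the "preallocation and `UnsafeAppend`" idea from the first bullet can be illustrated with a small standard-library sketch. This is not the code from this commit and does not use Arrow's builder classes; `BinaryAccumulator`, `Reserve`, `Value`, and `AppendAll` are hypothetical names chosen for illustration. The point it tries to show is that reserving space once up front lets the per-value append skip capacity checks and reallocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical accumulator using an offsets + data layout for
// variable-length binary values. Illustration only, not Arrow's builder.
struct BinaryAccumulator {
  std::vector<std::int32_t> offsets{0};  // value i spans offsets[i]..offsets[i+1]
  std::string data;                      // concatenated value bytes
  std::size_t reserved_values = 0;       // value slots promised by Reserve

  // Pay the allocation cost once, before the append loop.
  void Reserve(std::size_t num_values, std::size_t num_bytes) {
    reserved_values = num_values;
    offsets.reserve(num_values + 1);
    data.reserve(num_bytes);
  }

  // Append without growing. The caller must have reserved enough room;
  // the asserts only check that contract in debug builds.
  void UnsafeAppend(std::string_view value) {
    assert(offsets.size() <= reserved_values);
    assert(data.size() + value.size() <= data.capacity());
    data.append(value);
    offsets.push_back(static_cast<std::int32_t>(data.size()));
  }

  std::size_t length() const { return offsets.size() - 1; }

  // View of value i; valid only while this accumulator is alive and unmodified.
  std::string_view Value(std::size_t i) const {
    return std::string_view(data).substr(offsets[i], offsets[i + 1] - offsets[i]);
  }
};

// Hypothetical helper: size the buffers in one pass, then append unchecked.
BinaryAccumulator AppendAll(const std::vector<std::string_view>& values) {
  std::size_t total_bytes = 0;
  for (std::string_view v : values) total_bytes += v.size();

  BinaryAccumulator acc;
  acc.Reserve(values.size(), total_bytes);
  for (std::string_view v : values) acc.UnsafeAppend(v);
  return acc;
}
```

The sketch requires C++17 for `std::string_view`. Calling `AppendAll` with a vector of views returns an accumulator whose `Value(i)` yields each input in order; the trade-off is that `UnsafeAppend` is only correct when the earlier `Reserve` call covered every value and byte that follows.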