Commits


mwish authored and GitHub committed be1dcdb96b0
GH-38860: [C++][Parquet] Using length to optimize bloom filter read (#38863)

### Rationale for this change

Parquet format 2.10 supports a `bloom_filter_length` field [1]. We'd like to use this length when reading bloom filters.

The current implementation [2] reads a bloom filter in two steps:

1. Read the header using a "guessed" header length. The header is likely to be 40 bytes, but a larger value is used in case the header evolves.
2. Get the bloom filter length from the header, then load the filter from the input.

With `bloom_filter_length`, we can load the whole bloom filter directly, without reading twice.

We shouldn't remove the old code path, because we still need to read old files. We also need to generate a new parquet-testing file (I can do this ASAP).

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L824
[2] https://github.com/apache/arrow/blob/main/cpp/src/parquet/bloom_filter.cc#L117

### What changes are included in this PR?

* [x] Support basic read with `bloom_filter_length`
* [x] Enhance the JsonPrinter
* [x] Testing

### Are these changes tested?

* [x] Tested using parquet-testing

### Are there any user-facing changes?

* Closes: #38860

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
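
The difference between the two read paths described in the rationale can be sketched roughly as follows. This is a toy illustration, not the code in `bloom_filter.cc`: `CountingFile`, `MakeToyFile`, `DecodeToyHeader`, `ReadBloomFilter`, and both size constants are hypothetical names, and the header here is a plain 4-byte little-endian length rather than Parquet's real Thrift-encoded header. It also assumes that `bloom_filter_length` covers the header plus the bitset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical in-memory "file" that counts how many reads are issued.
struct CountingFile {
  std::vector<uint8_t> bytes;
  int num_reads = 0;

  // Reads up to `length` bytes starting at `offset` (short read at EOF).
  std::vector<uint8_t> ReadAt(std::size_t offset, std::size_t length) {
    ++num_reads;
    assert(offset <= bytes.size());
    const std::size_t end = std::min(bytes.size(), offset + length);
    return std::vector<uint8_t>(bytes.begin() + static_cast<std::ptrdiff_t>(offset),
                                bytes.begin() + static_cast<std::ptrdiff_t>(end));
  }
};

// Toy header: a 4-byte little-endian bitset size. NOT the real Thrift header.
constexpr std::size_t kToyHeaderSize = 4;
// Deliberately larger than the header, mirroring the "guessed" length idea.
constexpr std::size_t kGuessedHeaderSize = 64;

uint32_t DecodeToyHeader(const std::vector<uint8_t>& buf) {
  assert(buf.size() >= kToyHeaderSize);
  return static_cast<uint32_t>(buf[0]) | (static_cast<uint32_t>(buf[1]) << 8) |
         (static_cast<uint32_t>(buf[2]) << 16) | (static_cast<uint32_t>(buf[3]) << 24);
}

// Builds a file with `padding` unrelated bytes, then toy header, then bitset.
CountingFile MakeToyFile(std::size_t padding, const std::vector<uint8_t>& bitset) {
  CountingFile file;
  file.bytes.assign(padding, 0);
  const uint32_t n = static_cast<uint32_t>(bitset.size());
  for (int shift = 0; shift < 32; shift += 8) {
    file.bytes.push_back(static_cast<uint8_t>((n >> shift) & 0xFF));
  }
  file.bytes.insert(file.bytes.end(), bitset.begin(), bitset.end());
  return file;
}

// Returns the bitset stored at `offset`. If `bloom_filter_length` is known,
// a single read fetches header and bitset together; otherwise fall back to
// the legacy two-step read, which must stay for files without the field.
std::vector<uint8_t> ReadBloomFilter(CountingFile& file, std::size_t offset,
                                     std::optional<std::size_t> bloom_filter_length) {
  if (bloom_filter_length.has_value()) {
    // New path: one read of exactly header + bitset.
    const std::vector<uint8_t> buf = file.ReadAt(offset, *bloom_filter_length);
    const std::size_t n = DecodeToyHeader(buf);
    assert(kToyHeaderSize + n <= buf.size());
    return std::vector<uint8_t>(
        buf.begin() + static_cast<std::ptrdiff_t>(kToyHeaderSize),
        buf.begin() + static_cast<std::ptrdiff_t>(kToyHeaderSize + n));
  }
  // Legacy path, step 1: read a guessed number of bytes to decode the header.
  const std::vector<uint8_t> head = file.ReadAt(offset, kGuessedHeaderSize);
  const std::size_t n = DecodeToyHeader(head);
  // Legacy path, step 2: read the bitset now that its size is known.
  return file.ReadAt(offset + kToyHeaderSize, n);
}
```

In this sketch, calling `ReadBloomFilter` with a known length issues one read against `CountingFile`, while passing `std::nullopt` issues two. The fallback branch is kept on purpose, matching the note above that the old path cannot be removed.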