Commits


Sutou Kouhei authored and GitHub committed 1e45e182a71
GH-38837: [Format] Add the specification for statistics schema (#45058)

### Rationale for this change

Statistics are useful for fast query processing. Many query engines use statistics to optimize their query plans. The Apache Arrow format doesn't have statistics, but other formats that can be read as Apache Arrow data may have statistics. For example, Apache Parquet C++ can read an Apache Parquet file as Apache Arrow data, and an Apache Parquet file may have statistics.

One of the Apache Arrow C streaming interface use cases is the following:

1. Module A reads an Apache Parquet file as Apache Arrow data
2. Module A passes the read Apache Arrow data to module B through the Arrow C data interface
3. Module B processes the passed Apache Arrow data

If module A can pass the statistics associated with the Apache Parquet file to module B, module B can use the statistics to optimize its query plan.

### What changes are included in this PR?

We standardize how to represent statistics as an Apache Arrow array for easy exchange.

We don't standardize how to pass the statistics array. You can use any interface for it. For example, you can use the Apache Arrow C data interface.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #38837

Lead-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>