GH-45485: [Dev] Simplify pull request template (#45599)
### Rationale for this change
It seems that the current comment based pull request template isn't read carefully.
### What changes are included in this PR?
* Remove explanations as comments
* Keep a basic introduction at the top as normal text, not a comment
* Use normal text, not comments, for breaking changes and critical fixes
### Are these changes tested?
No.
### Are there any user-faci...
GH-45591: [C++][Acero] Refine hash join benchmark and remove openmp from the project (#45593)
### Rationale for this change
See #45591 .
### What changes are included in this PR?
1. Replace the usage of openmp with arrow-native multi-threading primitives;
2. Remove all the occurrences of openmp from the project;
3. Support stats for build side rows in hash join benchmark, and update certain benchmarks.
### Are these changes tested?
Manually tested.
### Are there any user-facing c...
GH-45587: [C++][Docs] Fix the statistics schema link in `arrow::RecordBatch::MakeStatisticsArray()`'s docstring (#45588)
### Rationale for this change
`arrow::RecordBatch::MakeStatisticsArray()`'s docstring uses https://arrow.apache.org/docs/format/CDataInterfaceStatistics.html not https://arrow.apache.org/docs/format/StatisticsSchema.html for statistics schema URL.
Because https://github.com/apache/arrow/pull/44252 assumed that we use https://github.com/apache/arrow/pull/43553 but we use https://github.com/apa...
GH-45568: [C++][Parquet][CMake] Enable zlib automatically when Thrift is needed (#45569)
### Rationale for this change
Required dependencies checks must be done automatically.
### What changes are included in this PR?
* Fix variable name
* Fix check order
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #45568
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
GH-45566: [C++][Parquet][CMake] Remove a workaround for Windows in FindThriftAlt.cmake (#45567)
### Rationale for this change
In general, we want to remove workarounds as much as possible for maintainability.
### What changes are included in this PR?
https://github.com/apache/thrift/pull/2725 isn't released yet but MSYS2 has another workaround: https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-thrift/002-fix-pkgconfig-paths.patch
R uses RTools packages but the Apache Thrif...
GH-45584: [C++][Thirdparty] Bump zstd to v1.5.7 (#45585)
### Rationale for this change
Zstd 1.5.7 has been released: https://github.com/facebook/zstd/releases/tag/v1.5.7 . It includes an optimization that improves compression speed for small blocks:
> The compression speed for small data blocks has been notably improved at fast compression levels, thanks to contributions from TocarIP, further extended in https://github.com/facebook/zstd/pull/4165. Below are benchmar...
GH-39010: [Python] Introduce `maps_as_pydicts` parameter for `to_pylist`, `to_pydict`, `as_py` (#45471)
### Rationale for this change
Currently, `MapScalar`/`Array` types are not deserialized into proper Python `dict`s, which is unfortunate since this breaks "roundtrips" from Python -> Arrow -> Python:
```
import pyarrow as pa
schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data, schema=schema).to_pylist()
...
GH-45570: [Python] Allow Decimal32/64Array.to_pandas (#45571)
### Rationale for this change
Enables converting `Decimal32Array` and `Decimal64Array` to pandas
### What changes are included in this PR?
Adds `Type::DECIMAL32` and `Type::DECIMAL64` as convertible types to pandas
### Are these changes tested?
Yes
### Are there any user-facing changes?
Yes
closes https://github.com/apache/arrow/issues/45570
* GitHub Issue: #45570
Lead-authored-by: ...
GH-45572: [C++][Compute] Add rank_normal function (#45573)
### Rationale for this change
Computing ranks as values of the "probit" function (https://en.wikipedia.org/wiki/Probit), rather than quantiles between 0 and 1, can be useful for machine learning and other tasks.
### What changes are included in this PR?
Add a "rank_normal" function that computes array ranks as points on the normal distribution.
It is similar to calling the "rank_quantile" f...
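As a rough illustration of the idea (not the Arrow implementation), the sketch below maps each value to a quantile rank and then applies the probit function using only the Python standard library. The function name `rank_normal_sketch`, the `(rank - 0.5) / n` quantile convention, and the averaging of tied ranks are assumptions made for this example; the source does not state which conventions the C++ kernel uses.

```python
from statistics import NormalDist


def rank_normal_sketch(values):
    """Map each value to the probit of its quantile rank.

    Assumptions for illustration only: ties receive the average of their
    1-based ranks, and the quantile of a rank is (rank - 0.5) / n.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    result = [0.0] * n
    standard_normal = NormalDist()  # mean 0, standard deviation 1

    start = 0
    while start < n:
        # Extend the run [start, end] over equal values (ties).
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = ((start + 1) + (end + 1)) / 2  # 1-based ranks
        quantile = (average_rank - 0.5) / n           # strictly inside (0, 1)
        z = standard_normal.inv_cdf(quantile)         # probit of the quantile
        for k in range(start, end + 1):
            result[order[k]] = z
        start = end + 1
    return result
```

Under these assumptions the middle element of an odd-length array without ties maps to a value near 0, and the smallest and largest elements map to values of equal magnitude and opposite sign.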
GH-45578: [C++] Use max not min in MakeStatisticsArrayMaxApproximate test (#45579)
### Rationale for this change
The test was written for min instead of max.
### What changes are included in this PR?
Use max not min for MaxApproximate test case.
### Are these changes tested?
Yes, by dedicated unit tests.
### Are there any user-facing changes?
No
* GitHub Issue: #45578
Authored-by: arash andishgar <arashandishgar1@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code...
GH-45545: [C++][Parquet] Add missing includes (#45554)
### Rationale for this change
This fixes a compile error on Windows and macOS when attempting to package this library as a Conan package: https://github.com/conan-io/conan-center-index/pull/26623
### What changes are included in this PR?
It adds two missing STL headers, `<array>` and `<vector>`, whose absence caused the compiler errors.
### Are these changes tested?
They are tested and pa...
MINOR: [R] Clean up a linting issue from #45261 (#45575)
### Rationale for this change
Clean up a minor formatting issue introduced in #45261
### What changes are included in this PR?
Remove two new lines
### Are these changes tested?
Yes operative changes, linter should pass
### Are there any user-facing changes?
No
Authored-by: Jonathan Keane <jkeane@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
GH-44924: [R] Remove usage of cpp11's HAS_UNWIND_PROTECT (#45261)
### Rationale for this change
The macro is no longer required on R >= 4.0 which is our minimum version.
### What changes are included in this PR?
Remove use of HAS_UNWIND_PROTECT
### Are these changes tested?
ci
### Are there any user-facing changes?
no
* GitHub Issue: #44924
Authored-by: Jacob Wujciak-Jens <jacob@wujciak.de>
Signed-off-by: Jacob Wujciak-Jens <jacob@wujciak.de>
GH-45536: [Dev][R] Update code to match new linters on lintr=3.2.0 (#45556)
### Rationale for this change
CI jobs are failing with lintr >= 3.2.0
### What changes are included in this PR?
Remove some comments and update some lintr config to ensure compatibility with lintr package version 3.2.0
### Are these changes tested?
Nope
### Are there any user-facing changes?
Nope
* GitHub Issue: #45536
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <t...
GH-41816: [C++] Add Minimal Meson Build of libarrow (#45441)
### Rationale for this change
The Meson build system may be more user friendly to some developers, and may make it easier to perform tasks like running valgrind, collecting coverage, or building with ASAN/UBSAN. There is also prior art for using Meson in the nanoarrow and arrow-adbc projects.
### What changes are included in this PR?
This PR implements a Meson configuration that can build a minimal libarrow.
#...
GH-45505: [CI][R] Use Ubuntu 22.04 instead of 20.04 as much as possible for nightly jobs (#45507)
### Rationale for this change
Ubuntu 20.04 will reach EOL on 2025-05.
### What changes are included in this PR?
* Use Ubuntu 22.04 instead of Ubuntu 20.04 for Apache Arrow C++ 4.0.0 or later.
* Keep using Ubuntu 20.04 for Apache Arrow C++ 3.0.0 or earlier because we can't build Apache Arrow C++ 3.0.0 or earlier on Ubuntu 22.04. We can use pre-built binaries for Apache Arrow C++ 3.0.0 or earli...
GH-43519: [Python][CI] Update Python 3.13 rc to final 3.13.2 (#44375)
### Rationale for this change
The final Python 3.13.0 is out now, so we can update those versions
* GitHub Issue: #43519
Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
GH-45551: [C++][Acero] Release temp states of Swiss join building hash table to reduce memory consumption (#45552)
### Rationale for this change
#45551 describes the basic idea. Some profiling from real cases follows.
Take https://github.com/apache/arrow/blob/a53a77c93217399c4fda8c6328db2c492a30b0b0/cpp/src/arrow/acero/hash_join_node_test.cc#L3368 and print the memory pool stats at the end.
Before this change:
```
heap stats: peak total freed current unit count
reserved: ...
GH-45506: [C++][Acero] More overflow-safe Swiss table (#45515)
### Rationale for this change
See #45506.
### What changes are included in this PR?
1. Abstract current overflow-prone block data access into functions that do proper type promotion to avoid overflow. Also remove the old block base address accessor.
2. Unify the data types used for various concepts as they naturally are (i.e., w/o explicit promotion): `uint32_t` for `block_id`, `int` for `...
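The kind of overflow that item 1 guards against can be sketched in a few lines. The example below emulates fixed-width unsigned arithmetic with bit masks; the function names and the `bytes_per_block` parameter are hypothetical and chosen only for illustration, and the actual accessors in the Swiss table code may look quite different.

```python
UINT32_MASK = 0xFFFFFFFF
UINT64_MASK = 0xFFFFFFFFFFFFFFFF


def block_offset_32bit(block_id, bytes_per_block):
    """Emulate computing the byte offset entirely in 32-bit unsigned math.

    The product silently wraps around once it exceeds 2**32 - 1.
    """
    return (block_id * bytes_per_block) & UINT32_MASK


def block_offset_promoted(block_id, bytes_per_block):
    """Emulate promoting the operands to 64 bits before multiplying."""
    return (block_id * bytes_per_block) & UINT64_MASK


# A block id that fits in uint32_t, but whose byte offset does not.
block_id = 2**29
bytes_per_block = 8  # hypothetical block size in bytes

wrapped = block_offset_32bit(block_id, bytes_per_block)
promoted = block_offset_promoted(block_id, bytes_per_block)
```

Here `wrapped` is 0 while `promoted` is 4294967296, which is why routing every block access through a function that promotes before multiplying removes this class of bug.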
GH-45543: [Release][C#] Remove NuGet references in script (#45544)
### Rationale for this change
Fixes https://github.com/apache/arrow/issues/45543.
### What changes are included in this PR?
- Edited dev/release/post-03-binary.sh
### Are these changes tested?
No, but I feel okay with them.
### Are there any user-facing changes?
No.
* GitHub Issue: #45543
Authored-by: Bryce Mecum <petridish@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
GH-45541: [Doc][C++] Render ASCII art as-is (#45542)
### Rationale for this change
[union c_type](https://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow14BinaryViewType6c_typeE) in the BinaryViewType class describes the data layout using ASCII art. It was rendered as an unreadable layout.
```
- Entirely inlined string data |-—|———–—| ^ ^ | | size in-line string data, zero padded
- Reference into a buffer |-—|-—|-—|...
GH-45521: [CI][Dev][R] Install required cyclocomp package to be used with R lintr (#45524)
### Rationale for this change
The linting jobs are failing due to the new version of `lintr` not installing `cyclocomp` anymore.
We use `cyclocomp` but this is not part of the default linters of `lintr` anymore. We should install it individually.
### What changes are included in this PR?
Install `cyclocomp` as part of setting up the linting environment for R on our linting job.
Pin old versi...
GH-45508: [CI][R] Remove Ubuntu version from sanitizer jobs (#45509)
### Rationale for this change
`ubuntu-r-sanitizer` and `ubuntu-r-valgrind` use `wch1/r-debug` as their base image.
So we can't control Ubuntu versions for them.
### What changes are included in this PR?
Remove Ubuntu version from their configurations.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* GitHub Issue: #45508
Lead-authored-by: Sutou Kouhei <kou...
GH-45537: [CI][C++] Add missing includes (iwyu) to file_skyhook.cc (#45538)
### Rationale for this change
The job is failing because `file_skyhook.cc` relied on includes that it picked up through a different header file, `cpp/src/arrow/acero/options.h`. That header removed some of those includes, so they are now missing.
### What changes are included in this PR?
Add missing includes
### Are these changes tested?
Yes, via archery
### Are there any user-facing changes?
No
* GitHub Issue: #45537
A...
GH-44905: [Dev] Remove unused file with only header (#45526)
### Rationale for this change
There is a stray file that is not used anymore.
It was added here: https://github.com/apache/arrow/commit/e7e399db5fc6913e67426514279f81766a0778d2
and was used at: `java/format/pom.xml`
### What changes are included in this PR?
Remove the unused file.
### Are these changes tested?
No
### Are there any user-facing changes?
No
* GitHub Issue: #44905
Authored...
GH-45478: [CI][C++] Drop support for Ubuntu 20.04 (#45519)
### Rationale for this change
Ubuntu 20.04 will reach EOL on 2025-05.
### What changes are included in this PR?
Remove jobs that use Ubuntu 20.04 or replace them with Ubuntu 22.04.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* GitHub Issue: #45478
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
GH-45528: [GLib] garrow_data_type_new_raw() returns GARROW_TYPE_BINARY_VIEW_DATA_TYPE (#45529)
### Rationale for this change
#44656 introduced `GArrowBinaryViewDataType`.
One piece of work was missed: `garrow_data_type_new_raw()` must return `GARROW_TYPE_BINARY_VIEW_DATA_TYPE`.
### What changes are included in this PR?
`garrow_data_type_new_raw()` returns `GARROW_TYPE_BINARY_VIEW_DATA_TYPE`
if the input data type is `arrow::Type::type::BINARY_VIEW`.
### Are these ch...
GH-45514: [CI][C++][Docs] Set CUDAToolkit_ROOT explicitly in debian-docs (#45520)
### Rationale for this change
CMake's `FindCUDAToolkit.cmake` uses `/usr/lib/cuda/` as the default prefix but Debian's `nvidia-cuda-dev` uses `/usr` as prefix.
### What changes are included in this PR?
Set `CUDAToolkit_ROOT=/usr` explicitly.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* GitHub Issue: #45514
Authored-by: Sutou Kouhei <kou@clear-code.com>...
GH-33592: [C++] support casting nullable fields to non-nullable if there are no null values (#43782)
* GitHub Issue: #33592
Notes for myself/fixer:
- [tests that need to get updated](https://github.com/search?q=repo%3Aapache%2Farrow%20%22cannot%20cast%20nullable%20field%22&type=code) (almost definitely not a complete list)
- [update: actually we should handle the go implementation in the go repository.] hmm, looks like [go wrapper does its own nullability checks](https://github.com/apache/arr...
GH-45159: [CI][Integration] Remove substrait consumer-testing integration job (#45463)
### Rationale for this change
The job has been failing for the last two months due to upstream refactoring. I opened an issue upstream but we didn't get any response:
- https://github.com/substrait-io/consumer-testing/issues/196
Based on the following substrait consumer-testing PR there was a big refactor which removed the files that we were invoking on our tests:
- https://github.com/substra...
GH-45512: [C++] Clean up undefined symbols in libarrow without IPC (#45513)
### Rationale for this change
When building the Arrow library without IPC, the library ends up with undefined symbols to functions that are only available with ARROW_IPC=ON
### What changes are included in this PR?
Use the ARROW_IPC macro to detect if IPC is being used, and when not, return a NotImplementedError
### Are these changes tested?
Compiles cleanly and no longer shows undefined I...
GH-45517: [GLib] garrow_data_type_new_raw() returns GARROW_TYPE_STRING_VIEW_DATA_TYPE (#45518)
### Rationale for this change
#44686 introduced `GArrowStringViewDataType`.
One piece of work was missed: `garrow_data_type_new_raw()` must return `GARROW_TYPE_STRING_VIEW_DATA_TYPE`.
### What changes are included in this PR?
`garrow_data_type_new_raw()` returns `GARROW_TYPE_STRING_VIEW_DATA_TYPE`
if the input data type is `arrow::Type::type::STRING_VIEW`.
### Are these ...
GH-44950: [C++] Bump minimum CMake version to 3.25 (#44989)
### Rationale for this change
We want to upgrade our CMake version to 3.25 as discussed on the ML:
https://lists.apache.org/thread/h8jp16ktrj11fmjmjhlg6xvkvv9wzvjk
### What changes are included in this PR?
- Bump minimal CMake version to 3.25
- Manually install CMake on distributions where CMake < 3.25 was installed via package repositories
- Minor fixes to CI in order to have passing builds...
GH-45491: [GLib] Require Meson 0.61.2 or later (#45492)
### Rationale for this change
Ubuntu 20.04 that provides Meson 0.53.2 will reach EOL on 2025-05.
Ubuntu 22.04 provides Meson 0.61.2. So we can require Meson 0.61.2 or later.
### What changes are included in this PR?
* Require Meson 0.61.2 or later
* Remove code for old Meson
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* GitHub Issue: #45491
Authored-b...
GH-45497: [C++][CSV] Avoid buffer overflow when a line has too many columns (#45498)
### What changes are included in this PR?
1. Add guard against going past the buffer's end, while minimizing the performance overhead of the runtime check.
2. Add error propagation for buffer (re)allocation, instead of aborting. This is unrelated to the reported issue, but is desirable nevertheless.
With these changes, a CSV line with an unexpectedly large number of columns will raise an erro...
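The guard described in item 1 can be illustrated with a small sketch: a fixed-capacity buffer of field offsets that reports an error instead of writing past its end. This is not the Arrow CSV parser; the function name `field_end_offsets` and its `capacity` parameter are hypothetical, and quoting and escaping are ignored for brevity.

```python
def field_end_offsets(line, capacity, delimiter=","):
    """Record the end offset of every field of `line` in a fixed-size buffer.

    Raises ValueError instead of writing past the end of the buffer when the
    line has more fields than `capacity`.
    """
    offsets = [0] * capacity  # preallocated, never grown
    used = 0

    def push(offset):
        nonlocal used
        # Bounds check before every write: the guard against overflow.
        if used >= capacity:
            raise ValueError(
                f"line has more than the expected {capacity} columns"
            )
        offsets[used] = offset
        used += 1

    for position, character in enumerate(line):
        if character == delimiter:
            push(position)
    push(len(line))  # the last field ends at the end of the line
    return offsets[:used]
```

Calling `field_end_offsets("a,bb,ccc", 3)` fills all three slots, while the same line with `capacity=2` raises `ValueError` rather than corrupting memory or aborting, which mirrors the error-propagation behavior the PR describes.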
GH-45510: [CI][C++] Fix LLVM APT repository preparation on Debian (#45511)
### Rationale for this change
The existing LLVM APT repository preparation is broken. For example, it uses unavailable `${available_llvm}`.
### What changes are included in this PR?
* Use `.asc` for armored key
* Use deb822 format for APT source: https://manpages.debian.org/bookworm/dpkg-dev/deb822.5.en.html
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* ...
GH-45377: [CI][R] Ensure install R on ubuntu-24.04 runner for R nightly build jobs (#45464)
### Rationale for this change
Related to #45377.
Since the ubuntu-latest GHA runner recently changed from ubuntu-22.04 to ubuntu-24.04, the tools preinstalled on the runner appear to have changed, and jobs that assume R is already installed now fail.
With R installed explicitly, the job should now succeed.
### What changes are included in this PR?
### Are these changes test...
GH-45486: [GLib] Add `GArrowArrayStatistics` (#45490)
### Rationale for this change
GLib should be able to use `arrow::ArrayStatistics`.
### What changes are included in this PR?
Add `GArrowArrayStatistics` with minimal features.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #45486
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
GH-45266: [C++][Acero] Fix the running tasks count of Scheduler when get error tasks in multi-threads (#45268)
### Rationale for this change
When a TaskGroup has to be canceled, `TaskSchedulerImpl::Abort` counts the tasks that have not started yet as finished, so that they are never run. This operation happens while other threads are still active, so some tasks may start running at the same time, hit an error, and return a bad status.
But the tasks are running for Scheduler, they will ju...
GH-45499: [CI] Bump actions/cache version on GHA (#45500)
### Rationale for this change
Older versions of the `actions/cache` GitHub action are being deprecated as explained in https://github.com/actions/cache/discussions/1510.
Because of this, some CI jobs have started to fail: https://github.com/apache/arrow/actions/runs/13265539807/job/37034895918
### Are these changes tested?
Yes, by construction.
### Are there any user-facing changes?
No.
* ...
GH-37630: [C++][Python][Dataset] Allow disabling fragment metadata caching (#45330)
### Rationale for this change
Parquet file fragments currently cache their (Parquet) metadata for later accesses when scanning has finished.
This can produce surprisingly high memory consumption in cases where:
1. the dataset is only scanned once, rather than repeatedly (this is very common)
2. there is a high metadata-to-data ratio; this can happen when the schemas on disk are very wide, with...
GH-45301: [C++] Change PrimitiveArray ctor to protected (#45444)
### Rationale for this change
This patch handles the case in GH-45301, changing the ctor for PrimitiveArray to protected.
### What changes are included in this PR?
Change the ctor for PrimitiveArray to protected.
### Are these changes tested?
Yes
### Are there any user-facing changes?
This PR makes protected a constructor that was public. Calling this constructor outside of subclasses resu...
MINOR: [C#] Bump Microsoft.NET.Test.Sdk from 17.12.0 to 17.13.0 in /csharp (#45489)
Bumps [Microsoft.NET.Test.Sdk](https://github.com/microsoft/vstest) from 17.12.0 to 17.13.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/microsoft/vstest/releases">Microsoft.NET.Test.Sdk's releases</a>.</em></p>
<blockquote>
<h2>v17.13.0</h2>
<h2>What's Changed</h2>
<ul>
<li>
<p>Add letter number among valid identifiers in class name by <a href="ht...
MINOR: [C#] Bump xunit.runner.visualstudio from 3.0.1 to 3.0.2 in /csharp (#45488)
Bumps [xunit.runner.visualstudio](https://github.com/xunit/visualstudio.xunit) from 3.0.1 to 3.0.2.
<details>
<summary>Commits</summary>
<ul>
<li><a href="https://github.com/xunit/visualstudio.xunit/commit/dd36e86129dcb108d86eb3650eba5fae5fc4c60a"><code>dd36e86</code></a> v3.0.2</li>
<li><a href="https://github.com/xunit/visualstudio.xunit/commit/b67d776b63cff86a5455df86553212fd494329dc"><code>...
GH-44629: [C++][Acero] Use `implicit_ordering` for `asof_join` rather than `require_sequenced_output` (#44616)
### Rationale for this change
Changes in #44083 (GH-41706) unnecessarily sequence the batches retrieved from the scanner, when all that is required is for the batches to carry an index that follows the implicit input order.
### What changes are included in this PR?
Setting `implicit_ordering` causes existing code to set batch index, which is then available to the `asof_join` node to sequence the batches int input or...
GH-45295: [Python][CI] Make download_tzdata_on_windows more robust and use tzdata package for tzinfo database on Windows for ORC (#45425)
### Rationale for this change
We have two Windows issues and this PR is addressing both:
1. PyArrow's `download_tzdata_on_windows` can fail due to TLS issues in certain CI environments.
2. The Python wheel test infrastructure needs a tzinfo database for ORC and the automation fetching that started failing because the URL was made invalid upstream.
These two issues are being solved in one PR ...
GH-45389: [CI][R] Use Ubuntu 22.04 for test-r-versions (#45475)
### Rationale for this change
Ubuntu 20.04 will reach EOL on 2025-05.
### What changes are included in this PR?
Use Ubuntu 22.04 instead of 20.04.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* GitHub Issue: #45389
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>