Commits


Rossi Sun authored and GitHub committed 5e68513d62b
GH-43495: [C++][Compute] Widen the row offset of the row table to 64-bit (#43389)

### Rationale for this change

The row table uses `uint32_t` as the row offset within the row data buffer, which prevents the row data from growing beyond 4GB. This is quite restrictive, and the impact is described in more detail in #43495. This PR proposes to widen the row offset from 32-bit to 64-bit to address this limitation.

#### Benefits

Currently, the row table has three major limitations:

1. The overall data size cannot exceed 4GB.
2. The size of a single row cannot exceed 4GB.
3. The number of rows cannot exceed 2^32.

This enhancement eliminates the first limitation. The second and third limitations are much less likely to be hit in practice. Thus, this change enables a significant range of use cases that are currently unsupported.

#### Overhead

Of course, this introduces some overhead:

1. An extra 4 bytes of memory consumption for each row, because the offset grows from 32-bit to 64-bit.
2. A wider offset type requires a few more SIMD instructions in each 8-row processing iteration.

In my opinion, this overhead is justified by the benefits listed above.

### What changes are included in this PR?

Change the row offset of the row table from 32-bit to 64-bit. Related code in row comparison/encoding and swiss join has been updated accordingly.

### Are these changes tested?

Tests are included.

### Are there any user-facing changes?

Users could potentially see higher memory consumption when using Acero's hash join and hash aggregation. On the other hand, certain use cases that used to fail are now able to complete.

* GitHub Issue: #43495

Authored-by: Ruoxi Sun <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
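
To illustrate the first limitation and the per-row overhead described above, here is a minimal sketch. It is not Arrow's actual row table code: `BuildRowOffsets`, `RowStart`, and `ExtraOffsetBytes` are hypothetical helpers invented for this example, and the offsets layout (one entry per row plus a trailing total) is an assumption. Only row lengths are accumulated, so no large buffer is allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not Arrow's API): builds an offsets array for
// variable-length rows. offsets[i] is the byte position where row i starts,
// and offsets.back() is the total size of the row data.
template <typename OffsetT>
std::vector<OffsetT> BuildRowOffsets(const std::vector<uint32_t>& row_lengths) {
  std::vector<OffsetT> offsets;
  offsets.reserve(row_lengths.size() + 1);
  OffsetT current = 0;
  offsets.push_back(current);
  for (uint32_t length : row_lengths) {
    // Unsigned arithmetic wraps modulo 2^N, so a 32-bit offset silently
    // restarts near zero once the accumulated size passes 4 GiB.
    current = static_cast<OffsetT>(current + length);
    offsets.push_back(current);
  }
  return offsets;
}

// Hypothetical accessor: returns the starting byte position of a row.
template <typename OffsetT>
OffsetT RowStart(const std::vector<OffsetT>& offsets, size_t row) {
  assert(row < offsets.size());
  return offsets[row];
}

// Extra memory cost of widening each row's offset from 32-bit to 64-bit:
// 4 more bytes per row.
constexpr uint64_t ExtraOffsetBytes(uint64_t num_rows) {
  return num_rows * (sizeof(uint64_t) - sizeof(uint32_t));
}
```

With three rows of 2 GiB each, `BuildRowOffsets<uint32_t>` places the third row at offset 0, colliding with the first row, whereas `BuildRowOffsets<uint64_t>` places it at 2^32 as expected. The price is `ExtraOffsetBytes(n)` additional bytes for `n` rows.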