Commits


Rossi Sun authored and GitHub committed 3b7ad9d4e94
GH-43129: [C++][Compute] Fix the unnecessary allocation of extra bytes when encoding row table (#43125)

### Rationale for this change

As described in #43129, the current row table occupies more memory than expected: memory consumption is double what is necessary. The reason is described below.

When encoding var-length columns into the row table:
https://github.com/apache/arrow/blob/e59832fb05dc40a85fa63297c77c8f134c9ac8e0/cpp/src/arrow/compute/row/encode_internal.cc#L155-L162

We first call `AppendEmpty` to reserve space for `x` rows but `0` bytes. This reserves enough space for the underlying fixed-length buffers: null masks and offsets (for var-length columns). Then we call `GetRowOffsetsSelected` to populate the offsets. Finally, we call `AppendEmpty` again with `0` rows but `y` bytes, where `y` is the last offset element, which is essentially the total size of the var-length columns. All of this sounds reasonable so far.

However, `AppendEmpty` calls `ResizeOptionalVaryingLengthBuffer`, in which:
https://github.com/apache/arrow/blob/e59832fb05dc40a85fa63297c77c8f134c9ac8e0/cpp/src/arrow/compute/row/row_internal.cc#L294-L303

We calculate `bytes_capacity_new` by repeatedly doubling it until it is big enough for `num_bytes + num_extra_bytes`. Note that at this point `num_bytes == offsets()[num_rows_]` is already `y`, while `num_extra_bytes` is also `y`, so the buffer ends up twice as large as necessary.

### What changes are included in this PR?

Fix the wasted half of the buffer size in the row table. Also add tests to make sure the buffer size is as expected.

### Are these changes tested?

Unit tests are included.

### Are there any user-facing changes?

None.

* GitHub Issue: #43129

Authored-by: Ruoxi Sun <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
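
The double counting described in the rationale can be illustrated with a small standalone sketch. This is not the Arrow code and not the fix applied in this PR: `GrowCapacity`, `CapacityWithDoubleCounting`, and `CapacityIntended` are hypothetical names, and the starting point of the doubling loop (`max(1, 2 * capacity)`) and the early return are assumptions made for illustration. Only the "double until `num_bytes + num_extra_bytes` fits" rule comes from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the growth rule described above: keep doubling
// the capacity until it can hold num_bytes + num_extra_bytes.
// The starting value max(1, 2 * capacity) is an assumption of this sketch.
int64_t GrowCapacity(int64_t capacity, int64_t num_bytes, int64_t num_extra_bytes) {
  assert(num_bytes >= 0 && num_extra_bytes >= 0);
  const int64_t required = num_bytes + num_extra_bytes;
  if (capacity >= required) return capacity;  // already large enough
  int64_t new_capacity = std::max<int64_t>(1, 2 * capacity);
  while (new_capacity < required) new_capacity *= 2;
  return new_capacity;
}

// The situation in the rationale: the offsets are already populated, so
// num_bytes is y, and the caller asks for y extra bytes on top of that.
int64_t CapacityWithDoubleCounting(int64_t y) { return GrowCapacity(0, y, y); }

// What the caller actually needs: room for y bytes in total.
int64_t CapacityIntended(int64_t y) { return GrowCapacity(0, y, 0); }
```

Under these assumptions, with `y = 1000` the double-counted request yields a capacity of 2048 bytes, while the intended request yields 1024 bytes.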