Rossi Sun authored and GitHub committed 3b7ad9d4e94 on 10 Jul 2024

GH-43129: [C++][Compute] Fix the unnecessary allocation of extra bytes when encoding row table (#43125)

### Rationale for this change

As described in #43129, the current row table occupies more memory than expected: its memory consumption is double what is necessary. The reason is explained below.

When encoding var-length columns into the row table:
https://github.com/apache/arrow/blob/e59832fb05dc40a85fa63297c77c8f134c9ac8e0/cpp/src/arrow/compute/row/encode_internal.cc#L155-L162

We first call `AppendEmpty` to reserve space for `x` rows but `0` bytes. This is to reserve enough size for the underlying fixed-length buffers: null masks and offsets (for var-length columns).

Then we call `GetRowOffsetsSelected` to populate the offsets.

Finally, we call `AppendEmpty` again with `0` rows but `y` bytes, where `y` is the last offset element, which is essentially the total size of the var-length columns.

All of this sounds reasonable so far.

However, `AppendEmpty` calls `ResizeOptionalVaryingLengthBuffer`, in which:
https://github.com/apache/arrow/blob/e59832fb05dc40a85fa63297c77c8f134c9ac8e0/cpp/src/arrow/compute/row/row_internal.cc#L294-L303

We calculate `bytes_capacity_new` by repeatedly doubling it until it is big enough for `num_bytes + num_extra_bytes`.

Note that at this point `num_bytes == offsets()[num_rows_]` is already `y`, while `num_extra_bytes` is also `y`, so the buffer ends up twice as large as necessary.
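
The double counting can be illustrated with a small standalone sketch. This is not Arrow's code: `GrowCapacity`, `CapacityDoubleCounted`, and `CapacityCountedOnce` are hypothetical helpers that only model the "double until it fits" rule described above, and the second variant shows one possible way to avoid counting `y` twice, not necessarily what this PR does.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the growth rule: keep doubling the capacity
// until it can hold `required` bytes. Not Arrow's actual implementation.
int64_t GrowCapacity(int64_t capacity, int64_t required) {
  assert(capacity > 0 && required >= 0);
  while (capacity < required) {
    capacity *= 2;
  }
  return capacity;
}

// Behavior as described above: the offsets are already populated, so
// num_bytes is already y, and num_extra_bytes adds y a second time.
int64_t CapacityDoubleCounted(int64_t capacity, int64_t num_bytes,
                              int64_t num_extra_bytes) {
  return GrowCapacity(capacity, num_bytes + num_extra_bytes);
}

// Assumed alternative for illustration: size the buffer for the bytes
// that the last offset already accounts for, counted once.
int64_t CapacityCountedOnce(int64_t capacity, int64_t num_bytes) {
  return GrowCapacity(capacity, num_bytes);
}
```

With an assumed initial capacity of 8 bytes and `y = 1000`, the first variant must fit 2000 bytes and the second only 1000, so the double-counted capacity comes out at twice the size of the other.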

### What changes are included in this PR?

Fix the over-allocation that left half of each row table buffer unused. Also add tests to make sure the buffer size is as expected.

### Are these changes tested?

Yes, unit tests are included.

### Are there any user-facing changes?

None.

* GitHub Issue: #43129

Authored-by: Ruoxi Sun <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>