Heres, Daniel authored and Jorge C. Leitao committed 2696951338c on 28 Jan 2021
ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Based on https://github.com/apache/arrow/pull/9234, this PR improves the situation described in https://issues.apache.org/jira/browse/ARROW-11300.

Currently we call `take` on arrays, which is fine, but it causes a lot of small arrays to be created and allocated when each group has only a small number of rows.

This improves the results on the group by queries on db-benchmark:

PR:
```
q1 took 32 ms
q2 took 422 ms
q3 took 3468 ms
q4 took 44 ms
q5 took 3166 ms
q7 took 3081 ms
```

https://github.com/apache/arrow/pull/9234 (the results differ from that PR's description because partitioning and a custom allocator are now enabled):

```
q1 took 34 ms
q2 took 389 ms
q3 took 4590 ms
q4 took 47 ms
q5 took 5152 ms
q7 took 3941 ms
```

The PR changes the algorithm to:

* Create the indices and offsets for all keys that appear in the batch.
* `take` the arrays based on the indices in one go (so it only requires one bigger allocation for each array).
* Use `slice` based on the offsets to take each group's values from the arrays and pass them to the accumulators.
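
The steps above can be sketched roughly as follows. This is a minimal illustration in plain Rust, not the DataFusion code: `Vec`s and slices stand in for Arrow arrays, the `take` helper only mimics what the Arrow `take` kernel does, `grouped_sums` is a hypothetical name, and a simple sum plays the role of the accumulator. The exact layout of the indices and offsets is an assumption based on the description above.

```rust
use std::collections::HashMap;

/// Gather `values[i]` for every `i` in `indices` into one new buffer.
/// Plain-Vec stand-in for Arrow's `take` kernel (assumption, not the real API).
fn take(values: &[i64], indices: &[usize]) -> Vec<i64> {
    indices.iter().map(|&i| values[i]).collect()
}

/// Sum `values` per key with a single gather plus one slice per group.
/// Groups are returned in the order their keys first appear in the batch.
fn grouped_sums(keys: &[u32], values: &[i64]) -> Vec<(u32, i64)> {
    // Step 1a: collect the row indices of each key, remembering key order.
    let mut rows_by_key: HashMap<u32, Vec<usize>> = HashMap::new();
    let mut key_order: Vec<u32> = Vec::new();
    for (row, &key) in keys.iter().enumerate() {
        rows_by_key
            .entry(key)
            .or_insert_with(|| {
                key_order.push(key);
                Vec::new()
            })
            .push(row);
    }

    // Step 1b: flatten into one index list plus offsets marking group bounds.
    let mut indices: Vec<usize> = Vec::with_capacity(keys.len());
    let mut offsets: Vec<usize> = vec![0];
    for key in &key_order {
        indices.extend_from_slice(&rows_by_key[key]);
        offsets.push(indices.len());
    }

    // Step 2: one gather, hence one allocation, for the whole batch.
    let taken = take(values, &indices);

    // Step 3: each group is a contiguous slice of `taken`; feed it to the
    // "accumulator" (here just a sum) without allocating per group.
    key_order
        .iter()
        .zip(offsets.windows(2))
        .map(|(&key, w)| (key, taken[w[0]..w[1]].iter().sum::<i64>()))
        .collect()
}
```

For example, calling `grouped_sums(&[1, 2, 1, 3, 2], &[10, 20, 30, 40, 50])` gathers the values into `[10, 30, 20, 50, 40]` with offsets `[0, 2, 4, 5]`, so the per-group work is only slicing. The number of allocations for the gathered values no longer grows with the number of groups.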

Closes #9271 from Dandandan/hash_agg_few_rows

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>