Commit 4d931ff1c0f — authored by Wes McKinney, committed via GitHub
ARROW-16852: [C++] Migrate remaining kernels to use ExecSpan, remove ExecBatchIterator (#13630)

This completes the porting to use ExecSpan everywhere. I also changed the ExecBatchIterator benchmarks to use ExecSpan to show the performance improvement in input splitting that we've talked about in the past.

Splitting inputs into small ExecSpan:

```
------------------------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------
BM_ExecSpanIterator/1024    205671 ns       205667 ns         3395 items_per_second=4.86223k/s
BM_ExecSpanIterator/4096     54749 ns        54750 ns        13121 items_per_second=18.265k/s
BM_ExecSpanIterator/16384    15979 ns        15979 ns        42628 items_per_second=62.5824k/s
BM_ExecSpanIterator/65536     5597 ns         5597 ns       125099 items_per_second=178.668k/s
```

Splitting inputs into small ExecBatch:

```
-------------------------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------
BM_ExecBatchIterator/1024  17163432 ns     17163171 ns           41 items_per_second=58.2643/s
BM_ExecBatchIterator/4096   4243467 ns      4243316 ns          163 items_per_second=235.665/s
BM_ExecBatchIterator/16384  1093680 ns      1093638 ns          620 items_per_second=914.38/s
BM_ExecBatchIterator/65536   272451 ns       272435 ns         2584 items_per_second=3.6706/s
```

Because the input in this benchmark has 1M elements, this shows that splitting into 1024 chunks of size 1024 adds only ~0.2 ms of overhead with ExecSpanIterator versus 17.16 ms of overhead with ExecBatchIterator (> 80x improvement).
This won't by itself do much to impact performance in Acero, but the work I've been doing here is a precondition for several things the community can explore in the future:

* A leaner ExecuteScalarExpression implementation that reuses temporary allocations (ARROW-16758)
* Parallel expression evaluation
* Better defining morsel (~1M elements) versus task (~1K elements) granularity in execution
* Work stealing, so that we don't "hog" the thread pools and we keep the work pinned to a particular CPU core if there are other things going on at the same time

Authored-by: Wes McKinney <wesm@apache.org>
Signed-off-by: Wes McKinney <wesm@apache.org>