Commit 4d931ff1c0f — authored by Wes McKinney, committed via GitHub
ARROW-16852: [C++] Migrate remaining kernels to use ExecSpan, remove ExecBatchIterator (#13630)

This completes the porting to use ExecSpan everywhere. I also changed the ExecBatchIterator benchmarks to use ExecSpan to show the performance improvement in input splitting that we've talked about in the past.

Splitting inputs into small ExecSpan:

```
------------------------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------
BM_ExecSpanIterator/1024    205671 ns       205667 ns         3395 items_per_second=4.86223k/s
BM_ExecSpanIterator/4096     54749 ns        54750 ns        13121 items_per_second=18.265k/s
BM_ExecSpanIterator/16384    15979 ns        15979 ns        42628 items_per_second=62.5824k/s
BM_ExecSpanIterator/65536     5597 ns         5597 ns       125099 items_per_second=178.668k/s
```

Splitting inputs into small ExecBatch:

```
-------------------------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------
BM_ExecBatchIterator/1024  17163432 ns     17163171 ns           41 items_per_second=58.2643/s
BM_ExecBatchIterator/4096   4243467 ns      4243316 ns          163 items_per_second=235.665/s
BM_ExecBatchIterator/16384  1093680 ns      1093638 ns          620 items_per_second=914.38/s
BM_ExecBatchIterator/65536   272451 ns       272435 ns         2584 items_per_second=3.6706/s
```

Because the input in this benchmark has 1M elements, this shows that splitting into 1024 chunks of size 1024 adds only ~0.2 ms of overhead with ExecSpanIterator versus 17.16 ms of overhead with ExecBatchIterator (> 80x improvement).
This won't by itself do much to impact performance in Acero, but the work I've been doing here is a precondition for several things the community can explore in the future:

* A leaner ExecuteScalarExpression implementation that reuses temporary allocations (ARROW-16758)
* Parallel expression evaluation
* Better defining morsel (~1M elements) versus task (~1K elements) granularity in execution
* Work stealing, so that we don't "hog" the thread pools and we keep the work pinned to a particular CPU core if there are other things going on at the same time

Authored-by: Wes McKinney <wesm@apache.org>
Signed-off-by: Wes McKinney <wesm@apache.org>