Commits

Andy Grove authored 1ddf7213b88
ARROW-11058: [Rust] [DataFusion] Implement coalesce batches operator This PR introduces a new `CoalesceBatchesExec` physical operator which combines small input batches and produces larger output batches. The physical optimizer inserts this operator around filters because highly selective filters can produce lots of small batches and this causes poor performance in some cases (especially joins) because we lose some of the benefits of vectorization if we have batches with single rows for example. For TPC-H q12 at SF=100 and 8 partitions, this provides the following speedups on my desktop: | Batch Size | Master | This PR | | --- | --- | --- | | 8192 | 617.5 s | 70.7 s | | 16384 | 183.1 s | 46.4 s | | 32768 | 59.4 s | 33.3 s | | 65536 | 27.5 s | 20.7 s | | 131072 | 18.4 s | 18.5 s | Note that the new `CoalesceBatchesExec` uses `MutableArrayData` which still suffers from some kind of exponential slow down as the number of batches increases, so we should be able to optimize this further, but at least we're using `MutableArrayData` to combine smaller numbers of batches now. Even if we fix the slowdown in `MutableArrayData`, we would still want `CoalesceBatchesExec` to help avoid empty/tiny batches for other reasons. Closes #9043 from andygrove/coalesce_batches Authored-by: Andy Grove <andygrove73@gmail.com> Signed-off-by: Andy Grove <andygrove73@gmail.com>