Public / arrow / e067dea9a98

Commits

Jorge C. Leitao authored and Andy Grove committed e067dea9a9821 Sep 2020
ARROW-9937: [Rust] [DataFusion] Improved aggregations

This PR is a proposal to fix 4 issues in our aggregations:

1. averages are incorrect
2. they only support aggregations that can be reduced using a single value (reason for issue 1.)
3. they do not leverage arrow’s aggregate kernels nor memory layout
4. they only support a single column

The proposal is written here: https://docs.google.com/document/d/1n-GS103ih3QIeQMbf_zyDStjUmryRQd45ypgk884LHU/edit?usp=sharing

Its main features:

1. adds a test of a wrong average and fixes it
1. makes `ScalarValue` a nullable dynamic type, which is closer to how `Array` works (it is nullable)
2. Accumulators now know how to be updated from values (partial) and from other accumulators' state (final)
3. Accumulators can now receive more than one column
4. AggregateExec now knows how to serialize aggregators' state into a ArrayRef's, so that they can be passed throughout the execution
5. Aggregations are now always made in two steps: partial (update from values) and full (update from other's states)
6. Aggregations leverage arrow's kernels as much as possible (all aggregates + take + concatenate)

This PR is built on top of 3 PRs that are under review, and thus is only a draft at this point.

The benchmarks are between -30% and +15% on my computer (AFAIK no SIMD). Given that the computation now always requires two passes, I was sufficiently happy with them. More can be achieved later.

I am still evaluating the reason for the `aggregate_query_group_by`, but given the functionality that it adds, I considered it sufficiently good for some initial discussions, @andygrove , @nevi-me , @alamb @paddyhoran .

The benchmarks were updated to better reflect real data, and the results are as follows:

```
aggregate_query_no_group_by 15 12
                        time:   [478.23 us 479.62 us 480.98 us]
                        change: [-29.686% -27.511% -25.784%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

aggregate_query_group_by 15 12
                        time:   [2.8689 ms 2.8794 ms 2.8922 ms]
                        change: [+12.971% +13.710% +14.445%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

aggregate_query_group_by_with_filter 15 12
                        time:   [2.1803 ms 2.2062 ms 2.2330 ms]
                        change: [-8.2400% -6.7872% -5.3209%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```

Sorry for the long PR, but this was a relatively difficult PR to achieve, as it required refactoring of some of our most delicate components. I will try to split it in smaller parts to each the review.

Closes #8172 from jorgecarleitao/aggregates2

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>