Neal Richardson authored and GitHub committed 2ef4059566e on 30 Apr 2024
GH-29537: [R] Support mutate/summarize with implicit join (#41350)

### Rationale for this change

Since it doesn't look like Acero will be getting window functions any
time soon, implement support in `mutate()` for transformations that
involve aggregations, like `x - mean(x)`, via an implicit `left_join()`.

### What changes are included in this PR?

Following #41223, I realized I could reuse that evaluation path in
`mutate()`. Evaluating expressions accumulates `..aggregations` and
`mutate_stuff`; in `summarize()` we apply the aggregations and then
mutate on the result. If expressions in `mutate_stuff` reference columns
in the original data and not just the results of aggregations, we reject
them.

Here, in `mutate()`, if there are aggregations, we apply them on a copy
of the query up to that point, join the result back onto the query, and
then apply the mutations on that. It's not a problem for those mutate
expressions to reference both columns in the original data and the
results of the aggregations, because both are present after the join.
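
As a rough illustration of the aggregate, join, then mutate sequence described above, here is a sketch written by hand in plain dplyr on an in-memory data frame. It is not the code in this PR: the intermediate column names `mean_mpg` and `sd_mpg` are hypothetical, and the real implementation works on arrow queries rather than data frames.

``` r
library(dplyr)

dat <- mtcars |>
  select(cyl, mpg, hp)

# Step 1: apply the aggregations on a copy of the query up to this point.
# `mean_mpg` and `sd_mpg` are hypothetical names chosen for this sketch.
aggs <- dat |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg), sd_mpg = sd(mpg))

# Step 2: join the aggregated result back onto the original rows.
# Step 3: apply the mutation, which can now see both the original
# columns and the aggregation results, then drop the helper columns.
manual <- dat |>
  left_join(aggs, by = "cyl") |>
  mutate(stdize_mpg = (mpg - mean_mpg) / sd_mpg) |>
  select(-mean_mpg, -sd_mpg)

# For comparison: the grouped mutate that dplyr evaluates directly.
direct <- dat |>
  group_by(cyl) |>
  mutate(stdize_mpg = (mpg - mean(mpg)) / sd(mpg)) |>
  ungroup()

# In-memory dplyr joins keep the row order of the left table,
# so the two results can be compared position by position.
stopifnot(isTRUE(all.equal(manual$stdize_mpg, direct$stdize_mpg)))
```

The grouping columns serve as the join keys, which is why the join brings exactly one set of aggregate values to each original row.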

There are ~three~ two caveats:

* Join has non-deterministic order, so while `mutate()` doesn't
generally affect row order, if this code path is activated, row order
may not be stable. With datasets, it's not guaranteed anyway.
* ~Acero's join seems to have a limitation currently where missing
values are not joined to each other. If your join key has NA in it, and
you do a left_join, your new columns will all be NA, even if there is a
corresponding value in the right dataset. I made
https://github.com/apache/arrow/issues/41358 to address that, and in the
meantime, I've added a workaround
(https://github.com/apache/arrow/pull/41350/commits/b9de50452e926fe5f39aeb3887a04e203302b960)
that's not awesome but has the right behavior.~ Fixed and rebased.
* I believe it is possible in dplyr to get this behavior in other verbs:
filter, arrange, even summarize. I've only done this for mutate. Are we
ok with that?
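
For the row-order caveat above, one possible workaround is to carry an explicit row index through the join and sort on it afterwards. The sketch below uses plain dplyr and simulates the unstable order with a shuffle, since in-memory joins do keep order. The `row_id` column is a hypothetical helper, and applying the same `arrange()` to an arrow query before `collect()` is an assumption, not something this PR states.

``` r
library(dplyr)

set.seed(1)  # only makes the simulated shuffle reproducible

dat <- mtcars |>
  select(cyl, mpg, hp) |>
  mutate(row_id = row_number())  # hypothetical helper column

aggs <- dat |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg))

joined <- dat |>
  left_join(aggs, by = "cyl") |>
  mutate(centered_mpg = mpg - mean_mpg)

# Stand-in for a join that returns rows in an arbitrary order.
shuffled <- joined |>
  slice_sample(prop = 1)

# Sorting on the helper column recovers the original order.
restored <- shuffled |>
  arrange(row_id) |>
  select(-row_id, -mean_mpg)

stopifnot(identical(restored$mpg, mtcars$mpg))
```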

### Are these changes tested?

Yes

### Are there any user-facing changes?

This works now:

``` r
library(arrow)
library(dplyr)

mtcars |>
  arrow_table() |>
  select(cyl, mpg, hp) |>
  group_by(cyl) |>
  mutate(stdize_mpg = (mpg - mean(mpg)) / sd(mpg)) |>
  collect()
#> # A tibble: 32 × 4
#> # Groups:   cyl [3]
#>      cyl   mpg    hp stdize_mpg
#>    <dbl> <dbl> <dbl>      <dbl>
#>  1     6  21     110      0.865
#>  2     6  21     110      0.865
#>  3     4  22.8    93     -0.857
#>  4     6  21.4   110      1.14 
#>  5     8  18.7   175      1.41 
#>  6     6  18.1   105     -1.13 
#>  7     8  14.3   245     -0.312
#>  8     4  24.4    62     -0.502
#>  9     4  22.8    95     -0.857
#> 10     6  19.2   123     -0.373
#> # ℹ 22 more rows
```

<sup>Created on 2024-04-23 with [reprex
v2.1.0](https://reprex.tidyverse.org)</sup>

* GitHub Issue: #29537