Commits


Neal Richardson authored and GitHub committed 2ef4059566e
GH-29537: [R] Support mutate/summarize with implicit join (#41350) ### Rationale for this change Since it doesn't look like Acero will be getting window functions any time soon, implement support in `mutate()` for transformations that involve aggregations, like `x - mean(x)`, via left_join. ### What changes are included in this PR? Following #41223, I realized I could reuse that evaluation path in `mutate()`. Evaluating expressions accumulates `..aggregations` and `mutate_stuff`; in summarize() we apply aggregations and then mutate on the result. If expressions in the `mutate_stuff` reference columns in the original data and not just the result of aggregations, we reject it. Here, if there are aggregations, we apply them on a copy of the query up to that point, and join the result back onto the query, then apply the mutations on that. It's not a problem for those mutate expressions to reference both columns in the original data and the results of the aggregations because both are present. There are ~three~ two caveats: * Join has non-deterministic order, so while `mutate()` doesn't generally affect row order, if this code path is activated, row order may not be stable. With datasets, it's not guaranteed anyway. * ~Acero's join seems to have a limitation currently where missing values are not joined to each other. If your join key has NA in it, and you do a left_join, your new columns will all be NA, even if there is a corresponding value in the right dataset. I made https://github.com/apache/arrow/issues/41358 to address that, and in the meantime, I've added a workaround (https://github.com/apache/arrow/pull/41350/commits/b9de50452e926fe5f39aeb3887a04e203302b960) that's not awesome but has the right behavior.~ Fixed and rebased. * I believe it is possible in dplyr to get this behavior in other verbs: filter, arrange, even summarize. I've only done this for mutate. Are we ok with that? ### Are these changes tested? Yes ### Are there any user-facing changes? This works now: ``` r library(arrow) library(dplyr) mtcars |> arrow_table() |> select(cyl, mpg, hp) |> group_by(cyl) |> mutate(stdize_mpg = (mpg - mean(mpg)) / sd(mpg)) |> collect() #> # A tibble: 32 × 4 #> # Groups: cyl [3] #> cyl mpg hp stdize_mpg #> <dbl> <dbl> <dbl> <dbl> #> 1 6 21 110 0.865 #> 2 6 21 110 0.865 #> 3 4 22.8 93 -0.857 #> 4 6 21.4 110 1.14 #> 5 8 18.7 175 1.41 #> 6 6 18.1 105 -1.13 #> 7 8 14.3 245 -0.312 #> 8 4 24.4 62 -0.502 #> 9 4 22.8 95 -0.857 #> 10 6 19.2 123 -0.373 #> # ℹ 22 more rows ``` <sup>Created on 2024-04-23 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup> * GitHub Issue: #29537