eitsupi authored and GitHub committed 6bd00508116 on 18 May 2023
GH-35445: [R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr (#35473)

### Rationale for this change

The argument `.cols` of the `dplyr::across` function has the following description.

> You can't select grouping columns because they are already automatically handled by the verb (i.e. summarise() or mutate()).

However, the `arrow` package does not currently reproduce this behavior: selecting the grouping column with `everything()` raises an error.

``` r
mtcars |>
  arrow::as_arrow_table() |>
  dplyr::group_by(cyl) |>
  dplyr::summarise(dplyr::across(everything(), sum)) |>
  dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! Invalid: Multiple matches for FieldRef.Name(cyl) in mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> cyl: double
#> Backtrace:
#>     ▆
#>  1. ├─dplyr::collect(...)
#>  2. └─arrow:::collect.arrow_dplyr_query(...)
#>  3.   └─arrow:::compute.arrow_dplyr_query(x)
#>  4.     └─base::tryCatch(...)
#>  5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  6.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  7.           └─value[[3L]](cond)
#>  8.             └─arrow:::augment_io_error_msg(e, call, schema = schema())
#>  9.               └─rlang::abort(msg, call = call)
```

<sup>Created on 2023-05-05 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

This PR changes `arrow` to match dplyr's behavior.
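For reference, the same query can be run against a plain data frame to see the dplyr behavior being targeted. This is only a sketch using dplyr on the built-in `mtcars` data; the variable name `ref` is introduced here for illustration and is not part of the PR.

```r
library(dplyr)

# Plain dplyr: across(everything()) leaves out the grouping column `cyl`,
# so `cyl` appears only once in the result, as the group key.
ref <- mtcars |>
  group_by(cyl) |>
  summarise(across(everything(), sum))

# Inspect the resulting column names.
names(ref)
```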

### What changes are included in this PR?

- Automatically exclude grouping columns from `across` in `mutate`, `transmute`, and `summarise`.
- The `.data` argument of the internal function `expand_across` is now expected to be an `arrow_dplyr_query`.
  Some tests have been slightly modified to accommodate this change.
- `mutate`, `transmute`, `arrange`, and `filter` now always return an `arrow_dplyr_query`.
  Currently, an `arrow_dplyr_query` is not returned in cases such as the following, which is inconsistent.
  ```r
  mtcars |> arrow::arrow_table() |> dplyr::mutate()
  ```
- Correct the order of columns in the results of `group_by(foo) |> mutate(.keep = "none")`.
  Currently, the following query moves the grouping columns to the end of the result, which differs from dplyr.
  ```r
  mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::mutate(am, .keep = "none") |> dplyr::collect()
  ```
- Correct the order of columns in the results of `group_by(foo) |> transmute()`.
  Currently, the following query moves the grouping columns to the end of the result, which differs from dplyr.
  ```r
  mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::transmute(mpg) |> dplyr::collect()
  ```
  After `transmute`, the group columns should move to the left. (This differs from `mutate(.keep = "none")`, which keeps the columns in their original positions.)
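
The column-order difference between `transmute` and `mutate(.keep = "none")` can be checked in plain dplyr. The sketch below assumes a recent dplyr release in which `mutate(.keep = "none")` preserves original column positions; the names `grouped`, `by_transmute`, and `by_mutate` are illustrative only. It uses `mpg`, which precedes `cyl` in `mtcars`, so the two orderings are distinguishable.

```r
library(dplyr)

grouped <- group_by(mtcars, cyl)

# transmute(): grouping columns are moved to the front of the result.
by_transmute <- transmute(grouped, mpg)
names(by_transmute)

# mutate(.keep = "none"): only the grouping and computed columns are kept,
# in the positions they had in the input (assumes a recent dplyr release).
by_mutate <- mutate(grouped, mpg, .keep = "none")
names(by_mutate)
```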

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* Closes: #35445

Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>