Neal Richardson authored f9bde6d68b6 on 29 Oct 2019
ARROW-6980: [R] dplyr backend for RecordBatch/Table

The [basic single-table verbs](https://dplyr.tidyverse.org/reference/index.html#section-basic-single-table-verbs) are all implemented:

* `select`/`rename` and `filter` do most of their work in Arrow, and that work is usually deferred until `collect` is called. Renaming is handled by tracking the changes to column names and applying them to the resulting `data.frame`.
* `collect` returns a `data.frame`.
* `summarize` figures out which columns it needs, calls `select` with those columns, then `collect`, and finally `summarize` on the resulting `data.frame` in R.
* `group_by` records the grouping as annotations and passes them along to the `data.frame` when it is produced by `collect`.
* `mutate` calls `collect`, then `mutate` on the `data.frame`. `transmute` just works because its default method calls `mutate`, though we could optimize it to pull less data out of Arrow.
* `arrange` calls `collect` first.
* `pull` calls `collect` to retrieve the indicated column.
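
As a rough illustration of the verbs listed above, here is a minimal usage sketch. It assumes the `arrow` and `dplyr` packages are installed; the column names and data are made up for the example.

```r
library(arrow)
library(dplyr)

# Example data (invented for this sketch), converted to an Arrow RecordBatch
df <- data.frame(int = 1:5, chr = letters[1:5], stringsAsFactors = FALSE)
batch <- record_batch(df)

# select() and filter() are applied to the Arrow data;
# collect() pulls the result back into R as a data.frame
result <- batch %>%
  select(int, chr) %>%
  filter(int > 2) %>%
  collect()
```

The same pipeline should work if `batch` is replaced by an Arrow Table.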

`dplyr` is added to Suggests, and the S3 methods are registered in `.onLoad` using a function from `vctrs` (which moves from Suggests to Imports), so `dplyr` is not a hard dependency.
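
The registration pattern looks roughly like the sketch below. It is meant to live in a package's `zzz.R` rather than be run as a script, and the exact list of verbs and classes is an assumption based on the list above, not a copy of the code in this PR. `vctrs::s3_register()` expects a method such as `filter.RecordBatch` to be defined in the package namespace.

```r
# Sketch only: assumes a package with vctrs in Imports and dplyr in Suggests.
.onLoad <- function(libname, pkgname) {
  # Assumed verb list; note that the dplyr generic is spelled "summarise"
  verbs <- c(
    "select", "rename", "filter", "collect", "summarise",
    "group_by", "mutate", "arrange", "pull"
  )
  for (class_name in c("RecordBatch", "Table")) {
    for (verb in verbs) {
      # Registers e.g. filter.RecordBatch for the dplyr::filter generic
      # without requiring dplyr to be attached at load time
      vctrs::s3_register(paste0("dplyr::", verb), class_name)
    }
  }
  invisible()
}
```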

There's also a test helper that asserts that both RecordBatches and Tables yield the same results as a tibble, given the same sequence of piped verbs.
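
A parity check of this kind could be sketched as follows. `check_parity` is a hypothetical name and a simplified stand-in, not the helper added in this PR; it assumes `arrow` and `dplyr` are installed and takes the pipeline as a plain function.

```r
library(arrow)
library(dplyr)

# Hypothetical helper: runs the same pipeline on a tibble, a RecordBatch,
# and a Table, and stops if the Arrow results differ from the tibble result.
check_parity <- function(pipeline, df) {
  expected <- as.data.frame(pipeline(as_tibble(df)))
  inputs <- list(RecordBatch = record_batch(df), Table = Table$create(df))
  for (name in names(inputs)) {
    actual <- as.data.frame(pipeline(inputs[[name]]))
    if (!isTRUE(all.equal(expected, actual))) {
      stop("Result for ", name, " differs from the tibble result")
    }
  }
  invisible(TRUE)
}

# Example data invented for this sketch
df <- data.frame(int = 1:10, dbl = as.numeric(1:10))
ok <- check_parity(function(.input) {
  .input %>%
    select(int, dbl) %>%
    filter(dbl > 5) %>%
    collect()
}, df)
```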

Obviously there's lots more we can do, but this proves the concept and should be a useful starting point for the Dataset interface. We can still push more work down into Arrow once we have bindings for more compute kernels.

Pending feedback of course, my outstanding ideas for this PR are:

- [x] Consider creating a separate class/object for an arrow-dplyr query, rather than appending attributes to the RecordBatch and Table classes and having to worry about their state being modified. This is the model we'll have anyway for Datasets (where you create a Scan or something and that's where you do your query assembly).
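
To make the idea concrete, here is a minimal base-R sketch of a separate query object. It is not the `arrow_dplyr_query` implementation: a plain `data.frame` stands in for the RecordBatch or Table, and `lazy_query`, `lq_select`, `lq_filter`, and `lq_collect` are hypothetical names used only for illustration.

```r
# Hypothetical query object: wraps the data and records the requested
# operations without modifying the wrapped object.
lazy_query <- function(data) {
  structure(
    list(data = data, columns = names(data), filters = list()),
    class = "lazy_query"
  )
}

# Record which columns to keep; nothing is evaluated yet
lq_select <- function(query, columns) {
  stopifnot(all(columns %in% query$columns))
  query$columns <- columns
  query
}

# Record a filter; `predicate` takes the data and returns a logical vector
lq_filter <- function(query, predicate) {
  query$filters <- c(query$filters, list(predicate))
  query
}

# Apply the recorded filters and column selection in one step
lq_collect <- function(query) {
  keep <- rep(TRUE, nrow(query$data))
  for (predicate in query$filters) {
    # Treat NA results as "drop the row"
    keep <- keep & (predicate(query$data) %in% TRUE)
  }
  query$data[keep, query$columns, drop = FALSE]
}

df <- data.frame(int = 1:5, chr = letters[1:5], stringsAsFactors = FALSE)
query <- lq_filter(lq_select(lazy_query(df), "chr"), function(d) d$int > 2)
result <- lq_collect(query)
```

Because each step returns a new query object, the wrapped data is never modified, and a filter can still refer to a column that has been dropped from the selection.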

Closes #5661 from nealrichardson/dplyr-verbs and squashes the following commits:

3be00107a <Neal Richardson> Minor cleanups
ed515a762 <Neal Richardson> Create arrow_dplyr_query class and move logic into it
afca7bd21 <Neal Richardson> Use Expressions in filter method and handle renaming
ffb91d93a <Neal Richardson> Rebase to get Expressions back
3dcc4d58b <Neal Richardson> Some tricks to get renaming to work better
485567397 <Neal Richardson> Support pull and basic renaming
13fdac306 <Neal Richardson> Add arrange, test for transmute (just works), other tests as in dtplyr. Also add test helper to more easily assert parity
9f2504af2 <Neal Richardson> Undo some test finessing because the expect_equivalent override covers them
9f9ffec2b <Neal Richardson> Prune the Expression class (moved to a separate branch)
6b8a0ac85 <Neal Richardson> Add dplyr methods for Table
c9c67299f <Neal Richardson> Incorporate feedback from @hadley
dc1407d49 <Neal Richardson> Add mutate method(); make summarize() only collect what it needs; reorg csv tests to debug an unrelated issue
7f4066c0a <Neal Richardson> Fix some check issues
8d1e9dd8c <Neal Richardson> Lint
e5ad2cf81 <Neal Richardson> Make tests pass again; move dplyr to Suggests
3cdecd094 <Neal Richardson> Record group_by and pass that along through collect()
1b53f5b4b <Neal Richardson> Add Expression (R only) class and evaluate filters more efficiently in collect()
45690f686 <Neal Richardson> Basic select/filter/collect/summarize for RecordBatch

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>