Neal Richardson authored 9347731fe61 on 13 May 2021
ARROW-12731: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code

Discussing with @bkietz on #10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats:

* You can't dictionary_encode a dataset column. `Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}` (ARROW-12632). I will remove the `as.factor` method and leave a TODO to restore it after that JIRA is resolved.
* With the existing array_expressions, you could supply an additional Array (or R data convertible to an Array) when doing `mutate()`; this is not implemented for Datasets, and that's OK. For Tables/RecordBatches, the behavior in this PR is to pull the data into R, which is fine.
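
As a rough illustration of the idea (not code taken from this PR), wrapping a Table by hand and running filter/project through the Dataset machinery looks something like the sketch below. The data and column names are made up, and it assumes an arrow build with dataset support and dplyr installed.

```r
library(arrow)
library(dplyr)

# Hypothetical example data; the column names are illustrative only.
tab <- Table$create(x = 1:5, y = c("a", "b", "c", "d", "e"))

# Wrap the in-memory Table so the Dataset machinery can evaluate the query.
ds <- InMemoryDataset$create(tab)

# filter() and select() are recorded against the dataset;
# collect() evaluates them and returns the result as an R data frame.
result <- ds %>%
  filter(x > 2) %>%
  select(y) %>%
  collect()
```

With this change, the dplyr methods are meant to do this wrapping for you when handed a Table or RecordBatch, so user code should not need to call `InMemoryDataset$create()` itself.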

There are a lot of changes here, so the diff is big, but I've tried to group the main changes into distinct commits. Highlights:

* https://github.com/apache/arrow/pull/10191/commits/5b501c508e8da7313dce0e361369dc62aa645a8f is the main switch to use InMemoryDataset
* https://github.com/apache/arrow/pull/10191/commits/b31fb5e594bc49628f7a4459109784caafe99cb4 deletes `array_expression`
* https://github.com/apache/arrow/pull/10191/commits/0d3193863fc578d93d9319ea2184e46e9f2f36e1 simplifies the interface for adding functions to the dplyr data_mask; definitely check this one out and see what you think of the new way. I hope it makes adding new functions much simpler
* https://github.com/apache/arrow/pull/10191/commits/2e6374f94cbcc236becc3e41797a26127cf06ab0 improves the print method for queries by showing both the expression and the expected type of the output column, per suggestion from @bkietz
* https://github.com/apache/arrow/pull/10191/commits/d12f584e67531e251a1c72a5b67e14361d31f503 just splits up dplyr.R into many files
* https://github.com/apache/arrow/pull/10191/commits/34dc1e6589ca622c8b1baeba7ce03c1d2b0b4c28 deletes tests that are duplicated between test-dplyr*.R and test-dataset.R (since they're now going through a common C++ interface)
* https://github.com/apache/arrow/pull/10191/commits/a0914f67319e659348396f106024d69064ea3943 + https://github.com/apache/arrow/pull/10191/commits/eee491a4e9e6735a0f304d1d71306bfd091f702b contain ARROW-12696

Closes #10191 from nealrichardson/dplyr-in-memory

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>