Commits

Jonathan Keane authored 858470d928e
ARROW-14745: [R] Enable true duckdb streaming

This enables the ability to stream data back from Arrow via a `RecordBatchReader` instead of always materializing the full table (though that is also possible with an argument). This unlocks the ability to do the following (silly from an analysis standpoint) query, where the `to_arrow()` step uses a `RecordBatchReader` as a source rather than pulling the full table into memory at once.

```
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

nyc_ds <- open_dataset("path/to/nyc-taxi/", partitioning = c("year", "month"))

nyc_ds %>%
  select(-rate_code_id) %>%
  mutate(day = wday(dropoff_at), hour = hour(dropoff_at)) %>%
  to_duckdb() %>%
  # arrow doesn't (yet, at least in dplyr) support slicing like this, but this could be anything that one wants to do in duckdb
  group_by(year, month, day, hour) %>%
  slice_max(tip_amount) %>%
  to_arrow() %>%
  # but we can group_by %>% summarise()
  group_by(day, hour) %>%
  summarise(mean_tip = mean(tip_amount))
```

A few notes:

* This should only be merged after https://github.com/duckdb/duckdb/pull/2957 is merged. We get mixed up data when pulling `to_arrow()` on datasets without that PR. The tests are gated to only run after the next release of DuckDB (0.3.2). The failure on rhub/debian-gcc-devel:latest is because that run actually installs DuckDB from GitHub, which has that version number but not yet the patch from the PR.
* This also slightly changes the return value of `to_arrow()`: instead of returning `arrow_dplyr_query(Table)` or `arrow_dplyr_query(RecordBatchReader)`, we now simply return either the `Table` or the `RecordBatchReader`, and there are now dplyr methods for `filter`/`mutate`/etc. on RecordBatchReaders. I can undo that change if we want to keep the wrapping, but in my experience with messing with this / trying to find the source of the data corruption bug, having the RecordBatchReader was helpful.
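
For readers unfamiliar with what consuming a `RecordBatchReader` looks like, here is a minimal sketch that does not involve DuckDB at all. It assumes a recent arrow release in which `arrow_table()` and `as_record_batch_reader()` are available; the toy table and the variable names `tab`, `reader`, `n_rows`, and `total` are made up for illustration and are not part of this change.

```r
library(arrow, warn.conflicts = FALSE)

# Hypothetical toy data standing in for a large query result.
tab <- arrow_table(x = 1:6, y = letters[1:6])

# Wrap the table in a RecordBatchReader (assumes a recent arrow release).
reader <- as_record_batch_reader(tab)

n_rows <- 0
total <- 0

# read_next_batch() returns NULL once the stream is exhausted, so only
# one batch needs to be held in memory at a time.
while (!is.null(batch <- reader$read_next_batch())) {
  n_rows <- n_rows + batch$num_rows
  total <- total + sum(as.vector(batch$x))
}
```

The point of the sketch is only that the reader is pulled batch by batch; a reader can be consumed once, so anything that needs the data twice would have to materialize it first.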
## Testing locally

If one wants to test this locally, the easiest way is to install duckdb from pedro's branch with:

```
remotes::install_github("pdet/duckdb/tools/rpkg@rarrowstream", build = FALSE)
```

And then arrow from this branch.

Closes #11730 from jonkeane/ARROW-14745

Authored-by: Jonathan Keane <jkeane@gmail.com>
Signed-off-by: Jonathan Keane <jkeane@gmail.com>