Commits


Dewey Dunnington authored and GitHub committed a75299d8cb2
GH-35649: [R] Always call `RecordBatchReader::ReadNext()` from DuckDB from the main R thread (#36307)

### Rationale for this change

When passing a DuckDB result whose input was an Arrow dataset to Arrow via `to_arrow()`, some DuckDB operations can call back into R from threads other than the main R thread. This caused a crash or hang on Linux when attempting to combine `pivot_longer()` and `write_dataset()`.

### What changes are included in this PR?

- Added a wrapper class around the `RecordBatchReader` that routes calls to `ReadNext()` through `SafeCallIntoR()` (a hedged sketch of this pattern appears after the commit message below).

### Are these changes tested?

I can't find a new case that isn't covered by our existing tests, although I did remove a skip that was causing a similar problem at one point (#33033). Because it's difficult to predict or test where DuckDB evaluates R code, it's hard to know exactly what to test here (I would have expected R code to be evaluated, and a crash to occur, with many of our existing tests, but even the `pivot_longer()` example does not crash on macOS and Windows 🤷). I did verify on Ubuntu 22.04 that the reprex kindly provided by @PMassicotte errors before this PR and does not error after it:

```r
library(tidyverse)
library(arrow)

one_level_tree <- tempfile()

mtcars |>
  to_duckdb() |>
  pivot_longer(everything()) |>
  to_arrow() |>
  # collect() |> # collecting makes it work; otherwise it hangs on write_dataset()
  write_dataset(one_level_tree, partitioning = "name")

list.files(one_level_tree, recursive = TRUE)
```

### Are there any user-facing changes?

There are no user-facing changes.

* Closes: #35649

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
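
For illustration, here is a minimal sketch of the wrapper pattern described above, not the actual patch. It assumes a `SafeCallIntoR<T>()` helper that evaluates a callable on the main R thread and returns `arrow::Result<T>`; the declaration shown for that helper, the class name `MainRThreadRecordBatchReader`, and the `reason` string are illustrative guesses rather than the package's real code.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>

#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Assumed shape of the package's helper, declared here only so the sketch is
// self-contained: it runs `fun` on the main R thread and returns its result.
// The real declaration lives in the arrow R package, not in this sketch.
template <typename T>
arrow::Result<T> SafeCallIntoR(std::function<T()> fun, const std::string& reason);

// Hypothetical wrapper: delegates to another RecordBatchReader but routes
// ReadNext() through the main R thread, so any R evaluation that DuckDB
// triggers while producing the next batch happens where calling into R is safe.
class MainRThreadRecordBatchReader : public arrow::RecordBatchReader {
 public:
  explicit MainRThreadRecordBatchReader(
      std::shared_ptr<arrow::RecordBatchReader> parent)
      : parent_(std::move(parent)) {}

  std::shared_ptr<arrow::Schema> schema() const override {
    return parent_->schema();
  }

  arrow::Status ReadNext(std::shared_ptr<arrow::RecordBatch>* batch_out) override {
    // Run the parent's ReadNext() on the main R thread. The outer Result
    // reports whether the call into R itself succeeded; the inner Status is
    // whatever the wrapped reader returned.
    arrow::Result<arrow::Status> result = SafeCallIntoR<arrow::Status>(
        [this, batch_out]() { return parent_->ReadNext(batch_out); },
        "ReadNext() on a RecordBatchReader passed from DuckDB");
    ARROW_RETURN_NOT_OK(result.status());
    return result.MoveValueUnsafe();
  }

 private:
  std::shared_ptr<arrow::RecordBatchReader> parent_;
};
```

Keeping the outer `Result` separate from the inner `Status` distinguishes a failure to call into R at all from an error reported by the wrapped reader itself.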