Commits


Dewey Dunnington authored and Neal Richardson committed 6cf344b61e6
ARROW-9235: [R] Support for `connection` class when reading and writing files

This is a PR to support arbitrary R "connection" objects as Input and Output streams. In particular, this adds support for sockets (ARROW-4512), URLs, and some other IO operations that are implemented as R connections (e.g., in the [archive](https://github.com/r-lib/archive#archive) package). The gist of it is that you should be able to do this:

``` r
# remotes::install_github("paleolimbot/arrow/r@r-connections")
library(arrow, warn.conflicts = FALSE)

addr <- "https://github.com/apache/arrow/raw/master/r/inst/v0.7.1.parquet"

stream <- arrow:::make_readable_file(addr)
rawToChar(as.raw(stream$Read(4)))
#> [1] "PAR1"
stream$close()

stream <- arrow:::make_readable_file(url(addr, open = "rb"))
rawToChar(as.raw(stream$Read(4)))
#> [1] "PAR1"
stream$close()
```

There are two serious issues that prevent this PR from being useful yet.

First, it uses functions that R considers "non-API" functions from the C API.

> checking compiled code ... NOTE
> File ‘arrow/libs/arrow.so’:
> Found non-API calls to R: ‘R_GetConnection’, ‘R_ReadConnection’, ‘R_WriteConnection’
>
> Compiled code should not call non-API entry points in R.

We can get around this by calling back into R (in the same way this PR implements `Tell()` and `Close()`). We could also go all out and implement the other half (exposing `InputStream`/`OutputStream`s as R connections) and ask for an exemption (at least one R package, curl, does this). The archive package seems to expose connections without a NOTE on the CRAN check page, so perhaps there is also a workaround.

Second, we get a crash when passing the input stream to most functions. I think this is because the `Read()` method is getting called from another thread but it also could be an error in my implementation. If the issue is threading, we would have to arrange a way to queue jobs for the R main thread (e.g., how the [later](https://github.com/r-lib/later#background-tasks) package does it) and a way to ping it occasionally to fetch the results. This is complicated but might be useful for other reasons (supporting evaluation of R functions in more places). It also might be more work than it's worth.

``` r
# remotes::install_github("paleolimbot/arrow/r@r-connections")
library(arrow, warn.conflicts = FALSE)

addr <- "https://github.com/apache/arrow/raw/master/r/inst/v0.7.1.parquet"
read_parquet(addr)
```

```
 *** caught segfault ***
address 0x28, cause 'invalid permissions'

Traceback:
 1: parquet___arrow___FileReader__OpenFile(file, props)
```

Closes #12323 from paleolimbot/r-connections

Lead-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
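
For illustration only, the "queue jobs for the R main thread" idea mentioned above could look roughly like the following. This is a minimal sketch in plain standard C++, not the code in this PR and not how the later package is implemented. `MainThreadQueue`, `Submit()`, `RunUntil()` and `ReadMagicOnMainThread()` are hypothetical names, and the lambda returning `"PAR1"` merely stands in for a connection read that would have to run on the R main thread.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical helper: worker threads hand tasks to the one thread that is
// allowed to touch a single-threaded runtime (here a stand-in for R).
class MainThreadQueue {
 public:
  // Safe to call from any thread. The task is queued, not executed.
  std::future<std::string> Submit(std::function<std::string()> task) {
    std::packaged_task<std::string()> job(std::move(task));
    std::future<std::string> result = job.get_future();
    {
      std::lock_guard<std::mutex> lock(mutex_);
      jobs_.push_back(std::move(job));
    }
    cv_.notify_one();
    return result;
  }

  // Call from the main thread only. Runs queued jobs until done() is true.
  void RunUntil(const std::function<bool()>& done) {
    while (!done()) {
      std::packaged_task<std::string()> job;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        // Wake up periodically so done() is re-checked even when idle.
        cv_.wait_for(lock, std::chrono::milliseconds(10),
                     [this] { return !jobs_.empty(); });
        if (jobs_.empty()) continue;
        job = std::move(jobs_.front());
        jobs_.pop_front();
      }
      job();  // Executes on the main thread and fulfils the worker's future.
    }
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::packaged_task<std::string()>> jobs_;
};

// Demo: a worker needs four bytes from a "connection" but never reads it
// itself; it asks the main thread to do so and blocks on the result.
std::string ReadMagicOnMainThread() {
  MainThreadQueue queue;
  const std::thread::id main_id = std::this_thread::get_id();
  std::atomic<bool> worker_done{false};
  std::string worker_result;

  std::thread worker([&] {
    std::future<std::string> bytes = queue.Submit([main_id] {
      // Stand-in for the real read; reports which thread actually ran it.
      return std::this_thread::get_id() == main_id
                 ? std::string("PAR1")
                 : std::string("wrong thread");
    });
    worker_result = bytes.get();
    worker_done = true;
  });

  queue.RunUntil([&] { return worker_done.load(); });
  worker.join();
  return worker_result;
}
```

Calling `ReadMagicOnMainThread()` from the main thread drains the queue while the worker waits on its future. In a real R package the main thread would also have to keep servicing R itself, which is the "ping it occasionally" part and is not modelled here.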