Dewey Dunnington authored and GitHub committed 21564cf3981
ARROW-17187: [R] Improve lazy ALTREP implementation for String (#14271)

This PR is a more permanent fix for ARROW-16578 (#13415), which noted some very bad performance when calling `unique()` on a character vector created by us. In benchmarking a few other things and implementing some simpler ALTREP classes around the C Data interface (https://github.com/apache/arrow-nanoarrow/pull/50), I found a few other issues...notably, that calling `head()` (or accessing a single string element) would materialize an entire string vector (rather than just materializing the elements required):

``` r
library(arrow, warn.conflicts = FALSE)

big_string <- as.character(runif(1e6))
big_string_altrep <- as.vector(as_arrow_array(big_string))

bench::mark(head(big_string), head(big_string_altrep))[c("expression", "mem_alloc")]
#> # A tibble: 2 × 2
#>   expression              mem_alloc
#>   <bch:expr>              <bch:byt>
#> 1 head(big_string)               0B
#> 2 head(big_string_altrep)    7.63MB
```

For the original STRING_ELT issue, the previous method may have been slow for a few reasons:

- First, it used `BEGIN_CPP11` and `END_CPP11`, which stack-allocate an 8 kB error message array (fast, but not trivial).
- Second, it stack-allocated an `RStringViewer` for every string access. This is probably not that bad, but because it is not trivially destructible it couldn't be used without `BEGIN_CPP11`/`END_CPP11`, which allocates more than needed. I statically allocate one of these to solve this.
- Third, it called `GetOption()` (via creating the string viewer) on every string access. This is a little tricky to work around...I opted to read the option whenever we create an ALTREP vector. This means that technically the value read when an individual value materializes could be out of date; however, in the case that a user does `options(arrow.strip_nuls = TRUE)` it will be true for the rest of the session.
- Fourth, it `unwind_protect`ed for every string access (fast, but not trivial).
I fixed this by statically allocating one `RStringViewer` (so that memory does not leak from a longjmp in the frame) and removing the cpp11 calls from the Elt method (since these would throw exceptions where a longjmp is expected by the calling frame).

This PR also releases the `std::shared_ptr<ChunkedArray>` when we materialize our ALTREP vectors. We use it for the length/sum/min/max implementations and for roundtripping, but in many cases it is not used and is just taking up memory that could be released and used for something else. I ran ursabot's benchmarks to make sure this doesn't slow things down, and it doesn't (at least for our existing benchmarks).

In changing the above bits, I ran test coverage and noticed that there were parts of altrep.cpp that weren't tested by our test suite. Testing ALTREP is hard because it is almost completely undocumented and where exactly it gets used in the R API changes with each release. I added some helper methods and tests that more accurately reflect the C/C++ usage of ALTREP.

Finally, I noticed that in some places we were doing a "from the beginning" linear scan (O(N) per lookup, so O(N²) over many accesses) rather than an O(log N) binary search to resolve a chunk/index within the array. This has a non-trivial effect on arrays with many chunks.

Before this PR:

``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

chunks <- lapply(1:1000, function(i) runif(1e3))
altrep_one_chunk <- as.vector(ChunkedArray$create(unlist(chunks)))
altrep_many_chunks <- as.vector(ChunkedArray$create(!!!chunks))
random_indices <- sample(seq_len(1e6), 1e6, replace = FALSE)

bench::mark(
  altrep_one_chunk[random_indices],
  altrep_many_chunks[random_indices]
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                              min   median `itr/sec` mem_a…¹ gc/se…²
#>   <bch:expr>                         <bch:tm> <bch:tm>     <dbl> <bch:b>   <dbl>
#> 1 altrep_one_chunk[random_indices]    68.99ms  72.05ms    12.5    7.63MB  16.1
#> 2 altrep_many_chunks[random_indices]    4.36s    4.36s     0.229  7.63MB   0.459
#> # … with abbreviated variable names ¹mem_alloc, ²`gc/sec`
```

After this PR:

``` r
library(arrow, warn.conflicts = FALSE)

chunks <- lapply(1:1000, function(i) runif(1e3))
altrep_one_chunk <- as.vector(ChunkedArray$create(unlist(chunks)))
altrep_many_chunks <- as.vector(ChunkedArray$create(!!!chunks))
random_indices <- sample(seq_len(1e6), 1e6, replace = FALSE)

bench::mark(
  altrep_one_chunk[random_indices],
  altrep_many_chunks[random_indices]
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                              min   median `itr/sec` mem_a…¹ gc/se…²
#>   <bch:expr>                         <bch:tm> <bch:tm>     <dbl> <bch:b>   <dbl>
#> 1 altrep_one_chunk[random_indices]     64.7ms   71.5ms     13.1   7.63MB   16.8
#> 2 altrep_many_chunks[random_indices]  113.1ms  117.1ms      7.85  7.63MB    9.82
#> # … with abbreviated variable names ¹mem_alloc, ²`gc/sec`
```

I solved this by using Arrow's `ChunkResolver`. This resulted in a small change in the ALTREP implementation: it is now an external pointer around a custom `ArrowAltrepData` class that contains a `shared_ptr<ChunkedArray>` and a `ChunkResolver` (before, it was just an external pointer to a `shared_ptr<ChunkedArray>`).

Lead-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Co-authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>