Commits


Romain Francois authored and Neal Richardson committed 5e150bb7a68
ARROW-9557: [R] Iterating over parquet columns is slow in R

I don't think this is about `shared_ptr_is_null()` as indicated in the jira issue: https://issues.apache.org/jira/browse/ARROW-9557

I guess profvis (or probably the underlying profiler) struggles with that case.

What happens though is that `$ReadTable()` first calls `$GetSchema()`:

```r
ReadTable = function(col_select = NULL) {
  col_select <- enquo(col_select)
  if (quo_is_null(col_select)) {
    shared_ptr(Table, parquet___arrow___FileReader__ReadTable1(self))
  } else {
    all_vars <- shared_ptr(Schema, parquet___arrow___FileReader__GetSchema(self))$names
    indices <- match(vars_select(all_vars, !!col_select), all_vars) - 1L
    shared_ptr(Table, parquet___arrow___FileReader__ReadTable2(self, indices))
  }
}
```

and that's expensive for some reason:

``` r
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp
library(vctrs)
#>
#> Attaching package: 'vctrs'
#> The following objects are masked from 'package:arrow':
#>
#>     field, list_of
library(purrr)

df <- new_data_frame(
  map(set_names(1:4000), ~rnorm(50000))
)
tf <- tempfile()
write_parquet(df, tf)
reader <- ParquetFileReader$create(tf)

parquet___arrow___FileReader__GetSchema <- arrow:::parquet___arrow___FileReader__GetSchema
parquet___arrow___FileReader__ReadColumn <- arrow:::parquet___arrow___FileReader__ReadColumn

system.time({
  for (i in 1:4000) {
    parquet___arrow___FileReader__GetSchema(reader)
  }
})
#>    user  system elapsed
#>  43.809   1.744  47.962

system.time({
  for (i in 1:4000) {
    parquet___arrow___FileReader__ReadColumn(reader, i)
  }
})
#>    user  system elapsed
#>   3.035   2.448  10.606
```

<sup>Created on 2020-09-07 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001)</sup>

So we probably need a more complete R6 wrapper around `parquet::arrow::FileReader`.
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L107

As a start, here is `$GetColumn()`

Closes #8122 from romainfrancois/ARROW-9557/ParquetFileReader

Lead-authored-by: Romain Francois <romain@rstudio.com>
Co-authored-by: Romain François <romain@rstudio.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
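
As a hedged illustration of the index arithmetic in the `$ReadTable()` body quoted above, the sketch below uses plain character vectors in place of a real schema. `all_vars` and `selected` are made-up values, and a bare `match()` stands in for the `vars_select()` step. It only shows that the name lookup can be done once and the resulting zero-based indices reused, instead of fetching the schema on every iteration.

```r
# Hypothetical column names standing in for the schema's `$names`;
# in real use these would be fetched once, outside any loop.
all_vars <- c("a", "b", "c", "d")

# Hypothetical selection; the quoted code derives this via vars_select().
selected <- c("b", "d")

# match() returns 1-based positions; the quoted code subtracts 1L
# to get the 0-based column indices passed to the C++ reader.
indices <- match(selected, all_vars) - 1L
print(indices)
```

With the values above, `indices` holds the zero-based positions of `"b"` and `"d"`. Whether caching the names this way is appropriate depends on how the reader is used; this is a sketch, not the change made in the commit.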