Commits


Hideaki Hayashi authored and GitHub committed 9ad22551f73
ARROW-16578: [R] unique() and is.na() on a column of a tibble is much slower after writing to and reading from a parquet file (#13415)

Fixes ARROW-16578 "[R] unique() and is.na() on a column of a tibble is much slower after writing to and reading from a parquet file".

Here I'm materializing the AltrepVectorString at the first call to Elt. My thought is that this makes sense because one call from R (e.g. unique()) is likely to be followed by another, and also because getting a string from Array seems to be much more costly than from data2. Something like a 3-strike rule may make sense too, but in this PR I'm taking this simple approach.

ARROW-16578 reprex with the fix:

```
> df1 <- tibble::tibble(x=as.character(floor(runif(1000000) * 20)))
> write_parquet(df1,"/tmp/test.parquet")
> df2 <- read_parquet("/tmp/test.parquet")
> system.time(unique(df2$x))
   user  system elapsed
  0.074   0.002   0.082
> system.time(unique(df1$x))
   user  system elapsed
  0.022   0.001   0.025
> system.time(is.na(df2$x))
   user  system elapsed
  0.006   0.001   0.006
> system.time(is.na(df1$x))
   user  system elapsed
  0.003   0.000   0.004
```

devtools::test() result:

```
[ FAIL 0 | WARN 0 | SKIP 30 | PASS 7271 ]
```

Authored-by: Hideaki Hayashi <hihayash@gmail.com>
Signed-off-by: Dewey Dunnington <dewey@fishandwhistle.net>
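
The "materialize at the first call to Elt" idea from the commit message above can be illustrated with a small standalone sketch. This is not the Arrow R package's code: `LazyStringVector`, `slow_get`, and `is_materialized` are hypothetical names, and the `std::function` callback only stands in for the costly per-element read from the backing array. It assumes, as the commit message does, that one element access is likely to be followed by many more.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an ALTREP-style string vector (not Arrow's class).
class LazyStringVector {
 public:
  // `slow_get` models the expensive per-element read from the backing array.
  LazyStringVector(std::size_t size,
                   std::function<std::string(std::size_t)> slow_get)
      : size_(size), slow_get_(std::move(slow_get)) {}

  // The first element access copies every element into the cache;
  // all later accesses are served from the cache.
  const std::string& Elt(std::size_t i) {
    assert(i < size_);
    if (!materialized_) Materialize();
    return cache_[i];
  }

  bool is_materialized() const { return materialized_; }
  std::size_t size() const { return size_; }

 private:
  // Read each element once through the slow path and keep the result.
  void Materialize() {
    cache_.reserve(size_);
    for (std::size_t i = 0; i < size_; ++i) {
      cache_.push_back(slow_get_(i));
    }
    materialized_ = true;
  }

  std::size_t size_;
  std::function<std::string(std::size_t)> slow_get_;
  std::vector<std::string> cache_;
  bool materialized_ = false;
};
```

The trade-off in this sketch is that a single element lookup pays for a full copy. The 3-strike rule mentioned in the commit message would defer that cost until repeated access has actually been observed.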