Commits


Romain Francois authored and Wes McKinney committed 3b61349b3c1
ARROW-2968: [R] Multi-threaded conversion from Arrow table to R data.frame

The `as_tibble()` methods for `arrow::RecordBatch` and `arrow::Table` gained a `use_threads` argument. When set to `TRUE`, columns of a record batch or table are converted to R vectors in parallel.

We cannot allocate R data structures in parallel (including scalar strings), so it goes like this:

```
for each column:
  - allocate the R vector host for the array
  - if the fill can be done in parallel, start filling the R vector with data from the array

fill serially all columns that could not be filled in parallel

wait for all columns to be filled
```

I believe this is better (although perhaps harder to explain) than:

 - allocate all the vectors
 - fill them in parallel

because we don't have to wait for all the vectors to be allocated before we start filling them.

I believe the Python implementation takes the allocate-then-fill approach, in `DataFrameBlockCreator::Convert`:

```
RETURN_NOT_OK(CreateBlocks());
RETURN_NOT_OK(WriteTableToBlocks());
```

I've had to split the implementation of `Array__as_vector` into two steps:

 - Allocate: this must happen on the main thread, or alternatively would need a mutex around R
 - Ingest: for most array types, this can be done in parallel

Author: Romain Francois <romain@purrple.cat>

Closes #3332 from romainfrancois/2968/threads and squashes the following commits:

8261f2907 <Romain Francois> sprinkle use_threads in functions that call as_tibble()
3205de2d8 <Romain Francois> lint
590baf5a6 <Romain Francois> using string_view
cd0dd343e <Romain Francois> no need for checkBuffers
29546cd5d <Romain Francois> Some more refactoring of the Converters
5557b7974 <Romain Francois> refactor the Converter api, so that all Converters are implementations of the base class Converter.
e2ed26b78 <Romain Francois> lint
2a5815e03 <Romain Francois> moving parallel_ingest() to a static method of the Converter classes
2613d4ec4 <Romain Francois> null_count already local variable
62a842054 <Romain Francois> + to_r_index lambda, with comment about why +1
52c725fc8 <Romain Francois> default_value() marked constexpr
11e82e769 <Romain Francois> lint
d22b9c551 <Romain Francois> parallel version of Table__to_dataframe
2455bd057 <Romain Francois> parallel version of RecordBatch__to_dataframe
380d3a5bc <Romain Francois> simplify ArrayVector__as_vector.
85881a3e2 <Romain Francois> simplify ArrayVector_To_Vector
7074b36e9 <Romain Francois> reinstate Converter_Timestamp so that ArrayVector__as_vector can be simplified
cf7e76bae <Romain Francois> + parallel_ingest<Converter>() to indicate if ingest for a given converter can be done in parallel
baaaefe1b <Romain Francois> Rework Converter api
e650b7934 <Romain Francois> + arrow::r::inspect(SEXP) for debugging
a335dfdfc <Romain Francois> Factor out Array -> R vector code in separate file
1212e28a9 <Romain Francois> <Converter>.Ingest() return an Invalid status instead of throwing an exception
39bf76403 <Romain Francois> <Converter>.Ingest() return a Status instead of void
f68b79376 <Romain Francois> replaced DictionaryArrays_to_Vector and Converter_Dictionary_Int32Indices by Converter_Dictionary
d25a0e6b5 <Romain Francois> replace Date32ArrayVector_to_Vector by Converter_Date32
85e48c0c7 <Romain Francois> lint
18b921e6f <Romain Francois> + Get/Set ThreadPoolCapacity
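
The allocate-then-ingest scheme described in the commit message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this commit: `Column`, `Ingest` and `ConvertColumns` are hypothetical names, `std::vector<int>` stands in for both the Arrow array and the R vector, `std::async` stands in for the Arrow thread pool, and a plain string stands in for a `Status`.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for an Arrow array: the values plus a flag saying
// whether the destination may be filled off the main thread.
struct Column {
  std::vector<int> values;
  bool parallel_ok;
};

// Hypothetical stand-in for a converter's Ingest step: copies into an
// already-allocated destination. An empty string means success.
std::string Ingest(const Column& src, std::vector<int>& dest) {
  if (dest.size() != src.values.size()) return "size mismatch";
  for (std::size_t i = 0; i < src.values.size(); ++i) dest[i] = src.values[i];
  return "";
}

// Allocates every destination on the calling thread, starts parallel fills
// as soon as each destination exists, then handles the serial-only columns,
// then waits for the parallel fills. Returns true if every fill succeeded.
bool ConvertColumns(const std::vector<Column>& columns,
                    std::vector<std::vector<int>>* out) {
  assert(out != nullptr);
  const std::size_t n = columns.size();
  out->assign(n, std::vector<int>());  // outer vector is never resized again

  std::vector<std::future<std::string>> tasks;
  std::vector<std::size_t> serial;

  for (std::size_t i = 0; i < n; ++i) {
    // Allocate: always on the calling ("main") thread.
    (*out)[i].resize(columns[i].values.size());
    if (columns[i].parallel_ok) {
      // Ingest can start now, before later columns are allocated.
      tasks.push_back(std::async(std::launch::async, [&columns, out, i] {
        return Ingest(columns[i], (*out)[i]);
      }));
    } else {
      serial.push_back(i);
    }
  }

  bool ok = true;
  // Fill serially the columns that could not be filled in parallel.
  for (std::size_t i : serial) {
    if (!Ingest(columns[i], (*out)[i]).empty()) ok = false;
  }
  // Wait for all parallel fills, collecting every status.
  for (auto& task : tasks) {
    if (!task.get().empty()) ok = false;
  }
  return ok;
}
```

Each task writes only to its own pre-allocated destination, so no locking is needed in this sketch. In the real package the allocation step touches R's memory manager, which is why it stays on the main thread.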