Commits


Krisztián Szűcs authored and Benjamin Kietzman committed 87dd7e9894b
ARROW-9992: [C++][Python] Refactor python to arrow conversions based on a reusable conversion API

### Targets of the refactoring:

- PythonToArrow converters based on a common API
- PyBytesView to use `Result` return values and contain `is_utf8` flag
- PyConversionOptions is now available from all converters so we can honor its flags

### Fixes

- ARROW-9993 [Python] Tzinfo - string roundtrip fails on pytz.StaticTzInfo objects
- ARROW-9994 [C++][Python] Auto chunking nested array containing binary-like fields result malformed output
- ARROW-9996 [C++] Dictionary is unset when calling DictionaryArray.GetScalar for null values
- ~~ARROW-9997 [Python] StructScalar.as_py() fails if the type has duplicate field names~~
- ARROW-9999 [Python] Support constructing dictionary array directly through pa.array()
- ARROW-10000 [C++][Python] Support constructing StructArray from list of key-value pairs
- ARROW-9593 [Python] Add custom pickle reducers for DictionaryScalar
- ARROW-6281 [Python] Produce chunked arrays for nested types in pyarrow.array
- ARROW-2367 [Python] ListArray has trouble with sizes greater than kMaximumCapacity
- ARROW-9976: [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe

### Backward incompatibility

~~Since a struct type can contain duplicated field names we cannot return a struct scalar as a mapping, so I had to change the `.as_py()` representation to return with a list of key-value pairs.~~

### TODOs:

- [x] ensure that the large memory tests are passing
- [x] benchmark and check binary size again

### Library size

Before:

```
12M  Sep 25 15:05 libarrow.200.0.0.dylib
2.7M Sep 25 15:07 libarrow_python.200.0.0.dylib
```

After:

```
12M  Sep 25 15:46 libarrow.200.0.0.dylib
2.1M Sep 25 15:50 libarrow_python.200.0.0.dylib
```

### Benchmarks

Executed the following ASV benchmark:

```bash
asv continuous --bench convert_builtins master py2ar --no-only-changed --split
```

After some optimization:

```
Benchmarks that have improved:

       before           after         ratio
     [f358a29b]       [18d1c052]
     <master>         <py2ar>
-     2.78±0.03ms      2.45±0.03ms     0.88  convert_builtins.ConvertPyListToArray.time_convert('bool')
-     3.59±0.01ms      3.12±0.02ms     0.87  convert_builtins.ConvertPyListToArray.time_convert('int32')
-     3.37±0.01ms      2.73±0.01ms     0.81  convert_builtins.ConvertPyListToArray.time_convert('uint32')
-     3.74±0.02ms      3.03±0.01ms     0.81  convert_builtins.ConvertPyListToArray.time_convert('int64')
-     3.38±0.01ms      2.69±0.01ms     0.80  convert_builtins.ConvertPyListToArray.time_convert('uint64')
-     2.83±0.01ms      2.24±0.01ms     0.79  convert_builtins.ConvertPyListToArray.time_convert('float32')
-     3.92±0.02ms      2.99±0.02ms     0.76  convert_builtins.ConvertPyListToArray.time_convert('binary10')
-     14.1±0.04ms      8.89±0.05ms     0.63  convert_builtins.ConvertPyListToArray.time_convert('unicode')
-     5.60±0.01ms      3.24±0.03ms     0.58  convert_builtins.ConvertPyListToArray.time_convert('ascii')
-     5.37±0.02ms      2.91±0.04ms     0.54  convert_builtins.ConvertPyListToArray.time_convert('binary')

Benchmarks that have stayed the same:

       before           after         ratio
     [f358a29b]       [18d1c052]
     <master>         <py2ar>
      14.8±0.02ms      15.5±0.1ms      1.05  convert_builtins.ConvertPyListToArray.time_convert('decimal')
      16.4±0.7ms       15.1±0.6ms      0.92  convert_builtins.ConvertPyListToArray.time_convert('struct from tuples')
      34.4±0.3ms       31.5±0.4ms      0.92  convert_builtins.ConvertPyListToArray.time_convert('int64 list')
      16.7±0.7ms       15.1±0.6ms     ~0.91  convert_builtins.ConvertPyListToArray.time_convert('struct')
      2.42±0.02ms      2.05±0.03ms    ~0.85  convert_builtins.ConvertPyListToArray.time_convert('float64')
```

Closes #8088 from kszucs/py2ar

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
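
For readers of the benchmark output: the `ratio` column is the `after` time divided by the `before` time, so values below 1 mean the `py2ar` branch is faster. A minimal Python sketch that recomputes it for a few rows copied from the output above; the `rows` list and `ratio` helper are illustrative names only, not part of asv or Arrow, and the error bars are ignored:

```python
# Mean timings in milliseconds copied from the benchmark output
# (the ± error bars are omitted for simplicity).
rows = [
    ("bool", 2.78, 2.45),
    ("unicode", 14.1, 8.89),
    ("binary", 5.37, 2.91),
    ("decimal", 14.8, 15.5),
]


def ratio(before_ms, after_ms):
    """Return after/before rounded to two decimals, like the ratio column."""
    return round(after_ms / before_ms, 2)


# Map each benchmark parameter to its recomputed ratio.
ratios = {name: ratio(before, after) for name, before, after in rows}

for name, value in ratios.items():
    print(f"{name:8s} {value:.2f}")
```

asv derives the ratio from unrounded measurements, so a value recomputed from the printed means can differ in the last digit for other rows.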