Commits

Wes McKinney authored 3b000b7062b
ARROW-3789: [Python] Use common conversion path for Arrow to pandas.Series/DataFrame. Zero copy optimizations for DataFrame, add split_blocks and self_destruct options

The primary goal of this patch is to provide a way for some users to avoid memory doubling with converting from Arrow to pandas. This took me entirely too much time to get right, but partly I was attempting to disentangle some of the technical debt and overdue refactoring in arrow_to_pandas.cc.

Summary of what's here:

- Refactor ChunkedArray->Series and Table->DataFrame conversion paths to use the exact same code rather than two implementations of the same thing with slightly different behavior. The `ArrowDeserializer` helper class is now gone
- Do zero-copy construction of internal DataFrame blocks for the case of a contiguous non-nullable array and a block with only 1 column represented
- Add `split_blocks` option to `to_pandas` which constructs one block per DataFrame column, resulting in more zero-copy opportunities. Note that pandas's internal "consolidation" can still cause memory doubling (see discussion about this in https://github.com/pandas-dev/pandas/issues/10556)
- Add `self_destruct` option to `to_pandas` which releases the Table's internal buffers as soon as they are converted to the required pandas structure. This allows memory to be reclaimed by the OS as conversion is taking place rather than having a forced memory-doubling and then post-facto reclamation (which has been causing OOM for some users)

The most conservative invocation of `to_pandas` now would be `table.to_pandas(use_threads=False, split_blocks=True, self_destruct=True)`

Note that the self-destruct option makes the `Table` object unsafe for further use. This is a bit dissatisfying but I wasn't sure how else to provide this capability.

Closes #6067 from wesm/ARROW-3789 and squashes the following commits:

3b4260283 <Wes McKinney> Code review comments
8f39cce05 <Wes McKinney> Add some documentation. Try fixing MSVC warnings
c22d280dc <Wes McKinney> Fix one MSVC cast warning
43068032c <Wes McKinney> Add "split blocks" and "self destruct" options to Table.to_pandas, with zero-copy operations for improved memory use when converting from Arrow to pandas

Authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
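
As an illustration of the "most conservative invocation" named in the commit message, here is a minimal sketch. The helper `to_pandas_low_memory`, the constant `LOW_MEMORY_OPTIONS`, and the `demo` function are hypothetical names introduced for this example and are not part of pyarrow; only the three keyword arguments come from the commit message. `demo` assumes a pyarrow build that includes this patch, plus pandas.

```python
# Keyword arguments the commit message names as the most conservative
# invocation of `to_pandas`.
LOW_MEMORY_OPTIONS = {
    "use_threads": False,
    "split_blocks": True,
    "self_destruct": True,
}


def to_pandas_low_memory(table):
    """Convert `table` to pandas using the conservative options above.

    Hypothetical helper, not part of pyarrow. Because self_destruct=True
    is passed, the caller must treat `table` as consumed afterwards.
    """
    return table.to_pandas(**LOW_MEMORY_OPTIONS)


def demo():
    # Imported lazily so the helper above can be read without pyarrow.
    import pyarrow as pa

    table = pa.table({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
    df = to_pandas_low_memory(table)
    # The commit message warns that the Table is unsafe for further use
    # after a self-destructing conversion, so drop the reference.
    del table
    return df
```

Deleting the reference right after the call is a defensive habit suggested by the warning in the commit message. As the message also notes, pandas's internal consolidation can still copy data later, so the savings are not guaranteed to persist across subsequent DataFrame operations.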