Commits

Wes McKinney authored 208e79812b5
ARROW-1594: [Python] Multithreaded conversions to Arrow in from_pandas

This results in nice speedups when column conversions do not require the GIL to be held:

```python
In [5]: import numpy as np

In [6]: import pandas as pd

In [7]: import pyarrow as pa

In [8]: NROWS = 1000000

In [9]: NCOLS = 50

In [10]: arr = np.random.randn(NCOLS, NROWS).T

In [11]: arr[::5] = np.nan

In [12]: df = pd.DataFrame(arr)

In [13]: %timeit rb = pa.RecordBatch.from_pandas(df, nthreads=1)
10 loops, best of 3: 179 ms per loop

In [14]: %timeit rb = pa.RecordBatch.from_pandas(df, nthreads=4)
10 loops, best of 3: 59.7 ms per loop
```

This introduces a dependency on the `futures` backport of `concurrent.futures` for Python 2.7 (PSF license).

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #1186 from wesm/multithreaded-from-pandas and squashes the following commits:

a3072f0e [Wes McKinney] Only install futures on py2
c30e4735 [Wes McKinney] Add heuristic to use threadpool conversion only if nrows > ncols * 100
5a692085 [Wes McKinney] Only install concurrent.futures backport on py2, test serialize_pandas with nthreads
0afab342 [Wes McKinney] Add nthreads argument to serialize_pandas, make default for serialize/deserialize consistent
15841d13 [Wes McKinney] Default to cpu_count() for nthreads in from_pandas to conform with to_pandas default
6a58c038 [Wes McKinney] Add nthreads argument to RecordBatch/Table.from_pandas. Use concurrent.futures for parallel processing
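The pattern the commit describes — fan each column conversion out to a `concurrent.futures` thread pool, but only when the table is tall enough for per-column work to dominate thread overhead (the `nrows > ncols * 100` heuristic) — can be sketched in plain Python. The `convert_column` function below is a hypothetical stand-in for pyarrow's real per-column conversion, not its actual API:

```python
from concurrent.futures import ThreadPoolExecutor


def convert_column(col):
    # Stand-in for the real conversion of one pandas column to an Arrow
    # array; the speedup in the commit comes from this step releasing
    # the GIL so threads can run concurrently.
    return [x * 2 for x in col]


def convert_columns(columns, nthreads=4):
    ncols = len(columns)
    nrows = len(columns[0]) if columns else 0
    # Heuristic from the commit: fall back to a serial loop unless
    # nrows > ncols * 100, since thread overhead would otherwise
    # outweigh the parallel gain on small tables.
    if nthreads == 1 or nrows <= ncols * 100:
        return [convert_column(c) for c in columns]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        # executor.map preserves column order in the results
        return list(pool.map(convert_column, columns))


columns = [list(range(1000)) for _ in range(3)]
result = convert_columns(columns, nthreads=4)
```

This mirrors the shape of the change rather than its implementation: pyarrow's version dispatches C++-level conversions, while this sketch only demonstrates the pool-plus-heuristic control flow.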