Commit f4500255220 — authored by Wes McKinney, committed by Philipp Moritz
ARROW-1381: [Python] Use FixedSizeBufferWriter in SerializedPyObject.to_buffer

With this setup:

```
import numpy as np
import pyarrow as pa

objects = [np.random.randn(500, 500) for i in range(400)]
serialized = pa.serialize(objects)
```

before this change I have:

```
In [3]: %timeit buf = serialized.to_buffer()
201 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

and after:

```
In [4]: %timeit buf = serialized.to_buffer()
81.1 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

I added an `nthreads` option, but note that when the objects are small, multithreading makes things slower because of the overhead of launching threads. The 1MB threshold in `arrow/io/memory.cc` may be too small; we might do some benchmarking to find a better default crossover point for switching between parallel and serial memcpy:

```
In [2]: %timeit buf = serialized.to_buffer(nthreads=4)
134 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

cc @pcmoritz @robertnishihara

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #1017 from wesm/ARROW-1381 and squashes the following commits:

fbd0028 [Wes McKinney] Add unit test for SerializedPyObject.to_buffer
ab85230 [Wes McKinney] Add nthreads option for turning on multithreaded memcpy
db12072 [Wes McKinney] Use FixedSizeBufferWriter in SerializedPyObject.to_buffer