Commits

Wes McKinney authored b19e183c152
ARROW-1783: [Python] Provide a "component" dict representation of a serialized Python object with minimal allocation For systems (like Dask) that prefer to handle their own framed buffer transport, this provides a list of memoryview-compatible objects with minimal copying / allocation from the input data structure, which can similarly be zero-copy reconstructed to the original object. To motivate the use case, consider a dict of ndarrays: ``` data = {i: np.random.randn(1000, 1000) for i in range(50)} ``` Here, we have: ``` >>> %timeit serialized = pa.serialize(data) 52.7 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) ``` This is about 400MB of data. Some systems may not want to double memory by assembling this into a single large buffer, like with the `to_buffer` method: ``` >>> written = serialized.to_buffer() >>> written.size 400015456 ``` We provide a `to_components` method which contains a dict with a `'data'` field containing a list of `pyarrow.Buffer` objects. This can be converted back to the original Python object using `pyarrow.deserialize_components`: ``` >>> %timeit components = serialized.to_components() 73.8 µs ± 812 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) >>> list(components.keys()) ['num_buffers', 'data', 'num_tensors'] >>> len(components['data']) 101 >>> type(components['data'][0]) pyarrow.lib.Buffer ``` and ``` >>> %timeit recons = pa.deserialize_components(components) 93.6 µs ± 260 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) ``` The reason there are 101 data components (1 + 2 * 50) is that: * 1 buffer for the serialized Union stream representing the object * 2 buffers for each of the tensors: 1 for the metadata and 1 for the tensor body. The body is separate so that this is zero-copy from the input Next step after this is ARROW-1784 which is to transport a pandas.DataFrame using this mechanism cc @pitrou @jcrist @mrocklin Author: Wes McKinney <wes.mckinney@twosigma.com> Closes #1362 from wesm/ARROW-1783 and squashes the following commits: 4ec5a89f [Wes McKinney] Add missing decref on error e8c76d42 [Wes McKinney] Acquire GIL in GetSerializedFromComponents 1d2e0e27 [Wes McKinney] Fix function documentation fffc7bb6 [Wes McKinney] Typos, add deserialize_components to API 50d2fee5 [Wes McKinney] Finish componentwise serialization roundtrip 58174dde [Wes McKinney] More progress, stubs for reconstruction b1e31a34 [Wes McKinney] Draft GetTensorMessage 337e1d29 [Wes McKinney] Draft SerializedPyObject::GetComponents 598ef335 [Wes McKinney] Tweak