Jonas Dedden authored and GitHub committed f6bfa7b2920 on 20 Feb 2025
GH-39010: [Python] Introduce `maps_as_pydicts` parameter for `to_pylist`, `to_pydict`, `as_py` (#45471)

### Rationale for this change

Currently, `MapScalar` and `MapArray` values are not converted back into proper Python `dict`s, which breaks "roundtrips" from Python -> Arrow -> Python:

```
import pyarrow as pa

schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data, schema=schema).to_pylist()
# [{'x': [('a', 1)]}]
```

This is especially painful when TiBs of deeply nested data (think of lists in structs in maps...) were created from Python and serialized into Arrow/Parquet, since they cannot be read back with native `pyarrow` methods without ugly and computationally costly workarounds.
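For illustration, one such workaround might look like the sketch below. It is pure Python and is not taken from this PR: `maps_to_dicts` is a hypothetical helper that walks the output of `to_pylist()` and rebuilds dicts from lists of `(key, value)` tuples. It assumes that only maps produce tuples in that output, and it cannot tell an empty map from an empty list, which is part of why this approach is unattractive.

```python
def maps_to_dicts(value):
    """Recursively rebuild dicts from lists of (key, value) tuples.

    Hypothetical post-processing helper, not part of pyarrow.
    """
    if isinstance(value, dict):
        # Structs arrive as dicts; convert their field values.
        return {k: maps_to_dicts(v) for k, v in value.items()}
    if isinstance(value, list):
        # Assumption: a non-empty list made only of 2-tuples came from a map.
        # An empty map arrives as [] and is left as an empty list.
        if value and all(isinstance(i, tuple) and len(i) == 2 for i in value):
            return {k: maps_to_dicts(v) for k, v in value}
        return [maps_to_dicts(item) for item in value]
    return value


rows = [{'x': [('a', 1)]}]  # shape returned by to_pylist() for a map column
converted = maps_to_dicts(rows)
print(converted)
```

Every value is visited in Python, so the cost grows with the total number of nested elements rather than being handled during conversion.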

### What changes are included in this PR?

A new parameter `maps_as_pydicts` is added to `to_pylist`, `to_pydict`, and `as_py`, which allows proper roundtrips:

```
import pyarrow as pa

schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data, schema=schema).to_pylist(maps_as_pydicts="strict")
# [{'x': {'a': 1}}]
```

### Are these changes tested?

Yes. Tests for `to_pylist` and `to_pydict` are included for `pyarrow.Table`, while `MapScalar` is tested at the low level, in particular when nested with `ListScalar` and `StructScalar`.

Duplicate keys should now also raise an error, and this is tested as well.
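The duplicate-key behaviour can be pictured with a small pure-Python sketch. It is not the code from this PR: `pairs_to_dict` is a hypothetical helper, the `KeyError` exception type is an assumption, and the `"lossy"` mode (warn and keep the last value) is not described in this message and is included only as an assumed contrast to `"strict"`.

```python
import warnings


def pairs_to_dict(pairs, mode="strict"):
    """Build a dict from (key, value) pairs, detecting duplicate keys.

    Hypothetical helper; the modes and exception type are assumptions.
    """
    if mode not in ("strict", "lossy"):
        raise ValueError(f"unknown mode: {mode!r}")
    result = {}
    for key, value in pairs:
        if key in result:
            if mode == "strict":
                # A dict cannot hold the same key twice, so refuse to drop data.
                raise KeyError(f"duplicate map key: {key!r}")
            warnings.warn(f"duplicate map key {key!r}; keeping the last value")
        result[key] = value
    return result


print(pairs_to_dict([("a", 1), ("b", 2)]))
try:
    pairs_to_dict([("a", 1), ("a", 2)])
except KeyError as exc:
    print("rejected:", exc)
```

Raising in strict mode makes the data loss visible: an Arrow map may contain repeated keys, whereas a Python `dict` would silently keep only one of them.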

### Are there any user-facing changes?

No call sites should break; the change only adds a new keyword-only optional parameter.
* GitHub Issue: #39010

Authored-by: Jonas Dedden <university@jonas-dedden.de>
Signed-off-by: Antoine Pitrou <antoine@python.org>