Commits

Wes McKinney authored 9b03947c436
ARROW-3928: [Python] Deduplicate Python objects when converting binary, string, date, time types to object arrays

This adds a `deduplicate_objects` option to all of the `to_pandas` methods. It works with string types, date types (when `date_as_object=True`), and time types. I also made it so that `ScalarMemoTable` can be used with `string_view`, for more efficient memoization in this case.

I made the default for `deduplicate_objects` True. When the ratio of unique strings to the length of the array is low, this not only uses drastically less memory, it is also faster. I will write some benchmarks to show where the "crossover point" is, beyond which the overhead of hashing makes things slower.

Let's consider a simple case where we have 10,000,000 strings of length 10, but only 1000 unique values:

```
In [50]: import pandas.util.testing as tm

In [51]: unique_values = [tm.rands(10) for i in range(1000)]

In [52]: values = unique_values * 10000

In [53]: arr = pa.array(values)

In [54]: timeit arr.to_pandas()
236 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [55]: timeit arr.to_pandas(deduplicate_objects=False)
730 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Almost 3 times faster in this case. The difference in memory use is even more drastic:

```
In [44]: unique_values = [tm.rands(10) for i in range(1000)]

In [45]: values = unique_values * 10000

In [46]: arr = pa.array(values)

In [49]: %memit result11 = arr.to_pandas()
peak memory: 1505.89 MiB, increment: 76.27 MiB

In [50]: %memit result12 = arr.to_pandas(deduplicate_objects=False)
peak memory: 2202.29 MiB, increment: 696.11 MiB
```

As you can see, this is a huge problem. If our bug reports about Parquet memory use problems are any indication, users have been suffering from this issue for a long time.

When the strings are mostly unique, things are slower as expected, and peak memory use is higher because of the hash table:

```
In [17]: unique_values = [tm.rands(10) for i in range(500000)]

In [18]: values = unique_values * 2

In [19]: arr = pa.array(values)

In [20]: timeit result = arr.to_pandas()
177 ms ± 574 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [21]: timeit result = arr.to_pandas(deduplicate_objects=False)
70.1 ms ± 783 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [42]: %memit result8 = arr.to_pandas()
peak memory: 644.39 MiB, increment: 92.23 MiB

In [43]: %memit result9 = arr.to_pandas(deduplicate_objects=False)
peak memory: 610.85 MiB, increment: 58.41 MiB
```

In real-world work, many duplicated strings is the most common case. Given the massive reduction in memory use and the moderate performance improvement, it makes sense to have this enabled by default.

Author: Wes McKinney <wesm+git@apache.org>

Closes #3257 from wesm/ARROW-3928 and squashes the following commits:

d9a88700 <Wes McKinney> Prettier output
a00b51c7 <Wes McKinney> Add benchmarks for object deduplication
ca88b963 <Wes McKinney> Add Python unit tests, deduplicate for date and time types also when converting to Python objects
7a7873b8 <Wes McKinney> First working iteration of string deduplication when calling to_pandas
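
For reference, a minimal sketch of exercising the `deduplicate_objects` option described in the commit message above, assuming a pyarrow build that includes this change. The data here is purely illustrative, and the identity assertion reflects the deduplication behavior the commit describes (equal values mapping to a single shared Python object); it is an assumption about observable behavior, not part of the patch itself.

```python
import pyarrow as pa

# Array with many repeated strings: 1,000 unique values repeated 100 times,
# mirroring (at smaller scale) the low-cardinality benchmark case above.
unique_values = ["value_%04d" % i for i in range(1000)]
arr = pa.array(unique_values * 100)

# With deduplication (the default after this change), equal strings in the
# resulting object array are expected to be the same Python object.
deduped = arr.to_pandas(deduplicate_objects=True)
assert deduped[0] is deduped[1000]  # both positions hold "value_0000"

# Without deduplication, each element is materialized as a fresh object;
# the values still compare equal.
plain = arr.to_pandas(deduplicate_objects=False)
assert plain[0] == plain[1000]
```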