Commits


Maarten A. Breddels authored and Antoine Pitrou committed 8773b9d45d2
ARROW-10557: [C++] Add scalar string slicing/substring extract kernel

Needs a rebase after https://github.com/apache/arrow/pull/8621 is merged.

I totally agree with https://github.com/python/cpython/blob/c9bc290dd6e3994a4ead2a224178bcba86f0c0e4/Objects/sliceobject.c#L252

This was tricky to get right; the main difficulty is in manually dealing with reverse iterators. Therefore I put on extra guardrails by having the Python unittests cover a lot of cases. All edge cases detected this way have been translated to the C++ unittest suite, so we could trim the Python tests to reduce pytest execution cost (I added 1 second).

Slicing is based on Python's `[start, stop)` inclusive/exclusive semantics, where an index refers to a codeunit (like Python apparently, though this is badly documented), and negative indices count from the right. `step != 0` is supported, like Python.

The only thing we cannot easily support is something like reversing a string. In Python one can do `s[::-1]` or `s[-1::-1]`, but we don't support empty values with the Option machinery (we model each bound as a C `int64`). To mimic this, we can do `pc.utf8_slice_codeunits(ar, start=-1, stop=-sys.maxsize, step=-1)` (i.e. pass a very large negative value as `stop`).
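
The reason a very large negative `stop` behaves like an omitted bound can be sketched in pure Python. This is only an illustration modeled on Python's own slice clamping rules, not the C++ kernel from this commit; the names `clamp_slice` and `slice_like_python` are hypothetical.

```python
import sys


def clamp_slice(length, start, stop, step):
    """Clamp integer slice bounds to a sequence of `length` items.

    Sketch of Python-style rules: negative bounds count from the right,
    and out-of-range bounds collapse to a limit that depends on direction.
    """
    if step == 0:
        raise ValueError("step must be non-zero")
    # When walking backwards, "before the first item" is represented by -1.
    lower, upper = (-1, length - 1) if step < 0 else (0, length)

    def clamp(index):
        if index < 0:
            index += length  # negative indices count from the right
            return max(index, lower)
        return min(index, upper)

    return clamp(start), clamp(stop)


def slice_like_python(s, start, stop, step):
    """Slice `s` using explicit integer bounds only (no None allowed)."""
    start, stop = clamp_slice(len(s), start, stop, step)
    return "".join(s[i] for i in range(start, stop, step))


# A huge negative stop clamps to -1, so the walk runs past index 0.
reversed_text = slice_like_python("hello", -1, -sys.maxsize, -1)
```

Here `reversed_text` is `"olleh"`, the same as `"hello"[::-1]`, even though every bound was passed as a plain integer.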
Libraries such as Pandas and Vaex can substitute large sentinel values in this way. This was confirmed to work by modifying the unittest as follows:

```python
import sys

import pyarrow as pa
import pyarrow.compute as pc
import pytest


@pytest.mark.parametrize('start', list(range(-6, 6)) + [None])
@pytest.mark.parametrize('stop', list(range(-6, 6)) + [None])
@pytest.mark.parametrize('step', [-3, -2, -1, 1, 2, 3])
def test_slice_compatibility(start, stop, step):
    input = pa.array(["", "𝑓", "𝑓ö", "𝑓öõ", "𝑓öõḍ", "𝑓öõḍš"])
    # Reference result computed with plain Python slicing (None allowed).
    expected = pa.array([k.as_py()[start:stop:step] for k in input])
    # Replace missing bounds with very large sentinels matching the direction.
    if start is None:
        start = -sys.maxsize if step > 0 else sys.maxsize
    if stop is None:
        stop = sys.maxsize if step > 0 else -sys.maxsize
    result = pc.utf8_slice_codeunits(input, start=start, stop=stop, step=step)
    assert expected.equals(result)
```

So libraries using this kernel can implement the full Python behavior with this workaround.

Closes #9000 from maartenbreddels/ARROW-10557

Lead-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>