Kevin Gurney authored and GitHub committed 80b76584729
GH-36250: [MATLAB] Add `arrow.array.StringArray` class (#36366)

### Rationale for this change

Thanks to @sgilmore10's [recent changes to enable UTF-8 <-> UTF-16 string conversions](#36167), we can now add support for creating Arrow `String` arrays (UTF-8 encoded) from MATLAB `string` arrays (UTF-16 encoded).

### What changes are included in this PR?

1. Added new `arrow.array.StringArray` class that can be constructed from MATLAB [`string`](https://www.mathworks.com/help/matlab/ref/string.html?s_tid=doc_ta) and [`cellstr`](https://www.mathworks.com/help/matlab/ref/cellstr.html) types. **Note**: We explicitly decided to *not* support [`char`](https://www.mathworks.com/help/matlab/ref/char.html?s_tid=doc_ta) arrays for the time being.
2. Factored out code for extracting "raw" `const uint8_t*` from a MATLAB `logical` Data Array into a new function `bit::unpacked_as_ptr` so that it can be reused across multiple Array `Proxy` classes. See https://github.com/apache/arrow/issues/36335.
3. Added new `arrow.type.StringType` type class and associated `arrow.type.ID.String` enum value.
4. Enabled support for creating `RecordBatch` objects from MATLAB `table`s containing `string` data.
5. Updated `arrow::matlab::array::proxy::Array::toString` code to convert from UTF-8 to UTF-16 for display in MATLAB.
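
For readers unfamiliar with the encoding step this change depends on, here is a minimal standalone C++ sketch of a UTF-16 to UTF-8 conversion. It is illustrative only: `utf16_to_utf8` is a hypothetical helper written for this note, not the function added in #36167 and not the code path used by `StringArray`. It assumes the input is a sequence of UTF-16 code units and reports unpaired surrogates as an error.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Arrow): encodes UTF-16 code units as UTF-8.
// Returns false if the input contains an unpaired surrogate.
bool utf16_to_utf8(const std::u16string& in, std::string* out) {
  out->clear();
  for (std::size_t i = 0; i < in.size(); ++i) {
    std::uint32_t cp = in[i];
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: combine with the following low surrogate.
      if (i + 1 >= in.size()) return false;
      const std::uint32_t lo = in[i + 1];
      if (lo < 0xDC00 || lo > 0xDFFF) return false;
      cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
      ++i;
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      return false;  // low surrogate without a preceding high surrogate
    }
    if (cp < 0x80) {
      // 1 byte: ASCII
      out->push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
      // 2 bytes
      out->push_back(static_cast<char>(0xC0 | (cp >> 6)));
      out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
      // 3 bytes
      out->push_back(static_cast<char>(0xE0 | (cp >> 12)));
      out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
      // 4 bytes: code points outside the Basic Multilingual Plane
      out->push_back(static_cast<char>(0xF0 | (cp >> 18)));
      out->push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  }
  return true;
}
```

Characters such as "😊" occupy two UTF-16 code units (a surrogate pair) and four UTF-8 bytes, which is why the Unicode round-trip cases in the examples are worth testing separately from ASCII input.
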
**Examples**

*Most MATLAB `string` arrays round-trip*

```matlab
>> matlabArray = ["A"; "B"; "C"]

matlabArray =

  3x1 string array

    "A"
    "B"
    "C"

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray =

[
  "A",
  "B",
  "C"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)

matlabArrayRoundTrip =

  3x1 string array

    "A"
    "B"
    "C"

>> isequal(matlabArray, matlabArrayRoundTrip)

ans =

  logical

   1
```

*MATLAB `string(missing)` values are mapped to `null` by default*

```matlab
>> matlabArray = ["A"; string(missing); "C"]

matlabArray =

  3x1 string array

    "A"
    <missing>
    "C"

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray =

[
  "A",
  null,
  "C"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)

matlabArrayRoundTrip =

  3x1 string array

    "A"
    <missing>
    "C"

>> isequaln(matlabArray, matlabArrayRoundTrip)

ans =

  logical

   1
```

*Unicode characters round-trip*

```matlab
>> matlabArray = ["😊"; "🌲"; "➞"]

matlabArray =

  3×1 string array

    "😊"
    "🌲"
    "➞"

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray =

[
  "😊",
  "🌲",
  "➞"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)

matlabArrayRoundTrip =

  3×1 string array

    "😊"
    "🌲"
    "➞"
```

*Create `StringArray` from `cellstr`*

```matlab
>> matlabArray = {'red'; 'green'; 'blue'}

matlabArray =

  3×1 cell array

    {'red'  }
    {'green'}
    {'blue' }

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray =

[
  "red",
  "green",
  "blue"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)

matlabArrayRoundTrip =

  3×1 string array

    "red"
    "green"
    "blue"
```

*Create `RecordBatch` from MATLAB `string` data*

```matlab
>> matlabTable = table(["😊"; "🌲"; "➞"])

matlabTable =

  3×1 table

    Var1
    ____

    "😊"
    "🌲"
    "➞"

>> arrowRecordBatch = arrow.tabular.RecordBatch(matlabTable)

arrowRecordBatch =

Var1:   [
    "😊",
    "🌲",
    "➞"
  ]

>> matlabTableRoundTrip = toMATLAB(arrowRecordBatch)

matlabTableRoundTrip =

  3×1 table

    Var1
    ____

    "😊"
    "🌲"
    "➞"

>> isequaln(matlabTable, matlabTableRoundTrip)

ans =

  logical

   1
```

### Are these changes tested?

Yes.

1. Added new `tStringArray` test class.
2. Added new `tStringType` test class.
3. Extended `tRecordBatch` test class to verify support for MATLAB `table`s which contain `string` data (see above).

### Are there any user-facing changes?

Yes.

1. Users can now create `arrow.array.StringArray` objects from MATLAB `string` arrays and `cellstr`s.
2. Users can now create `arrow.type.StringType` objects.
3. Users can now construct `RecordBatch` objects from MATLAB `table`s that contain `string` data.

### Future Directions

1. The implementation of this initial version of `StringArray` is relatively simple in that it does not include a `BinaryArray` class hierarchy. In the future, we will likely want to refactor `StringArray` to inherit from a more general abstract `BinaryArray` class hierarchy.
2. Following on from 1., we will ideally want to add support for `LargeStringArray`, `BinaryArray`, `LargeBinaryArray`, and `FixedLengthBinaryArray` by creating common infrastructure for representing binary types. This initial version of `StringArray` helps to solidify the user-facing design and provides a shorter-term solution for working with `string` data, since it is quite common.
3. It may make sense to change the `arrow.type.Type` hierarchy (e.g. `arrow.type.StringType`) in the future to delegate to C++ `Proxy` classes under the hood. See: #36363.
4. Use `bit::unpacked_as_ptr` in other classes. See https://github.com/apache/arrow/issues/36335.
5. Look for more ways to optimize the conversion from MATLAB UTF-16 encoded string data to Arrow UTF-8 encoded string data (e.g. by avoiding unnecessary data copies).

### Notes

1. Thank you @sgilmore10 for your help with this pull request!

* Closes: #36250

Lead-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Kevin Gurney <kevin.p.gurney@gmail.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Sarah Gilmore <silgmore@mathworks.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>