Commits


Jorge C. Leitao authored and Andy Grove committed bc987cdba0d
ARROW-9922: [Rust] Add StructArray::TryFrom (+40%) The core problem that this PR addresses is the construction of a `StructArray`, whose spec can be found [here](https://arrow.apache.org/docs/format/Columnar.html#struct-layout). The current API to build a `StructArray` of 4 entries of fixed type is (part of a test): ```rust let string_builder = StringBuilder::new(4); let int_builder = Int32Builder::new(4); let mut fields = Vec::new(); let mut field_builders = Vec::new(); fields.push(Field::new("f1", DataType::Utf8, false)); field_builders.push(Box::new(string_builder) as Box<ArrayBuilder>); fields.push(Field::new("f2", DataType::Int32, false)); field_builders.push(Box::new(int_builder) as Box<ArrayBuilder>); let mut builder = StructBuilder::new(fields, field_builders); assert_eq!(2, builder.num_fields()); let string_builder = builder .field_builder::<StringBuilder>(0) .expect("builder at field 0 should be string builder"); string_builder.append_value("joe").unwrap(); string_builder.append_null().unwrap(); string_builder.append_null().unwrap(); string_builder.append_value("mark").unwrap(); let int_builder = builder .field_builder::<Int32Builder>(1) .expect("builder at field 1 should be int builder"); int_builder.append_value(1).unwrap(); int_builder.append_value(2).unwrap(); int_builder.append_null().unwrap(); int_builder.append_value(4).unwrap(); builder.append(true).unwrap(); builder.append(true).unwrap(); builder.append_null().unwrap(); builder.append(true).unwrap(); let arr = builder.finish(); ``` This PR's proposal for the same array: ```rust let strings: ArrayRef = Arc::new(StringArray::from(vec![ Some("joe"), None, None, Some("mark"), ])); let ints: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), Some(2), None, Some(4)])); let arr = StructArray::try_from(vec![("f1", strings.clone()), ("f2", ints.clone())]).unwrap(); ``` Note that: * There is no `Field`, only name: the attributes (type and nullability) are obtained from the `ArrayData`'s itself, and thus there a guarantee that the field's attributes are aligned with the Data. * The implementation is dynamically typed: the type is obtained from `Array::data_type`, instead of having to match Field's datatype to each field' builders * `Option` is used to specify whether the quantity is null or not The construction uses an OR on the entry's null bitmaps to decide whether the struct null bitmap is null at a given index. I.e. the third index of the example in [the spec](https://arrow.apache.org/docs/format/Columnar.html#struct-layout) is obtained by checking if all fields are null at that index. There is an edge case, that this constructor is unable to build (and the user needs to use the other `From`): a struct with a `0` at position X and all field's bitmap at position X to be `1`: ``` # array of 1 entry: bitmap struct = [0] bitmap field1 = [1] bitmap field2 = [1] ``` this is because, in this `TryFrom`, the bitmap of the struct is computed from a bitwise `or` of the field's entries. IMO this is a non-issue because a `null` in the struct already implies an `unspecified` value on every field and thus that field's value is already assumed to be undefined. However, this is important to mention as a round-trip with this case will fail: in the example above, `bitmap struct` will have a `1`. Finally, this has a performance improvement of 40%. <details> <summary>Benchmark results</summary> ``` git checkout HEAD^ && cargo bench --bench array_from_vec -- struct_array_from_vec && git checkout no_builder1 && cargo bench --bench array_from_vec -- struct_array_from_vec ``` ``` struct_array_from_vec 128 time: [7.7464 us 7.7586 us 7.7731 us] change: [-39.227% -38.313% -37.128%] (p = 0.00 < 0.05) Performance has improved. Found 11 outliers among 100 measurements (11.00%) 5 (5.00%) high mild 6 (6.00%) high severe struct_array_from_vec 256 time: [9.3386 us 9.3611 us 9.3896 us] change: [-45.035% -44.498% -43.914%] (p = 0.00 < 0.05) Performance has improved. Found 12 outliers among 100 measurements (12.00%) 1 (1.00%) low mild 4 (4.00%) high mild 7 (7.00%) high severe struct_array_from_vec 512 time: [13.107 us 13.148 us 13.199 us] change: [-49.213% -48.705% -48.208%] (p = 0.00 < 0.05) Performance has improved. Found 16 outliers among 100 measurements (16.00%) 2 (2.00%) low mild 4 (4.00%) high mild 10 (10.00%) high severe struct_array_from_vec 1024 time: [20.036 us 20.061 us 20.087 us] change: [-54.254% -53.479% -52.776%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) high mild 6 (6.00%) high severe ``` </details> ## Final note: The general direction that I am heading with this is to minimize the usage of builders. My issue with builders is that they are statically typed and perform incremental changes, but almost all our operations are dynamically typed and in bulk: batch read, batch write, etc. As such, it is often faster (and much simpler from UX's perspective) to create a `Vec<Option<_>>` and use it to create an Arrow Array. FYI @nevi-me @andygrove @alamb Closes #8118 from jorgecarleitao/no_builder1 Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andy Grove <andygrove@nvidia.com>