Commits


Jorge C. Leitao authored and Neville Dipale committed 49e5b465346
ARROW-10010: [Rust] Speedup arithmetic (1.3-1.9x)

This PR speeds up arithmetic ops by leveraging vectorization of non-divide operations (in non-SIMD), as well as removing an unneeded operation in SIMD division.

For non-SIMD, this yields about `[-30%,-45%]` for all operations (`+-*/`).
For SIMD, this yields about `-30%` on division.

The culprit in non-SIMD was that we required the operation to return `Result<T::Native>`, which prevented the compiler from vectorizing the operation. Only the division requires `Result`. For divide, removing the operator further sped up the operation (I do not know the reason).

The culprit in SIMD was primarily one `simd_load` too many that was not doing anything.

## Benchmarks

The benchmark used:

```
set -e
git checkout 0852869d1a9b7da4a1b91fa7cb7d4ef48e99cdec
cargo bench --bench arithmetic_kernels
git checkout divide_simd_faster
cargo bench --bench arithmetic_kernels

echo "##################################"

git checkout 0852869d1a9b7da4a1b91fa7cb7d4ef48e99cdec
cargo bench --bench arithmetic_kernels --features simd
git checkout divide_simd_faster
cargo bench --bench arithmetic_kernels --features simd
```

and below are the results for the execution of the second `bench`, which is the one that gives the differential, on my machine:

### Non-SIMD

```
Previous HEAD position was 0852869d1 Improved benches for arithmetic.
Switched to branch 'divide_simd_faster'
   Compiling arrow v2.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 37.24s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/arithmetic_kernels-d281862a43faaf38
Gnuplot not found, using plotters backend
add 512                 time:   [1.4714 us 1.4758 us 1.4803 us]
                        change: [-44.446% -43.969% -43.522%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe

subtract 512            time:   [1.4825 us 1.4844 us 1.4866 us]
                        change: [-45.351% -45.018% -44.686%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

multiply 512            time:   [1.4895 us 1.4936 us 1.4990 us]
                        change: [-44.822% -44.135% -43.479%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

divide 512              time:   [1.9742 us 1.9773 us 1.9810 us]
                        change: [-33.273% -32.688% -32.052%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  7 (7.00%) high mild
  7 (7.00%) high severe

limit 512, 512          time:   [374.66 ns 375.64 ns 376.53 ns]
                        change: [-0.1000% +0.4442% +0.9503%] (p = 0.10 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

add_nulls_512           time:   [1.4880 us 1.4982 us 1.5115 us]
                        change: [-44.084% -43.116% -42.111%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

divide_nulls_512        time:   [1.9731 us 1.9758 us 1.9790 us]
                        change: [-33.404% -32.570% -31.416%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
```

### SIMD

Only divide is relevant here.

```
Previous HEAD position was 0852869d1 Improved benches for arithmetic.
Switched to branch 'divide_simd_faster'
   Compiling arrow v2.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 38.63s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/arithmetic_kernels-b8dc1739cfb5ae36
Gnuplot not found, using plotters backend
add 512                 time:   [879.31 ns 883.95 ns 889.17 ns]
                        change: [-0.2041% +0.6502% +1.5484%] (p = 0.15 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) high mild
  11 (11.00%) high severe

subtract 512            time:   [864.99 ns 866.95 ns 868.95 ns]
                        change: [-4.8531% -4.1561% -3.5163%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

multiply 512            time:   [862.85 ns 864.87 ns 867.71 ns]
                        change: [-3.8532% -3.1774% -2.4459%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe

divide 512              time:   [1.9703 us 1.9771 us 1.9843 us]
                        change: [-30.046% -29.457% -28.903%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

limit 512, 512          time:   [368.89 ns 369.96 ns 370.96 ns]
                        change: [-1.9574% -1.0063% -0.0347%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 26 outliers among 100 measurements (26.00%)
  5 (5.00%) low severe
  6 (6.00%) low mild
  9 (9.00%) high mild
  6 (6.00%) high severe

add_nulls_512           time:   [871.97 ns 876.99 ns 883.57 ns]
                        change: [-5.1106% -3.6889% -2.3080%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

divide_nulls_512        time:   [1.9582 us 1.9625 us 1.9678 us]
                        change: [-34.188% -33.161% -32.136%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
```

Closes #8191 from jorgecarleitao/divide_simd_faster

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
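
The non-SIMD change described in this commit (only division keeps a `Result`-returning operation, while the other operations use an infallible one) can be illustrated with a minimal sketch. This is not the code from the `arrow` crate: it works on plain slices rather than Arrow arrays, ignores null bitmaps, and the names `math_op`, `try_math_op`, `add` and `divide` are hypothetical, chosen only for illustration. The assumption being illustrated is that a loop body without an early error exit gives the compiler a better chance to auto-vectorize.

```rust
/// Infallible element-wise kernel (hypothetical helper, not the arrow API).
/// The closure returns a plain `T`, so the loop has no early-exit branch.
/// `zip` stops at the shorter of the two slices.
fn math_op<T, F>(left: &[T], right: &[T], op: F) -> Vec<T>
where
    T: Copy,
    F: Fn(T, T) -> T,
{
    left.iter().zip(right.iter()).map(|(&l, &r)| op(l, r)).collect()
}

/// Fallible element-wise kernel (hypothetical helper, not the arrow API).
/// The closure returns `Result`, so every element may abort the loop;
/// collecting into `Result<Vec<T>, E>` stops at the first error.
fn try_math_op<T, E, F>(left: &[T], right: &[T], op: F) -> Result<Vec<T>, E>
where
    T: Copy,
    F: Fn(T, T) -> Result<T, E>,
{
    left.iter().zip(right.iter()).map(|(&l, &r)| op(l, r)).collect()
}

/// Addition cannot fail here (it wraps on overflow), so it uses the
/// infallible kernel.
fn add(left: &[i32], right: &[i32]) -> Vec<i32> {
    math_op(left, right, |l, r| l.wrapping_add(r))
}

/// Division can fail, so it is the only operation that keeps `Result`.
/// `checked_div` returns `None` on a zero divisor or on overflow.
fn divide(left: &[i32], right: &[i32]) -> Result<Vec<i32>, String> {
    try_math_op(left, right, |l, r| {
        l.checked_div(r)
            .ok_or_else(|| "division by zero or overflow".to_string())
    })
}
```

With these definitions, `add(&[1, 2, 3], &[10, 20, 30])` yields the element-wise sums, while `divide(&[1, 2], &[1, 0])` returns an `Err` instead of panicking. Whether the infallible loop is actually vectorized depends on the compiler, the target and the optimization level, so a benchmark such as the one above remains the way to confirm any gain.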