Yibo Cai authored and Krisztián Szűcs committed b96cd3604d1 on 18 Apr 2020

ARROW-8496: [C++] Refine ByteStreamSplitDecodeScalar

I simplified the DecoderScalar code and see a huge performance boost
from clang-generated code. In my test on an Intel E5-2650 with clang-9,
the Decode_Float_Scalar test jumps from 600M/s to 20G/s, even better
than the SSE version (17G/s). Similar behaviour is observed on Arm64.

Some digging shows that clang auto-vectorizes the simplified decoder
code but gcc does not: https://godbolt.org/z/kq9FAs
Interestingly, gcc is able to auto-vectorize the EncoderFloatScalar code
but clang is not: https://godbolt.org/z/E3LnZD
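
For illustration only, a minimal sketch of the kind of plain scalar
decode loop discussed above. It is not the code from this change:
ByteStreamSplitDecodeSketch and DecodeSketchExample are hypothetical
names, and the sketch assumes the encoded buffer holds exactly
num_values values, with stream b storing byte b of every value. Whether
a given compiler vectorizes this loop depends on its version and flags
and has not been verified here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not Arrow's actual function). Reassembles
// num_values values of type T from sizeof(T) byte streams stored back
// to back in `data`: stream b holds byte b of every value.
template <typename T>
void ByteStreamSplitDecodeSketch(const uint8_t* data, size_t num_values,
                                 T* out) {
  constexpr size_t kNumStreams = sizeof(T);
  uint8_t* raw_out = reinterpret_cast<uint8_t*>(out);
  for (size_t i = 0; i < num_values; ++i) {
    for (size_t b = 0; b < kNumStreams; ++b) {
      // Byte b of value i lives at offset i within stream b.
      raw_out[i * kNumStreams + b] = data[b * num_values + i];
    }
  }
}

// Usage check: split two floats into byte streams by hand, then decode
// them and confirm the original bytes come back.
inline void DecodeSketchExample() {
  const float values[2] = {1.5f, -2.25f};
  const uint8_t* raw = reinterpret_cast<const uint8_t*>(values);
  uint8_t encoded[2 * sizeof(float)];
  for (size_t i = 0; i < 2; ++i) {
    for (size_t b = 0; b < sizeof(float); ++b) {
      encoded[b * 2 + i] = raw[i * sizeof(float) + b];
    }
  }
  float decoded[2];
  ByteStreamSplitDecodeSketch(encoded, 2, decoded);
  assert(std::memcmp(values, decoded, sizeof(values)) == 0);
}
```

The loop body has no branches and no state carried between iterations,
which is the shape of loop an auto-vectorizer can typically handle.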

NOTE: This scalar code is not exercised in the default x86_64 build,
which takes the SSE path. The Arm64 build takes this scalar code path.

Closes #6962 from cyb70289/bytesplit

Authored-by: Yibo Cai <yibo.cai@arm.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>