Commits

Wes McKinney authored c7baa7c4dc4
ARROW-9029: [C++] Implement BitBlockCounter for much faster block popcounts of bitmaps The purpose of this class is to scan validity bitmaps in segments of 256 bits at a time (a "run") and return the number of true values using popcount hardware instrinsics. Processing code can then switch between nullable / non-nullable processing paths -- the non-nullable paths are often much faster as they don't have to branch or check individual bits. In the benchmark I wrote here, this strategy starts to become faster than using BitmapReader naively with a null density somewhere between 2% and 10%. I implemented a naive "sum non-null values" algorithm using BitBlockCounter versus a similarly naive version that uses BitmapReader. The benchmark state parameter is the average number of array values for each null ``` --------------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------------- BitBlockCounterSumNotNull/8 1566649 ns 1566653 ns 450 638.304M items/s BitBlockCounterSumNotNull/64 926258 ns 926257 ns 745 1079.61M items/s BitBlockCounterSumNotNull/512 355224 ns 355224 ns 1978 2.74915G items/s BitBlockCounterSumNotNull/4096 77070 ns 77070 ns 8597 12.6711G items/s BitBlockCounterSumNotNull/32768 20838 ns 20838 ns 32902 46.8638G items/s BitBlockCounterSumNotNull/65536 18538 ns 18538 ns 37300 52.6791G items/s BitBlockCounterSumNotNullWithOffset/8 1563335 ns 1563346 ns 426 639.654M items/s BitBlockCounterSumNotNullWithOffset/64 831714 ns 831720 ns 825 1.17415G items/s BitBlockCounterSumNotNullWithOffset/512 339929 ns 339931 ns 2035 2.87283G items/s BitBlockCounterSumNotNullWithOffset/4096 75150 ns 75148 ns 9349 12.9952G items/s BitBlockCounterSumNotNullWithOffset/32768 28726 ns 28727 ns 24736 33.9948G items/s BitBlockCounterSumNotNullWithOffset/65536 26921 ns 26921 ns 26026 36.2753G items/s BitmapReaderSumNotNull/8 1897087 ns 1897098 ns 368 527.121M items/s BitmapReaderSumNotNull/64 1050133 ns 1050134 ns 669 952.26M items/s BitmapReaderSumNotNull/512 960722 ns 960728 ns 744 1040.88M items/s BitmapReaderSumNotNull/4096 949578 ns 949584 ns 727 1053.09M items/s BitmapReaderSumNotNull/32768 946948 ns 946955 ns 722 1056.02M items/s BitmapReaderSumNotNull/65536 960649 ns 960637 ns 739 1040.98M items/s BitmapReaderSumNotNullWithOffset/8 1972476 ns 1972457 ns 350 506.982M items/s BitmapReaderSumNotNullWithOffset/64 1131682 ns 1131691 ns 636 883.634M items/s BitmapReaderSumNotNullWithOffset/512 991733 ns 991736 ns 729 1008.33M items/s BitmapReaderSumNotNullWithOffset/4096 982855 ns 982862 ns 719 1017.44M items/s BitmapReaderSumNotNullWithOffset/32768 983555 ns 983556 ns 691 1016.72M items/s BitmapReaderSumNotNullWithOffset/65536 1005651 ns 1005658 ns 701 994.374M items/s ``` So we can see that the performance is around the same when 1 in 8 values is null, but when 1 out of 512 is null, the block-counter version is 3x faster. And the performance goes up from there, up to 50x faster on data that has nulls but not very many. In my experience, data with < 1% nulls is extremely common, much more so than data with 5% or more nulls. This is obviously a tradeoff but IMHO one worth making. As a bonus, `BitBlockCounter` doesn't inline any code. The implementation of this can probably be improved and the benchmark as well so I welcome your collective help with this. Closes #7346 from wesm/ARROW-9029 Lead-authored-by: Wes McKinney <wesm@apache.org> Co-authored-by: Wes McKinney <wesm+git@apache.org> Signed-off-by: Wes McKinney <wesm@apache.org>