Commits

Wes McKinney authored 8b9f6b9d28b
ARROW-10598: [C++] Separate out bit-packing in internal::GenerateBitsUnrolled for better performance This issue was pointed out to me by some folks at UWisc and just getting around to trying out a fix. This produces a substantial performance benefit in numerical comparisons with gcc8 (with sse4.2 only) ``` benchmark baseline contender change % counters 5 GreaterArrayArrayInt64/32768/0 414.735m items/sec 1.525b items/sec 267.679 {'iterations': 8792, 'null_percent': 0.0} 2 GreaterArrayArrayInt64/32768/1 411.275m items/sec 1.505b items/sec 266.053 {'iterations': 8676, 'null_percent': 100.0} 16 GreaterArrayArrayInt64/32768/100 406.580m items/sec 1.487b items/sec 265.846 {'iterations': 8639, 'null_percent': 1.0} 20 GreaterArrayArrayInt64/32768/2 411.188m items/sec 1.498b items/sec 264.320 {'iterations': 8700, 'null_percent': 50.0} 6 GreaterArrayArrayInt64/32768/10000 407.791m items/sec 1.458b items/sec 257.431 {'iterations': 8777, 'null_percent': 0.01} 17 GreaterArrayArrayInt64/32768/10 410.239m items/sec 1.438b items/sec 250.545 {'iterations': 8663, 'null_percent': 10.0} 11 GreaterArrayScalarInt64/32768/0 638.903m items/sec 1.882b items/sec 194.540 {'iterations': 13433, 'null_percent': 0.0} 14 GreaterArrayScalarInt64/32768/2 635.268m items/sec 1.866b items/sec 193.770 {'iterations': 13468, 'null_percent': 50.0} 9 GreaterArrayScalarInt64/32768/100 629.207m items/sec 1.838b items/sec 192.130 {'iterations': 13268, 'null_percent': 1.0} 15 GreaterArrayScalarInt64/32768/10 632.650m items/sec 1.848b items/sec 192.115 {'iterations': 13321, 'null_percent': 10.0} 3 GreaterArrayScalarInt64/32768/1 636.101m items/sec 1.858b items/sec 192.093 {'iterations': 13465, 'null_percent': 100.0} 19 GreaterArrayScalarInt64/32768/10000 630.470m items/sec 1.835b items/sec 191.010 {'iterations': 13336, 'null_percent': 0.01} 12 GreaterArrayArrayString/32768/1 295.949m items/sec 302.807m items/sec 2.317 {'iterations': 6323, 'null_percent': 100.0} 21 GreaterArrayScalarString/32768/2 274.259m items/sec 278.502m items/sec 1.547 {'iterations': 5847, 'null_percent': 50.0} 22 GreaterArrayScalarString/32768/1 1.007b items/sec 1.020b items/sec 1.244 {'iterations': 21309, 'null_percent': 100.0} 8 GreaterArrayArrayString/32768/2 148.096m items/sec 146.908m items/sec -0.802 {'iterations': 3172, 'null_percent': 50.0} 10 GreaterArrayArrayString/32768/100 104.447m items/sec 103.111m items/sec -1.280 {'iterations': 2231, 'null_percent': 1.0} 13 GreaterArrayArrayString/32768/10 105.158m items/sec 103.790m items/sec -1.300 {'iterations': 2246, 'null_percent': 10.0} 7 GreaterArrayArrayString/32768/10000 104.408m items/sec 102.958m items/sec -1.388 {'iterations': 2225, 'null_percent': 0.01} 4 GreaterArrayArrayString/32768/0 105.920m items/sec 103.817m items/sec -1.985 {'iterations': 2263, 'null_percent': 0.0} 18 GreaterArrayScalarString/32768/10 645.713m items/sec 614.437m items/sec -4.844 {'iterations': 13876, 'null_percent': 10.0} 1 GreaterArrayScalarString/32768/10000 977.632m items/sec 871.382m items/sec -10.868 {'iterations': 20892, 'null_percent': 0.01} 0 GreaterArrayScalarString/32768/0 999.585m items/sec 877.238m items/sec -12.240 {'iterations': 21302, 'null_percent': 0.0} 23 GreaterArrayScalarString/32768/100 934.922m items/sec 814.268m items/sec -12.905 {'iterations': 19892, 'null_percent': 1.0} ``` The difference is less pronounced with clang-8, but still present ``` benchmark baseline contender change % counters 18 GreaterArrayArrayInt64/32768/2 928.755m items/sec 1.655b items/sec 78.196 {'iterations': 19762, 'null_percent': 50.0} 6 GreaterArrayArrayInt64/32768/1 938.947m items/sec 1.666b items/sec 77.386 {'iterations': 19962, 'null_percent': 100.0} 4 GreaterArrayArrayInt64/32768/0 945.091m items/sec 1.658b items/sec 75.386 {'iterations': 20198, 'null_percent': 0.0} 0 GreaterArrayArrayInt64/32768/10 923.876m items/sec 1.610b items/sec 74.241 {'iterations': 19667, 'null_percent': 10.0} 23 GreaterArrayArrayInt64/32768/100 918.441m items/sec 1.598b items/sec 73.962 {'iterations': 19602, 'null_percent': 1.0} 2 GreaterArrayArrayInt64/32768/10000 914.122m items/sec 1.573b items/sec 72.085 {'iterations': 19437, 'null_percent': 0.01} 21 GreaterArrayScalarString/32768/2 243.973m items/sec 246.087m items/sec 0.866 {'iterations': 5224, 'null_percent': 50.0} 17 GreaterArrayScalarInt64/32768/1 2.736b items/sec 2.638b items/sec -3.590 {'iterations': 58090, 'null_percent': 100.0} 9 GreaterArrayArrayString/32768/0 136.733m items/sec 130.784m items/sec -4.351 {'iterations': 2905, 'null_percent': 0.0} 22 GreaterArrayArrayString/32768/100 136.434m items/sec 127.740m items/sec -6.372 {'iterations': 2918, 'null_percent': 1.0} 19 GreaterArrayArrayString/32768/10000 136.639m items/sec 127.704m items/sec -6.539 {'iterations': 2916, 'null_percent': 0.01} 15 GreaterArrayScalarInt64/32768/10000 2.693b items/sec 2.512b items/sec -6.742 {'iterations': 58669, 'null_percent': 0.01} 3 GreaterArrayScalarInt64/32768/0 2.795b items/sec 2.592b items/sec -7.248 {'iterations': 58721, 'null_percent': 0.0} 13 GreaterArrayArrayString/32768/10 139.225m items/sec 128.587m items/sec -7.641 {'iterations': 2980, 'null_percent': 10.0} 11 GreaterArrayScalarInt64/32768/2 2.759b items/sec 2.537b items/sec -8.040 {'iterations': 57925, 'null_percent': 50.0} 14 GreaterArrayScalarInt64/32768/10 2.767b items/sec 2.517b items/sec -9.045 {'iterations': 58391, 'null_percent': 10.0} 1 GreaterArrayScalarInt64/32768/100 2.738b items/sec 2.485b items/sec -9.246 {'iterations': 58094, 'null_percent': 1.0} 16 GreaterArrayArrayString/32768/1 794.894m items/sec 659.684m items/sec -17.010 {'iterations': 17020, 'null_percent': 100.0} 12 GreaterArrayArrayString/32768/2 182.007m items/sec 150.059m items/sec -17.553 {'iterations': 3905, 'null_percent': 50.0} 7 GreaterArrayScalarString/32768/1 1.093b items/sec 772.486m items/sec -29.297 {'iterations': 23230, 'null_percent': 100.0} 5 GreaterArrayScalarString/32768/10 691.609m items/sec 473.463m items/sec -31.542 {'iterations': 14691, 'null_percent': 10.0} 20 GreaterArrayScalarString/32768/100 1.002b items/sec 603.447m items/sec -39.761 {'iterations': 21340, 'null_percent': 1.0} 10 GreaterArrayScalarString/32768/0 1.054b items/sec 626.234m items/sec -40.574 {'iterations': 22461, 'null_percent': 0.0} 8 GreaterArrayScalarString/32768/10000 1.078b items/sec 605.930m items/sec -43.796 {'iterations': 23251, 'null_percent': 0.01} ``` I don't know why this seems to cause performance to be worse in the string comparison cases. There might be some other things to try here like bit-packing in larger batches. Closes #8671 from wesm/generate-bits-post-bitpack Authored-by: Wes McKinney <wesm@apache.org> Signed-off-by: Wes McKinney <wesm@apache.org>