Public / arrow / 92712af84e9

Commits

michalursa authored and Benjamin Kietzman committed 92712af84e931 Aug 2021
ARROW-12725: [C++][Compute] Column at a time hash and comparison in group by

Current implementation of hash group by (GrouperFastImpl class) converts input ExecBatches to row-oriented format,
then hashes and compares rows as if they were a single column.

It is more efficient to avoid relatively costly encoding and instead compute hashes of individual columns in column-oriented format mixing them together, and similarly comparing column-oriented data to row-oriented data in the hash table without converting.

Encoding only happens for a subset of input rows that are inserted into the hash table - they introduce new groups.
Keys in hash table remain stored as row-oriented.

There are 3 components that changed:
- key_compare.cc - contains implementation of one column at a time comparison between column-oriented and row-oriented data
- key_hash.cc - implements hashing individual columns of column-oriented key data and mixing the individual results together for a final multi-column hash
- key_encode.cc - implements encoding in row-format only a selection of input rows; previously all input rows were encoded without support for selective encoding

Performance before and after:
Here are selected results (in millions of rows per second) of arrow-compute-aggregate-benchmark for 0.01% nulls in input columns (Ubuntu VM, Clang 10.0, Intel I9-10980XE):
| Test | Before (AVX2) | After (AVX2) | Before (Scalar) | After (Scalar) |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| TinyString  | 53  | 96 | 25 | 38 |
| SmallString  | 48 | 80 | 23 | 36 |
| MediumString  | 47 | 65 | 22 | 32 |
| TinyInt  | 340  | 407 | 130| 136 |
| SmallInt  | 246  | 287 | 131 | 137 |
| MediumInt  | 194 | 208 | 106 | 111 |

Closes #10290 from michalursa/ARROW-12725-column-at-a-time-groupby

Lead-authored-by: michalursa <michal@ursacomputing.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>