Commits


michalursa authored and Benjamin Kietzman committed 92712af84e9
ARROW-12725: [C++][Compute] Column at a time hash and comparison in group by Current implementation of hash group by (GrouperFastImpl class) converts input ExecBatches to row-oriented format, then hashes and compares rows as if they were a single column. It is more efficient to avoid relatively costly encoding and instead compute hashes of individual columns in column-oriented format mixing them together, and similarly comparing column-oriented data to row-oriented data in the hash table without converting. Encoding only happens for a subset of input rows that are inserted into the hash table - they introduce new groups. Keys in hash table remain stored as row-oriented. There are 3 components that changed: - key_compare.cc - contains implementation of one column at a time comparison between column-oriented and row-oriented data - key_hash.cc - implements hashing individual columns of column-oriented key data and mixing the individual results together for a final multi-column hash - key_encode.cc - implements encoding in row-format only a selection of input rows; previously all input rows were encoded without support for selective encoding Performance before and after: Here are selected results (in millions of rows per second) of arrow-compute-aggregate-benchmark for 0.01% nulls in input columns (Ubuntu VM, Clang 10.0, Intel I9-10980XE): | Test | Before (AVX2) | After (AVX2) | Before (Scalar) | After (Scalar) | | ------------- | ------------- | ------------- | ------------- | ------------- | | TinyString | 53 | 96 | 25 | 38 | | SmallString | 48 | 80 | 23 | 36 | | MediumString | 47 | 65 | 22 | 32 | | TinyInt | 340 | 407 | 130| 136 | | SmallInt | 246 | 287 | 131 | 137 | | MediumInt | 194 | 208 | 106 | 111 | Closes #10290 from michalursa/ARROW-12725-column-at-a-time-groupby Lead-authored-by: michalursa <michal@ursacomputing.com> Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>