Michal Nowakiewicz authored and Benjamin Kietzman committed c697a41ab9c 21 May 2021

ARROW-12010: [C++][Compute] Improve performance of the hash table used in GroupIdentifier

This is a draft version of the code that maps an arbitrary set of input columns, treated as the key in a grouping operation, to a vector of integer group identifiers (identical combinations of key column values get the same id).
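
To illustrate the mapping described above, here is a minimal sketch. It is not the code in this change: `AssignGroupIds` is a hypothetical name, key columns are assumed to be equal-length `int64_t` vectors, and `std::map` stands in for the custom hash table.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical helper, not part of this change: assigns a dense 32-bit group
// id to each distinct combination of values across the key columns.
// Assumption: all key columns have the same length.
std::vector<uint32_t> AssignGroupIds(
    const std::vector<std::vector<int64_t>>& key_columns) {
  const std::size_t num_rows = key_columns.empty() ? 0 : key_columns[0].size();
  std::map<std::vector<int64_t>, uint32_t> seen;  // key row -> group id
  std::vector<uint32_t> group_ids;
  group_ids.reserve(num_rows);
  for (std::size_t row = 0; row < num_rows; ++row) {
    // Gather the key for this row from every key column.
    std::vector<int64_t> key;
    key.reserve(key_columns.size());
    for (const auto& column : key_columns) key.push_back(column[row]);
    // emplace inserts only if the key is new; a new key gets the next id.
    auto result =
        seen.emplace(std::move(key), static_cast<uint32_t>(seen.size()));
    group_ids.push_back(result.first->second);
  }
  return group_ids;
}
```

With key columns `{1, 2, 1, 2}` and `{10, 20, 10, 30}`, rows 0 and 2 share a key and therefore share a group id, while rows 1 and 3 each get their own.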

I will continue working on it and updating it with:
- integration with the initial hash group-by implementation in the Arrow project, once it is finished and merged into master
- unit tests
- documentation

At this point, group ids, row ids, offsets, and hash values are all 32-bit. Overflow checks are missing in the current version and still need to be added.
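
As a rough sketch of the kind of check that is still missing, the helper below refuses to advance a 32-bit offset when the result would not fit. `CheckedAddOffset` is a hypothetical name used only for illustration; the actual fix may look different.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper, not part of this change: adds `length` to a 32-bit
// offset and returns false instead of wrapping when the sum would exceed
// the uint32_t range. `*out` is written only on success.
bool CheckedAddOffset(uint32_t offset, uint64_t length, uint32_t* out) {
  const uint64_t max_offset = std::numeric_limits<uint32_t>::max();
  // Compare against the remaining headroom so the check itself cannot overflow.
  if (length > max_offset - offset) return false;
  *out = offset + static_cast<uint32_t>(length);
  return true;
}
```

A caller appending variable-length keys could use the return value to report an error or to start a new batch before any offset wraps around.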

The entry point for id mapping is the GroupBy class. It uses three main modules: storage, defined in the groupby_storage* files; hashing, defined in the groupby_hash* files; and the hash table, defined in the groupby_map* files. Key values stored with the hash table are row-oriented. The storage module defines functions that convert from column-oriented storage to row-oriented storage and back. It also implements key comparison and appending keys to the incremental store.
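
The column-to-row conversion can be sketched as follows for the simplest case. This is an illustration under stated assumptions, not the storage module's API: `RowTable`, `ColumnsToRows`, `RowsToColumns`, and `RowsEqual` are hypothetical names, and every key column is assumed to be a fixed-width `int32_t` column of the same length.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical row-oriented buffer: row r occupies
// values[r * num_cols, (r + 1) * num_cols).
struct RowTable {
  std::size_t num_cols = 0;
  std::vector<int32_t> values;
};

// Column-oriented -> row-oriented. Assumes equal-length columns.
RowTable ColumnsToRows(const std::vector<std::vector<int32_t>>& columns) {
  RowTable table;
  table.num_cols = columns.size();
  const std::size_t num_rows = columns.empty() ? 0 : columns[0].size();
  table.values.reserve(num_rows * table.num_cols);
  for (std::size_t row = 0; row < num_rows; ++row) {
    for (const auto& column : columns) table.values.push_back(column[row]);
  }
  return table;
}

// Row-oriented -> column-oriented (inverse of ColumnsToRows).
std::vector<std::vector<int32_t>> RowsToColumns(const RowTable& table) {
  std::vector<std::vector<int32_t>> columns(table.num_cols);
  const std::size_t num_rows =
      table.num_cols == 0 ? 0 : table.values.size() / table.num_cols;
  for (std::size_t row = 0; row < num_rows; ++row) {
    for (std::size_t col = 0; col < table.num_cols; ++col) {
      columns[col].push_back(table.values[row * table.num_cols + col]);
    }
  }
  return columns;
}

// Compares two stored keys; each row is contiguous, so this is one range
// comparison rather than one lookup per column.
bool RowsEqual(const RowTable& table, std::size_t a, std::size_t b) {
  const int32_t* base = table.values.data();
  const std::size_t n = table.num_cols;
  return std::equal(base + a * n, base + (a + 1) * n, base + b * n);
}
```

Keeping each key contiguous is what makes the comparison a single pass over adjacent memory, which is one plausible reason to store hash table keys row-oriented.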

I plan to add a design doc in the form of a README file later on.

The individual modules and functions in this change have been unit tested and pass, but the unit tests themselves are not included in this change yet.

Closes #9768 from michalursa/ARROW-12010-GroupIdentifier

Lead-authored-by: "Michal Nowakiewicz <michal@ursacomputing.com>"
Co-authored-by: michalursa <michal@ursacomputing.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>