Commits


Micah Kornfield authored and Wes McKinney committed e0a9d0f28af
ARROW-8504: [C++] Add BitRunReader and use it in parquet

Adds two implementations of a BitRunReader, which returns set/not-set and the number of bits in a row.

- Adds benchmarks comparing the two implementations under different distributions.
- Uses the reader in ParquetWriter (there is a second location, on the Nullable terminal node, that I left unchanged because it showed a performance drop of 30%; I think this is due to the issue described in the next bullet point, or because BitVisitor is getting vectorized to something more efficient).
- Refactors GetBatchedSpaced and GetBatchedSpacedWithDict:
  1. Uses a single templated method, with an added template parameter, so that the code can be shared.
  2. Does all checking for out-of-bounds indices in one go instead of on each pass through the literal (this is a slight behavior change, as the index returned will be different).
  3. Makes use of the BitRunReader.

Based on the BM_ColumnRead benchmarks, this seems to make performance worse by 50-80%. This was surprising to me. My current working theory is that the nullable benchmarks present the worst-case scenario: every other element is null, so the overhead of invoking the call is high relative to the existing code (under perf, calls to NextRun() jump to the top of the profile after this change). Before making a decision on reverting the use of BitRunReader here, I'd like to implement a more comprehensive set of benchmarks to test this hypothesis.

Other TODOs
- [x] Need to revert change to testing submodule hash
- [x] Add attribution to Wikipedia for InvertRemainingBits
- [x] Fix some unintelligible comments.
- [x] Resolve performance issues.

Closes #7143 from emkornfield/ARROW-8504

Authored-by: Micah Kornfield <emkornfield@gmail.com>
Signed-off-by: Wes McKinney <wesm@apache.org>
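To make the idea of a bit-run reader concrete, here is a minimal sketch of one that walks a validity bitmap and reports each run as a (length, set) pair. It is not the code from this commit: the names SimpleBitRunReader and CollectRuns are hypothetical, the reader inspects one bit at a time for clarity (a production reader would presumably process whole words), and it assumes least-significant-bit-first ordering within each byte, as in Arrow validity bitmaps.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One run of identical bits: how many bits, and whether they are set.
struct BitRun {
  int64_t length;
  bool set;
};

// Hypothetical, deliberately simple reader (illustrative only).
class SimpleBitRunReader {
 public:
  SimpleBitRunReader(const uint8_t* bitmap, int64_t start_offset, int64_t length)
      : bitmap_(bitmap), position_(start_offset), end_(start_offset + length) {
    assert(start_offset >= 0 && length >= 0);
  }

  // Returns the next run of equal bits; a run of length 0 means the
  // requested range is exhausted.
  BitRun NextRun() {
    if (position_ >= end_) return {0, false};
    const bool value = GetBit(position_);
    const int64_t run_start = position_;
    ++position_;
    while (position_ < end_ && GetBit(position_) == value) ++position_;
    return {position_ - run_start, value};
  }

 private:
  // Assumes LSB-first bit numbering within each byte.
  bool GetBit(int64_t i) const { return ((bitmap_[i / 8] >> (i % 8)) & 1) != 0; }

  const uint8_t* bitmap_;
  int64_t position_;
  int64_t end_;
};

// Helper (also hypothetical) that drains the reader into a vector of runs.
std::vector<BitRun> CollectRuns(const std::vector<uint8_t>& bitmap,
                                int64_t offset, int64_t length) {
  SimpleBitRunReader reader(bitmap.data(), offset, length);
  std::vector<BitRun> runs;
  for (BitRun run = reader.NextRun(); run.length > 0; run = reader.NextRun()) {
    runs.push_back(run);
  }
  return runs;
}
```

With a bitmap whose bits alternate between set and not-set, every run has length 1, so a consumer calls NextRun() once per element. That is the situation the working theory above describes: the per-call overhead is paid for every value instead of being amortized over long runs.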