Commits


Uwe L. Korn authored and Wes McKinney committed 97e69b47f06
PARQUET-820: Decoders should directly emit arrays with spacing for null entries

Old:

```
In [3]: import pyarrow.io as paio
   ...: import pyarrow.parquet as pq
   ...:
   ...: with open('yellow_tripdata_2016-01.parquet', 'r') as f:
   ...:     buf = f.read()
   ...: buf = paio.buffer_from_bytes(buf)
   ...:
   ...: def read_parquet():
   ...:     reader = paio.BufferReader(buf)
   ...:     df = pq.read_table(reader)
   ...:
   ...: %timeit read_parquet()
   ...:
1 loop, best of 3: 1.21 s per loop
```

New:

```
In [1]: import pyarrow.io as paio
   ...: import pyarrow.parquet as pq
   ...:
   ...: with open('yellow_tripdata_2016-01.parquet', 'r') as f:
   ...:     buf = f.read()
   ...: buf = paio.buffer_from_bytes(buf)
   ...:
   ...: def read_parquet():
   ...:     reader = paio.BufferReader(buf)
   ...:     df = pq.read_table(reader)
   ...:
   ...: %timeit read_parquet()
   ...:
1 loop, best of 3: 906 ms per loop
```

Arrow->Pandas conversion for comparison:

```
In [5]: %timeit df.to_pandas()
1 loop, best of 3: 567 ms per loop
```

All benchmarks were done on a single core CPU

I have to add a better test coverage before this can go in.

There is still some room for future improvements that won't be done in this PR:

 * `DefinitionLevelsToBitmap` should be done in the DefinitionLevelsDecoder
 * `GetBatchWithDictSpaced` is something for a vectorization/bitmap ninja.

Author: Uwe L. Korn <uwelk@xhochy.com>
Author: Korn, Uwe <Uwe.Korn@blue-yonder.com>

Closes #218 from xhochy/PARQUET-820 and squashes the following commits:

e6db697 [Korn, Uwe] Add INIT_BITSET macro
8f17db9 [Korn, Uwe] Use arrow::TypeTraits
8dcab1b [Uwe L. Korn] Adjust documentation for ReadBatchSpaced
798bc83 [Uwe L. Korn] Test ReadSpaced
9dc6dc0 [Uwe L. Korn] Test DecodeSpaced
ccb70dc [Uwe L. Korn] Add fast path for non-nullable-batches
6f99191 [Uwe L. Korn] Move bit reading into a macro
393d99a [Uwe L. Korn] Explicitly mark overrides
3424ae3 [Uwe L. Korn] Make more use of the bitmaps
685ad34 [Uwe L. Korn] Remove unused include
9b0f105 [Uwe L. Korn] Use bitset in the whole GetBatchWithDict loop
907c165 [Uwe L. Korn] Use bitset in literalbatch
0ec4b38 [Uwe L. Korn] Remove unused code
f6c4b5e [Uwe L. Korn] ninja format
cbf0176 [Uwe L. Korn] DecodeSpaced in dictionary encoder
3dfa43b [Uwe L. Korn] Directly read valid_bits
15aa324 [Uwe L. Korn] Only use ReadSpaced where needed
96dd347 [Korn, Uwe] PARQUET-820: Decoders should directly emit arrays with spacing for null entries

Change-Id: Ibd4126fccbe70a54ac7c48c280bfa77ea2965205
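
Illustration (not part of the commit): the idea named in the commit title, emitting values with gaps left for null entries, can be sketched roughly as below. This is a minimal sketch for a flat, optional column where a definition level equal to the maximum means "value present". The function names `definition_levels_to_bitmap` and `decode_spaced` are hypothetical and only echo the names mentioned above; they are not the parquet-cpp API, and plain Python lists stand in for the packed bitmaps the real code uses.

```python
def definition_levels_to_bitmap(def_levels, max_def_level):
    """Hypothetical helper: mark a slot as valid when its definition
    level equals the maximum level (i.e. the value is present)."""
    return [level == max_def_level for level in def_levels]


def decode_spaced(dense_values, valid_bits, fill=None):
    """Hypothetical helper: spread densely stored values over an output
    that has one slot per row, leaving `fill` where the row is null."""
    out = [fill] * len(valid_bits)
    values = iter(dense_values)
    for i, is_valid in enumerate(valid_bits):
        if is_valid:
            # Only valid slots consume a value from the dense input.
            out[i] = next(values)
    return out


# Five rows, two of them null; only three values are physically stored.
def_levels = [1, 0, 1, 1, 0]
valid = definition_levels_to_bitmap(def_levels, max_def_level=1)
spaced = decode_spaced([10, 20, 30], valid)
print(spaced)  # [10, None, 20, 30, None]
```

Writing values straight into their final slots avoids a second pass that would otherwise copy the dense values into a null-aware array after decoding.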