Commits


alamb authored and Chao Sun committed 25b0b1b281b
ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions

When I was reading a parquet file into `RecordBatches` using `ParquetFileArrowReader` that had row groups that were 100,000 rows in length with a batch size of 60,000, after reading 300,000 rows successfully, I started seeing this error

```
ParquetError("Parquet error: Not all children array length are the same!")
```

Upon investigation, I found that when reading with `ParquetFileArrowReader`, if the parquet input file has multiple row groups, and if a batch happens to end at the end of a row group for Int or Float, no subsequent row groups are read

Visually:

```
+-----+
| RG1 |
|     |
+-----+  <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
+-----+
| RG2 |
|     |
+-----+
```

I traced the issue down to a bug in `PrimitiveArrayReader` where it mistakenly interprets reading `0` rows from a page reader as being at the end of the column.

This bug appears *not* to be present in the initial implementation #5378 -- FYI @andygrove and @liurenjie1024 (the test harness in this file is awesome, btw), but was introduced in https://github.com/apache/arrow/pull/7140. I will do some more investigating to ensure the test case described in that ticket is handled

Closes #8007 from alamb/alamb/ARROW-9790-record-batch-boundaries

Authored-by: alamb <andrew@nerdnetworks.org>
Signed-off-by: Chao Sun <sunchao@apache.org>
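
The boundary condition described in the message can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyColumn`, `next_batch`, and `read_all` are hypothetical names invented for illustration, and none of this is the parquet crate's actual `PrimitiveArrayReader` code. It models a column as a list of row groups and compares a loop that stops on any empty read with one that first checks for a further row group.

```rust
/// Toy stand-in for one column split across row groups (hypothetical type,
/// not part of the parquet crate). Each inner Vec is one row group.
struct ToyColumn {
    row_groups: Vec<Vec<i32>>,
    group: usize,  // index of the row group currently being read
    offset: usize, // read position inside that row group
}

impl ToyColumn {
    fn new(row_groups: Vec<Vec<i32>>) -> Self {
        ToyColumn { row_groups, group: 0, offset: 0 }
    }

    /// Reads up to `max` values from the current row group only.
    /// Returns 0 when the current row group is exhausted.
    fn read_current(&mut self, max: usize, out: &mut Vec<i32>) -> usize {
        let rg = match self.row_groups.get(self.group) {
            Some(rg) => rg,
            None => return 0,
        };
        let n = max.min(rg.len() - self.offset);
        out.extend_from_slice(&rg[self.offset..self.offset + n]);
        self.offset += n;
        n
    }

    /// Moves to the next row group; returns false if there is none.
    fn advance(&mut self) -> bool {
        if self.group + 1 < self.row_groups.len() {
            self.group += 1;
            self.offset = 0;
            true
        } else {
            false
        }
    }
}

/// Reads one batch. With `stop_on_empty_read = true` the loop treats a
/// zero-length read as "end of column" (the assumed buggy behaviour);
/// with `false` it only stops when no further row group exists.
fn next_batch(col: &mut ToyColumn, batch_size: usize, stop_on_empty_read: bool) -> Vec<i32> {
    let mut out = Vec::with_capacity(batch_size);
    while out.len() < batch_size {
        let want = batch_size - out.len();
        let n = col.read_current(want, &mut out);
        if stop_on_empty_read && n == 0 {
            break; // wrong: the current row group is empty, not the column
        }
        // A short read means the current row group is exhausted.
        if n < want && !col.advance() {
            break; // genuinely at the end of the column
        }
    }
    out
}

/// Collects batches until an empty batch is returned.
fn read_all(mut col: ToyColumn, batch_size: usize, stop_on_empty_read: bool) -> Vec<i32> {
    let mut all = Vec::new();
    loop {
        let batch = next_batch(&mut col, batch_size, stop_on_empty_read);
        if batch.is_empty() {
            return all;
        }
        all.extend(batch);
    }
}
```

In this model, two row groups of three values read with a batch size of 3 lose the second row group when `stop_on_empty_read` is `true`, because the first batch ends exactly on the row group boundary and the next read returns 0 rows. With a batch size of 2 the same flag is harmless, since no batch ends exactly on the boundary.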