Commits

Weston Pace authored 9abd2b14081
ARROW-14192: [C++][Dataset] Backpressure broken on ordered scans

While scanning we do our best to readahead multiple files so we will read files 1, 2, 3, and 4 all at the same time. This helps to maintain bandwidth when some files hit a snag (sometimes happens on AWS).

However, when doing an ordered scan, this can cause backpressure to explode when there is a slow consumer. The sequencer (placed at the end of the pipeline) can get into a situation where it pulls aggressively from files 2, 3, and 4 while waiting for the next chunk from file 1. Since the sequencer is consuming the batches the backpressure mechanism thinks they are being consumed. However, the actual consumer is leaving the batches piling up at the sequencer.

This PR introduces one possible solution (and it may be the only possible solution) which is to sequence the batches at merge time (early in the pipeline). The sequencer won't need to pull aggressively and backpressure will be maintained.

This pretty significantly reduces (but does not eliminate) the amount of file readahead we do in ordered scans. We can worry about that if it ends up being a bottleneck at some point but for now I think it is better we do not explode RAM.

This builds on ARROW-13611 and will remain in draft until that PR has merged.

Closes #11294 from westonpace/feature/ARROW-14192--backpressure-ordered-scan

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
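
The failure mode described above can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not Arrow's actual scanner code: the function `simulate`, its parameters, the tick-based timing, and the "file 0 stalls for a while" scenario are all hypothetical. It compares a sequencer that eagerly drains a bounded queue at the end of the pipeline with sequencing at merge time. The early-sequencing branch here is stricter than the real change, since it does no cross-file readahead at all.

```python
from collections import deque


def simulate(num_files, batches_per_file, stall_ticks, queue_limit, sequence_at_merge):
    """Toy model of an ordered scan; returns (peak batches held, delivery order)."""
    produced = [0] * num_files      # next batch index per file reader
    queue = deque()                 # bounded queue that backpressure watches
    parked = set()                  # out-of-order batches held by a late sequencer
    delivered = []                  # batches handed to the consumer, in order
    total = num_files * batches_per_file
    peak = 0
    tick = 0
    while len(delivered) < total:
        tick += 1
        for f in range(num_files):
            ready = f != 0 or tick > stall_ticks   # hypothetical: file 0 hits a snag
            # Early sequencing: only read a file once all earlier files are done.
            in_turn = all(produced[g] == batches_per_file for g in range(f))
            if produced[f] == batches_per_file or not ready:
                continue
            if sequence_at_merge and not in_turn:
                continue
            if len(queue) >= queue_limit:           # backpressure: pause readers
                continue
            queue.append((f, produced[f]))
            produced[f] += 1
        if not sequence_at_merge:
            # Late sequencer drains the queue eagerly, so it never looks full.
            parked.update(queue)
            queue.clear()
        peak = max(peak, len(queue) + len(parked))
        # Slow consumer: accepts at most one batch per tick, strictly in order.
        expected = divmod(len(delivered), batches_per_file)
        if sequence_at_merge:
            if queue:
                delivered.append(queue.popleft())
        elif expected in parked:
            parked.remove(expected)
            delivered.append(expected)
    return peak, delivered


if __name__ == "__main__":
    for early in (False, True):
        peak, _ = simulate(4, 10, 10, 2, sequence_at_merge=early)
        print("sequence_at_merge =", early, "peak batches held =", peak)
```

In this model the bounded queue never appears full to the readers when the sequencer sits at the end, so the parked batches grow well past `queue_limit`. With sequencing at merge time the number of held batches stays within the limit, at the cost of reduced readahead.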