Commits


Ziheng Wang authored and GitHub committed (ec7e250cf8f)
ARROW-17299: [C++][Python] Expose the Scanner kDefaultBatchReadahead and kDefaultFragmentReadahead parameters (#13799)

This exposes the fragment readahead and batch readahead flags of the C++ Scanner to the user in Python. They can be used to fine-tune RAM usage and IO utilization when downloading large files from S3 or other network sources. I believe the default settings are overly conservative for small-RAM configurations; I observe less than 20% IO utilization on some AWS instances.

The Python API exposes these flags only on the methods where they make sense. Scanning from a RecordBatchIterator does not need them, nor would they make sense there. Only the batch readahead flag makes sense when making a scanner from a fragment.

To test this, set up an i3.2xlarge instance on AWS:

```
import pyarrow
import pyarrow.dataset as ds
import pyarrow.csv as csv
import time

pyarrow.set_cpu_count(8)
pyarrow.set_io_thread_count(16)

lineitem_scheme = ["l_orderkey", "l_partkey", "l_suppkey", "l_linenumber", "l_quantity",
                   "l_extendedprice", "l_discount", "l_tax", "l_returnflag", "l_linestatus",
                   "l_shipdate", "l_commitdate", "l_receiptdate", "l_shipinstruct",
                   "l_shipmode", "l_comment", "null"]

csv_format = ds.CsvFileFormat(
    read_options=csv.ReadOptions(column_names=lineitem_scheme, block_size=32 * 1024 * 1024),
    parse_options=csv.ParseOptions(delimiter="|"))

dataset = ds.dataset("s3://TPC", format=csv_format)
s = dataset.to_batches(batch_size=1000000000)

count = 0
while count < 100:
    z = next(s)
    count += 1
```

For our purposes, let's just make the TPC dataset consist of hundreds of Parquet files, each with one row group (something that Spark would generate). This script gets somewhere around 1 Gbps. If you now do

```
s = dataset.to_batches(batch_size=1000000000, fragment_readahead=16)
```

you can get to 2.5 Gbps, which is the advertised steady-state rate cap for this instance type.

Authored-by: Ziheng Wang <zihengw@stanford.edu>
Signed-off-by: Antoine Pitrou <antoine@python.org>
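
As a supplementary illustration (not part of the original commit message), here is a minimal sketch of how the newly exposed parameters surface on the Python Dataset API, assuming a pyarrow build that includes this change. The dataset path and the readahead values below are placeholder assumptions, not tuned recommendations.

```
# Minimal sketch of the new keyword arguments on the Python Dataset API.
# The path "s3://TPC" and the numbers used here are illustrative only.
import pyarrow.dataset as ds

dataset = ds.dataset("s3://TPC", format="parquet")  # hypothetical dataset location

# batch_readahead: how many batches to read ahead within a fragment
# fragment_readahead: how many fragments (files) to read ahead concurrently
scanner = dataset.scanner(batch_size=1_000_000,
                          batch_readahead=16,
                          fragment_readahead=8)

for batch in scanner.to_batches():
    pass  # process each RecordBatch
```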