Commits


Florian Jetter authored and GitHub committed c57115de8d9
GH-40142: [Python] Allow FileInfo instances to be passed to dataset init (#40143)

### Rationale for this change

Closes https://github.com/apache/arrow/issues/40142

I'm developing a new Dask integration with the PyArrow Parquet reader (see https://github.com/dask-contrib/dask-expr/pull/882) and want to rely more on the PyArrow filesystem. Right now we perform a list operation ourselves to get all touched files, and I would like to pass the retrieved `FileInfo` objects directly to the dataset constructor. This API is already exposed in C++, and this PR adds the necessary Python bindings. The benefit of this API is that it cuts out the additional HEAD requests to remote storage. This came up in https://github.com/apache/arrow/issues/38389#issuecomment-1774777681, and there has already been related work in https://github.com/apache/arrow/issues/37857.

### What changes are included in this PR?

Python bindings for the `DatasetFactory` constructor that accepts a list/vector of `FileInfo` objects.

### Are these changes tested?

~I slightly modified the minio test setup so that the Prometheus endpoint is exposed. This could be used to assert that no HEAD requests were made.~ I ended up removing this again since parsing the response is a bit brittle.

### Are there any user-facing changes?

* Closes: #40142

Lead-authored-by: fjetter <fjetter@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
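
For illustration, a minimal sketch of how the new binding might be used from Python. It assumes `pyarrow.dataset.dataset()` accepts a list of `FileInfo` objects as the source once this change is in; the bucket name, region, and file layout are hypothetical.

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Hypothetical S3 setup; any pyarrow FileSystem works the same way.
s3 = fs.S3FileSystem(region="us-east-1")

# A single LIST operation retrieves FileInfo objects (path, size, type).
selector = fs.FileSelector("my-bucket/dataset/", recursive=True)
file_infos = s3.get_file_info(selector)
parquet_infos = [fi for fi in file_infos if fi.path.endswith(".parquet")]

# Passing the FileInfo objects directly is intended to avoid the per-file
# HEAD requests that would otherwise be needed to discover file sizes.
dataset = ds.dataset(parquet_infos, format="parquet", filesystem=s3)
table = dataset.to_table()
```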