Commit 894f5fb721b — authored by Thomas Newton, committed by GitHub
GH-15233: [C++] Fix CopyFiles when destination is a FileSystem with background_writes (#44897)

### Rationale for this change

Currently `CopyFiles` deadlocks when trying to upload more files than we have IO threads.

### What changes are included in this PR?

1. Use `CloseAsync()` inside `copy_one_file` so that it returns futures instead of blocking while it waits for other tasks to complete. This avoids the deadlock.
2. Fix a segfault in `FileInterface::CloseAsync()` by using `shared_from_this()` so that the file object doesn't get destructed prematurely. `shared_from_this()` is already used in the `CloseAsync()` implementations of several filesystems. After change 1 this is important when downloading to the local filesystem using `CopyFiles`.
3. Spawn `copy_one_file` less urgently than the default, so that `background_writes` run with higher priority. Otherwise `copy_one_file` would keep buffering more data in memory without giving the `background_writes` any chance to upload the data and drop it from memory, so large copies would cause OOMs.
4. Make the thread pool honor task priorities. The current Arrow thread pool implementation makes provision for spawning with a set priority, but the priority numbers are simply ignored.
   - I made a small change to use a `std::priority_queue` to implement the priority.
   - The current implementation has the property that tasks are scheduled in the order of calls to `Spawn`. A few test cases in `arrow-acero-hash-aggregate-test` fail if this property is not maintained. I'm not sure what the correct option is, but for now I have ensured that tasks of the same priority run in the order they are spawned.
   - I think this is the most contentious part of the change, with the broadest possible impact if I introduce a bug. I would appreciate input on whether this seems like a reasonable change, and I'm very happy to move it to a separate PR and/or discuss further.

### Are these changes tested?

Added some extra automated tests.
I also ran some large-scale tests copying between Azure and local storage, with a directory of about 50,000 files of varying sizes totalling 160 GiB.

### Are there any user-facing changes?

1. `CopyFiles` should now work reliably.
2. `ThreadPool` now runs tasks in priority order.

* GitHub Issue: #15233

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>