Public / arrow / 7ad49eeca52

Commits

6e844219338
Wes McKinney authored 7ad49eeca5224 May 2020
ARROW-8792: [C++][Python][R][GLib] New Array compute kernels implementation and execution framework

This patch is a major reworking of our development strategy for implementing array-valued functions and applying them in a query processing setting.

The design was partly inspired by my previous work designing Ibis (https://github.com/ibis-project/ibis -- the "expr" subsystem and the way that operators validate input types and resolve output types). Using only function names and input types, you can determine the output types of each function and resolve the "execute" function that performs a unit of work processing a batch of data. This will allow us to build deferred column expressions and then (eventually) do parallel execution.

There are a ton of details, but one nice thing is that there is now a single API entry point for invoking any function by its name:

```c++
Result<Datum> CallFunction(ExecContext* ctx, const std::string& func_name,
                           const std::vector<Datum>& args,
                           const FunctionOptions* options = NULLPTR);
```

What occurs when you do this:

* A `Function` instance is looked up in the global `FunctionRegistry`
* Given the descriptors of `args` (their types and shapes -- array or scalar), the Function searches for `Kernel` that is able to process those types and shapes. A kernel might be able to do `array[T0], array[T1]` or only `scalar[T0], scalar[T1]`, for example. This permits kernel specialization to treat different type and shape combinations
* The kernel is executed iteratively against `args` based on what `args` contains -- if there are ChunkedArrays, they will be split into contiguous pieces. Kernels never see ChunkedArray, only Array or Scalar
* The Executor implementation is able to split contiguous Array inputs into smaller chunks, which is important for parallel execution. See `ExecContext::set_exec_chunksize`

To summarize: the REGISTRY contains FUNCTIONS. A FUNCTION contains KERNELS. A KERNEL is a specific implementation of a function that services a particular type combination.

An additional effort in this patch is to radically simplify the process of creating kernels that are based on a scalar function. To do this, there is a growing collection of template-based kernel generation classes in compute/kernels/codegen_internal.h that will surely be the topic of much debate. I want to make it a lot easier for people to add new kernels.

There are some other incidental changes in the PR, such as changing the convenience APIs like `Cast` to return `Result`. I'm afraid we may have to live with the API breakage unless someone else wants to add backward compatibility code for the old APIs.

I have to apologize for making such a large PR. I've been working long hours on this for nearly a month and the process of porting all of our existing functionality and making the unit tests pass caused much iteration in the "framework" part of the code, such that it would have been a huge time drain to review incomplete iterations of the framework that had not been proven to capture the functionality that previously existed in the project.

Given the size of this PR and that fact that it completely blocks any work into src/arrow/compute, I don't think we should let this sit unmerged for more than 4 or 5 days, tops. I'm committed to responding to all of your questions and working to address your feedback about the design and improving the documentation and code comments. I tried to leave copious comments to explain my thought process in various places. Feel free to make any and all comments in this PR or in whatever form you like. I don't think that merging should be blocked on stylistic issues.

Closes #7240 from wesm/ARROW-8792-kernels-revamp

Lead-authored-by: Wes McKinney <wesm+git@apache.org>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>