Wes McKinney authored 053cd2354cf
ARROW-5512: [C++] Rough API skeleton for C++ Datasets API / framework

This is almost exclusively header files, so I caution all against debating small details like function signatures, names, or what kind of smart pointer to use (if any) in various places. Instead, does the high-level structure seem reasonable (or at least, _not horrible_) as a starting point for more work?

Some of the naming is inspired by related concepts in Apache Iceberg (incubating) (https://github.com/apache/incubator-iceberg), which is a vertically integrated dataset metastore plus a reading and writing system specialized for that metastore.

Here is the basic idea:

* A Dataset (for reading, aka "scanning") consists of a schema (what kind of data you expect to receive) and one or more data sources
* A DataSource abstractly yields an iterator of DataFragment
* A DataFragment represents roughly one individual storage unit, like a file

Many interfaces involving collections are based around Iterators so that we have the option of implementing "lazy" Datasets that continue to discover their structure after we have already started scanning. It is a common problem in data warehousing that creating a detailed manifest of what needs to be scanned grows linearly in time with the complexity of the dataset (e.g. the number of fragments).

I abstracted the file-related logic away from the high-level interface since I would like to support data sources other than file-based ones:

* Flight streams: each endpoint from a DoGet operation in Flight corresponds to a DataFragment
* Database-like clients: e.g. the results of a SQL query form a DataFragment

There are some object layering issues that aren't worked out yet, and I think the only way to work them out is to work on the implementation and refactor until things feel right:

* It is the job of a FileFormat implementation to translate between the generic Dataset/Scan interfaces and the details of reading a particular file format
* Filtering can occur both at the Partition/Fragment level (i.e. "skip these files altogether") and at the post-materialization stage. In Iceberg these "post-materialization" filters are called "Residuals". For example, if the user wants `filter1 & filter2` to be applied and only `filter1` can be handled by the low-level file deserialization, we will have to apply `filter2` against the unfiltered in-memory RecordBatch and return the filtered RecordBatch to the user

Another objective of this framework is to draw a distinction between the Schema of a file and the Schema of the Dataset. This isn't fully reflected in the headers yet. To give an example, suppose that we wish to obtain a Dataset with schema

```
a: int64 nullable
b: double nullable
c: string nullable
```

When reading files in the Dataset, we might encounter fields that we don't want, or fields that are missing. We must _conform_ the physical data to the Dataset's desired Schema. Much of the hard labor will be in the file format implementations, to match up what's in the file with what the Dataset wants. We must also deal with other kinds of schema normalization issues, like one Parquet file declaring a field as "non-nullable" when the desired schema says "nullable".

Inferring the Schema of a Dataset when you don't know it outright is a whole separate matter. If you go to Scan a dataset without knowing its schema, you must necessarily do some amount of inference up front or just prior to scanning. We will need to offer both "low effort" inference (look at some, but not all, files, and do not expend too much energy on it -- e.g. in the case of CSV files you may reach a conclusion without parsing an entire file) and "high effort / exhaustive" Schema inference.
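To make the structure described above concrete, here is a rough, simplified sketch of how these pieces could fit together. The type names and signatures below are illustrative stand-ins, not the actual headers added in this commit:

```cpp
// Illustrative only: simplified stand-ins, not the actual headers in this
// commit. Schema, RecordBatch, and Expression are left as opaque declarations
// so the sketch stays self-contained.
#include <memory>
#include <utility>
#include <vector>

struct Schema;       // the Dataset-level schema: what the caller expects to receive
struct RecordBatch;  // a materialized chunk of data
struct Expression;   // a filter expression, possibly only partially pushed down

// A DataFragment is roughly one individual storage unit: a file, a Flight
// DoGet endpoint, the result set of a SQL query, ...
class DataFragment {
 public:
  virtual ~DataFragment() = default;

  // Materialize the fragment, pushing down as much of the filter as the
  // underlying format allows. Any residual ("post-materialization") filter
  // must still be applied to the returned batches by the scanner.
  virtual std::vector<std::shared_ptr<RecordBatch>> Scan(
      const std::shared_ptr<Expression>& pushed_down_filter) = 0;
};

// Iterator-style access so that a "lazy" DataSource can keep discovering new
// fragments while a scan is already in progress.
class DataFragmentIterator {
 public:
  virtual ~DataFragmentIterator() = default;
  virtual std::shared_ptr<DataFragment> Next() = 0;  // nullptr when exhausted
};

// A DataSource abstractly yields an iterator of DataFragment.
class DataSource {
 public:
  virtual ~DataSource() = default;
  virtual std::unique_ptr<DataFragmentIterator> GetFragments() = 0;
};

// A Dataset (for reading, aka "scanning") is a schema plus one or more data
// sources. Physical data from each fragment must be conformed to this schema
// by the format implementations.
class Dataset {
 public:
  Dataset(std::shared_ptr<Schema> schema,
          std::vector<std::shared_ptr<DataSource>> sources)
      : schema_(std::move(schema)), sources_(std::move(sources)) {}

  const std::shared_ptr<Schema>& schema() const { return schema_; }
  const std::vector<std::shared_ptr<DataSource>>& sources() const { return sources_; }

 private:
  std::shared_ptr<Schema> schema_;
  std::vector<std::shared_ptr<DataSource>> sources_;
};
```

In this sketch the residual filter would simply be whatever part of the caller's expression was not handled inside `Scan`, applied afterwards against the returned RecordBatches.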
As for the actual Scan execution, we are likely to run into some thread scheduling issues right away when trying to Scan files in parallel, as IO and CPU work are coordinated internally. The file reader implementations have their own internal parallelism, so that's something to contemplate as well.

In any case, I suggest we start small: create minimalistic interfaces to CSV and Parquet files, implement simple dataset discovery as we have now in pyarrow/parquet.py but a bit more general (see the sketch at the end of this message), and then investigate the more advanced features described in https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit piece by piece.

Author: Wes McKinney <wesm+git@apache.org>

Closes #4483 from wesm/datasets-api-prototype and squashes the following commits:

2f6440a2e <Wes McKinney> Remove not-currently-needed enum, add comment about an example partition structure
68712f870 <Wes McKinney> Fix clang warnings, test does not compile on Windows yet
ceec07bf9 <Wes McKinney> Finish some initial skeleton prototyping
20b8f4b28 <Wes McKinney> Compile a simple unit test
895a03ee6 <Wes McKinney> Checkpoint
01c4279a7 <Wes McKinney> Checkpoint
74bd283a1 <Wes McKinney> Begin API drafting
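For the "simple dataset discovery" step mentioned in the roadmap above, here is a rough, hypothetical sketch of what a first pass could look like, loosely modeled on how pyarrow/parquet.py interprets Hive-style `key=value` directory names as partitions. Everything here (type names, the `.parquet` filter, the example path) is illustrative and not part of this commit:

```cpp
// Hypothetical discovery sketch: walk a directory tree, treat every
// *.parquet file as one fragment, and parse Hive-style "key=value"
// directory names as partition information.
#include <filesystem>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

struct DiscoveredFragment {
  std::string path;
  // Partition keys parsed from the directory names, e.g. {"year", "2019"}.
  std::vector<std::pair<std::string, std::string>> partition_keys;
};

std::vector<DiscoveredFragment> DiscoverFragments(const fs::path& root) {
  std::vector<DiscoveredFragment> fragments;
  for (const auto& entry : fs::recursive_directory_iterator(root)) {
    if (!entry.is_regular_file() || entry.path().extension() != ".parquet") {
      continue;
    }
    DiscoveredFragment frag{entry.path().string(), {}};
    // Inspect each path component between the root and the file for "key=value".
    for (const auto& part : fs::relative(entry.path().parent_path(), root)) {
      const std::string name = part.string();
      const auto eq = name.find('=');
      if (eq != std::string::npos) {
        frag.partition_keys.emplace_back(name.substr(0, eq), name.substr(eq + 1));
      }
    }
    fragments.push_back(std::move(frag));
  }
  return fragments;
}

int main() {
  // "/tmp/my_dataset" is a placeholder; a missing directory will throw.
  for (const auto& frag : DiscoverFragments("/tmp/my_dataset")) {
    std::cout << frag.path << "\n";
  }
  return 0;
}
```

A real implementation would presumably hand these fragments to a file-format-aware DataSource rather than just collecting paths, and would layer the "low effort" vs. "exhaustive" schema inference discussed above on top.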