Qingping Hou authored and Andy Grove committed ee09cb6edce on 20 May 2020
ARROW-8839: [Rust] [DataFusion] support CSV schema inference in logical plan

This PR changes the schema argument of the `scan_csv` method to `Option<&Schema>`, so the schema is inferred when none is provided. Related changes needed to make this work:

* Added a delimiter argument to all CSV-related structs and functions
* Fixed a bug in the schema field inference function
* Made `arrow::csv::reader::infer_file_schema` public so it can be used by DataFusion
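
To illustrate the idea behind an optional schema, here is a minimal, dependency-free sketch of delimiter-aware schema inference with a fallback when no schema is supplied. It is not the code in this PR: `ColumnType`, `infer_schema`, and `resolve_schema` are hypothetical names, the type-merging rules are an assumption, and the naive `split` does not handle quoted fields.

```rust
/// Column types this sketch can infer (a small, hypothetical subset).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ColumnType {
    Int64,
    Float64,
    Boolean,
    Utf8,
}

/// Guess the type of a single raw field value.
fn infer_value_type(value: &str) -> ColumnType {
    if value.parse::<i64>().is_ok() {
        ColumnType::Int64
    } else if value.parse::<f64>().is_ok() {
        ColumnType::Float64
    } else if value == "true" || value == "false" {
        ColumnType::Boolean
    } else {
        ColumnType::Utf8
    }
}

/// Combine two types seen in the same column (assumed rules:
/// Int64 + Float64 widens to Float64, any other mismatch becomes Utf8).
fn merge_types(a: ColumnType, b: ColumnType) -> ColumnType {
    use ColumnType::*;
    match (a, b) {
        (x, y) if x == y => x,
        (Int64, Float64) | (Float64, Int64) => Float64,
        _ => Utf8,
    }
}

/// Infer (name, type) pairs from delimiter-separated text,
/// looking at no more than `max_records` data rows.
pub fn infer_schema(
    text: &str,
    delimiter: u8,
    has_header: bool,
    max_records: usize,
) -> Vec<(String, ColumnType)> {
    let delim = delimiter as char;
    let mut lines = text.lines().filter(|l| !l.is_empty());
    let first = match lines.next() {
        Some(line) => line,
        None => return Vec::new(),
    };
    let first_fields: Vec<&str> = first.split(delim).collect();
    let names: Vec<String> = if has_header {
        first_fields.iter().map(|s| s.to_string()).collect()
    } else {
        (1..=first_fields.len()).map(|i| format!("column_{}", i)).collect()
    };

    let mut types: Vec<Option<ColumnType>> = vec![None; names.len()];
    // Without a header, the first line is itself a data row.
    let first_data_row = if has_header { None } else { Some(first) };
    for line in first_data_row.into_iter().chain(lines).take(max_records) {
        for (i, value) in line.split(delim).enumerate().take(names.len()) {
            if value.is_empty() {
                continue; // empty values carry no type information
            }
            let seen = infer_value_type(value);
            types[i] = Some(match types[i] {
                Some(prev) => merge_types(prev, seen),
                None => seen,
            });
        }
    }

    // Columns with no non-empty values fall back to Utf8.
    names
        .into_iter()
        .zip(types)
        .map(|(name, t)| (name, t.unwrap_or(ColumnType::Utf8)))
        .collect()
}

/// Use the caller's schema if present, otherwise infer one.
pub fn resolve_schema(
    provided: Option<&[(String, ColumnType)]>,
    text: &str,
    delimiter: u8,
    has_header: bool,
) -> Vec<(String, ColumnType)> {
    match provided {
        Some(schema) => schema.to_vec(),
        None => infer_schema(text, delimiter, has_header, 1000),
    }
}
```

Calling `resolve_schema(None, text, b'|', true)` on pipe-delimited text infers the column types from the sampled rows, while passing `Some(...)` skips inference entirely.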

Known limitations:
* When given a directory of CSV files, the schema inference code reads rows only from the first file.
* To avoid adding yet another argument to all CSV-related functions, I hard-coded the number of rows read for schema inference to 1000.

Open questions:
* Should we rename the `datasource::csv::CsvFile` struct to `CsvTable` for consistency with `ParquetTable` and `MemoryTable`? The `CsvFile` implementation also supports reading from a directory of files, so `CsvFile` is not an accurate name.
* The argument lists of CSV-related functions are getting long. Should we introduce a CSV options struct that captures the following settings with sensible defaults?
  - schema
  - has_header
  - delimiter
  - infer_max_read_records
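
One possible shape for such an options struct is sketched below, purely as a starting point for discussion. The name `CsvOptions`, the builder-style setters, and the defaults for `has_header` and `delimiter` are assumptions; only the 1000-row default mirrors the value hard-coded in this PR. `Schema` here is a stand-in so the sketch compiles without the `arrow` crate.

```rust
/// Stand-in for `arrow::datatypes::Schema`, so the sketch has no dependencies.
#[derive(Debug, PartialEq)]
pub struct Schema {
    pub field_names: Vec<String>,
}

/// Hypothetical options struct bundling the CSV settings listed above.
#[derive(Debug, Clone, Copy)]
pub struct CsvOptions<'a> {
    /// `None` means "infer the schema from the data".
    pub schema: Option<&'a Schema>,
    pub has_header: bool,
    pub delimiter: u8,
    /// Maximum number of rows to read during schema inference.
    pub infer_max_read_records: usize,
}

impl<'a> Default for CsvOptions<'a> {
    fn default() -> Self {
        CsvOptions {
            schema: None,
            has_header: true,             // assumed default
            delimiter: b',',              // assumed default
            infer_max_read_records: 1000, // value hard-coded in this PR
        }
    }
}

impl<'a> CsvOptions<'a> {
    pub fn new() -> Self {
        Self::default()
    }

    /// Supply an explicit schema and skip inference.
    pub fn schema(mut self, schema: &'a Schema) -> Self {
        self.schema = Some(schema);
        self
    }

    pub fn has_header(mut self, has_header: bool) -> Self {
        self.has_header = has_header;
        self
    }

    pub fn delimiter(mut self, delimiter: u8) -> Self {
        self.delimiter = delimiter;
        self
    }

    pub fn infer_max_read_records(mut self, max_records: usize) -> Self {
        self.infer_max_read_records = max_records;
        self
    }
}
```

With this shape, a caller would override only what differs from the defaults, for example `CsvOptions::new().delimiter(b'|').has_header(false)`, and new settings could be added later without changing every function signature.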

Closes #7210 from houqp/csv_schema_infer

Authored-by: Qingping Hou <dave2008713@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>