Commits


Neal Richardson authored and GitHub committed cc63a5da029
ARROW-16612: [R] Fix compression inference from filename (#13625)

This is actually a much larger change than the original issue.

* ~Infer compression from the file extension in `write_parquet()` and pass it to ParquetFileWriter rather than writing to a CompressedOutputStream, and don't wrap the file in a CompressedInputStream in `read_parquet()` because that doesn't work (and isn't how compression works for Parquet). Previously, reading from a file with extension `.parquet.gz` etc. would error unless you opened an input stream yourself. This is the original report from ARROW-16612.~ Cut and moved to [ARROW-17221](https://issues.apache.org/jira/browse/ARROW-17221) for future consideration.
* Likewise for `read_feather()` and `write_feather()`, which also support compression within the file itself rather than around it.
* Since the whole "detect compression and wrap in a compressed stream" feature seems limited to CSV and JSON, and in making the changes here I was having to hack around that feature, I refactored to pull it out of the internal functions `make_readable_file()` and `make_output_stream()` and do it only in the csv/json functions (see the first sketch below).
* In the process of refactoring, I noticed and fixed two bugs: (1) no matter what compression extension you provided to `make_output_stream()`, you would get a gzip-compressed stream because we weren't actually passing the codec to `CompressedOutputStream$create()`; (2) `.lz4` actually needs to be mapped to the "lz4_frame" codec; attempting to write a CSV to a stream created with `CompressedOutputStream$create(codec = "lz4")` raises an error (see the second sketch below). Neither bug was caught because our tests for this feature only covered gzip.
* The refactoring should also mean that ARROW-16619 (inferring compression from a URL), as well as from a SubTreeFileSystem (S3 buckets etc.), is also supported.

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
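
To make the filename-based behavior concrete, here is a minimal sketch of how the compression handling described above is meant to be used from the arrow R package. The data frame, file paths, and codec choices are illustrative, and which codecs are available depends on how the local Arrow library was built.

```r
library(arrow)

df <- data.frame(x = 1:5, y = letters[1:5])

# CSV (and JSON) are the formats where compression is inferred from the file
# extension and applied by wrapping the file in a compressed stream; that
# wrapping now lives in the csv/json read and write functions themselves.
write_csv_arrow(df, "data.csv.gz")
read_csv_arrow("data.csv.gz")

# Parquet and Feather compress data inside the file format, so the codec is
# passed as an argument rather than inferred from a ".gz"-style extension.
write_parquet(df, "data.parquet", compression = "snappy")
write_feather(df, "data.feather", compression = "zstd")
```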
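A second sketch illustrates the two bug fixes noted above: the codec passed to `CompressedOutputStream$create()` is honored instead of silently falling back to gzip, and `.lz4` files use the "lz4_frame" codec because the raw "lz4" codec cannot back a compressed output stream. Again, codec availability depends on the local Arrow build.

```r
library(arrow)

df <- data.frame(x = 1:5, y = letters[1:5])

# An explicitly constructed compressed stream; the codec argument is now
# passed through instead of always producing a gzip stream.
zstd_sink <- CompressedOutputStream$create("data.csv.zst", codec = "zstd")
write_csv_arrow(df, zstd_sink)
zstd_sink$close()

# For ".lz4" files the frame format is required; codec = "lz4" here errors.
lz4_sink <- CompressedOutputStream$create("data.csv.lz4", codec = "lz4_frame")
write_csv_arrow(df, lz4_sink)
lz4_sink$close()
```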