Commits


Weston Pace authored and Benjamin Kietzman committed 8e8a0009c44
ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Tested and added support for partitioning with nulls. I had to make some changes to the hash kernels. You can now specify how you want DictionaryEncode to treat nulls. The MASK option will continue the current behavior (null not in dictionary, null value in indices) and the ENCODE option will put `null` in the dictionary and there will be no null values in the indices array.

Partitioning on nulls will depend on the partitioning scheme.

For directory partitioning null is allowed on inner fields but it is not allowed on an outer field if an inner field is defined. In other words, if the schema is a(int32), b(int32), c(int32) then the following are allowed:

```
/       (a=null, b=null, c=null)
/32     (a=32, b=null, c=null)
/32/57  (a=32, b=57, c=null)
```

There is no way to specify `a=null, b=57, c=null`. This does mean that partition directories can contain a mix of files and nested partition directories (e.g. /32 might contain file.parquet and the directory /57). Alternatively we could just forbid nulls in the directory partitioning scheme.

For the hive scheme we need to be compatible with other tools that read/write hive. Those tools use a fallback value which defaults to `__HIVE_DEFAULT_PARTITION__`. So by default you would have directories that look like:

```
/a=__HIVE_DEFAULT_PARTITION__/b=__HIVE_DEFAULT_PARTITION__/c=__HIVE_DEFAULT_PARTITION__
```

The null fallback value is configurable as a string passed to HivePartitioning::HivePartitioning or HivePartitioning::MakeFactory. ARROW-11649 has been created for extending this null fallback configuration to R.

Closes #9323 from westonpace/feature/arrow-10438

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
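
The null-handling rules described in this commit message can be sketched in plain C++. This is only an illustration under stated assumptions, not the code from the change itself: `Key`, `FormatDirectory`, and `FormatHive` are hypothetical names, values are simplified to `int`, and nothing here uses Arrow's actual `Partitioning` classes.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for one partition field; not an Arrow type.
struct Key {
  std::string name;
  std::optional<int> value;  // std::nullopt models a null partition value
};

// Directory scheme (sketch): write values until the first null, then
// require every remaining field to be null as well.
std::string FormatDirectory(const std::vector<Key>& keys) {
  std::string path;
  bool seen_null = false;
  for (const auto& key : keys) {
    if (!key.value.has_value()) {
      seen_null = true;
      continue;
    }
    if (seen_null) {
      // e.g. a=null, b=57 cannot be expressed as a directory path
      throw std::invalid_argument("non-null field '" + key.name +
                                  "' follows a null field");
    }
    path += "/" + std::to_string(*key.value);
  }
  if (path.empty()) {
    return "/";  // all fields null
  }
  return path;
}

// Hive scheme (sketch): every field is written as name=value, and nulls
// are replaced by a configurable fallback string.
std::string FormatHive(
    const std::vector<Key>& keys,
    const std::string& null_fallback = "__HIVE_DEFAULT_PARTITION__") {
  // An empty fallback would produce ambiguous "name=" segments.
  assert(!null_fallback.empty());
  std::string path;
  for (const auto& key : keys) {
    std::string text = null_fallback;
    if (key.value.has_value()) {
      text = std::to_string(*key.value);
    }
    path += "/" + key.name + "=" + text;
  }
  return path;
}
```

With these helpers, `FormatDirectory` applied to `a=32, b=57, c=null` gives `/32/57`, while `a=null, b=57, c=null` is rejected because the directory layout has no way to skip a level. `FormatHive` has no such restriction, since each segment carries its own field name.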