William Butler authored and Micah Kornfield committed 423ca163a26 on 07 Jul 2022
PARQUET-2163: Handle decimal schemas with large fixed_len_byte_arrays

The precision calculation overflowed to infinity when the length of the
fixed_len_byte_array exceeded 128 bytes, which triggered an error when
that infinity was then converted to an int32. We can simplify the logic
by noting that log_b(a^x) = x * log_b(a), which avoids the intermediate
infinity. We also added a check for extremely large value sizes whose
implied max precision cannot fit in an int32. Even a 129-byte decimal
seems extreme.
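
The overflow and the simplification can be sketched as follows. This is a
minimal illustration, not the code in this commit: the function names
NaivePrecision and MaxDecimalPrecision are hypothetical, and returning
std::optional for the out-of-range case is an assumption made for the
example.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper mirroring the old approach:
// floor(log10(2^(8*B - 1))). std::pow overflows to infinity once
// 8*B - 1 exceeds the largest double exponent (B > 128).
double NaivePrecision(int32_t byte_len) {
  return std::floor(std::log10(std::pow(2.0, 8.0 * byte_len - 1.0)));
}

// Hypothetical helper using log_b(a^x) = x * log_b(a), so no
// intermediate power is ever computed. Returns std::nullopt when the
// length is not positive or the result cannot fit in an int32.
std::optional<int32_t> MaxDecimalPrecision(int32_t byte_len) {
  if (byte_len <= 0) return std::nullopt;
  const double precision =
      std::floor(std::log10(2.0) * (8.0 * byte_len - 1.0));
  if (precision > std::numeric_limits<int32_t>::max()) return std::nullopt;
  return static_cast<int32_t>(precision);
}
```

With these definitions, NaivePrecision(129) is infinite, while
MaxDecimalPrecision(129) stays finite and MaxDecimalPrecision(16) gives
the familiar 38 digits of a 128-bit decimal.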

The formula Parquet C++ was using is technically incorrect with respect
to the Parquet specification. The specification says that the max
precision is floor(log_10(2^(B*8 - 1) - 1)), where B is the length in
bytes, whereas the C++ implementation omitted the outer -1. However, this
is okay: it is easy to prove that the two values are always the same
(ignoring the realities of FP arithmetic), and in practice all three
formulas (the specification's, the old C++ one, and the simplified one)
agree for lengths up to 128 bytes when using FP.
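
One way to see why the outer -1 does not matter in exact arithmetic is
the short argument below. It is a sketch added for illustration; the
symbols n and k are introduced here and do not appear in the code.

```latex
Let $n = 8B - 1 \ge 7$ and $k = \lfloor \log_{10} 2^{n} \rfloor$, so that
\[
  10^{k} \le 2^{n} < 10^{k+1}.
\]
Since $n \ge 7$ we have $k \ge 2$, so $10^{k}$ is divisible by $5$ while
$2^{n}$ is not. Hence $2^{n} \ne 10^{k}$, the left inequality is strict,
and because both sides are integers,
\[
  10^{k} \le 2^{n} - 1 < 10^{k+1}
  \quad\Longrightarrow\quad
  \lfloor \log_{10}(2^{n} - 1) \rfloor = k = \lfloor \log_{10} 2^{n} \rfloor .
\]
```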

Bug found through fuzzing.

Closes #13456 from tachyonwill/float_overflow

Authored-by: William Butler <wab@google.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>