Xavier Dupré authored and GitHub committed c8399a81fed
Quantization tool: support float 8 with MatMul, support float 16 weights (#18043)

### Description

Whenever a QuantizeLinear or DequantizeLinear node is inserted, the type of the weights before quantization must be known to create the scale with the expected type. Another option would be to insert many CastLike operators, but that would push the burden onto the onnxruntime optimizer. This PR tries to avoid changing the signatures. To do so, it modifies the scale computation to store the result in a numpy array rather than a python float. The numpy array must have the same type as the weights to quantize. The PR adds many `assert` statements to check that the scale type is not a python type or a float64. These were added to make sure all the code follows the same logic, and the lines were kept for the first review.

DequantizeLinear and QuantizeLinear cannot be fully tested with onnx==1.15: PR https://github.com/onnx/onnx/pull/5709 is needed to fix shape inference, and PR https://github.com/onnx/onnx/pull/5473 is needed to support QLinearMatMul with float 16. That explains why some tests are disabled with float 16.

### Motivation and Context

The current quantization tool assumes every weight is float 32. For large models such as LLAMA, the weights are usually float 16. The quantization tool needs to quantize such weights.
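As a minimal sketch of the idea (the helper name `compute_scale` and the symmetric scheme are illustrative assumptions, not the tool's actual code): keeping the scale as a zero-dimensional numpy array of the weight's dtype avoids the silent upcast to float64 that a python float would introduce.

```python
import numpy as np


def compute_scale(weights: np.ndarray, qmin: int = -128, qmax: int = 127) -> np.ndarray:
    """Hypothetical symmetric scale computation matching the weight dtype.

    Returning a zero-dimensional numpy array instead of a python float
    (which is float64) keeps the scale in the same precision as the
    weights, e.g. float16 for float16 models.
    """
    amax = np.abs(weights).max()
    scale = amax / ((qmax - qmin) / 2)
    # Cast back to the weight dtype; a python float would silently be float64.
    scale = np.array(scale, dtype=weights.dtype)
    # The kind of checks the PR adds: not a python type, not float64.
    assert not isinstance(scale, float)
    assert scale.dtype != np.float64
    return scale


w16 = np.random.randn(4, 4).astype(np.float16)
print(compute_scale(w16).dtype)  # float16
```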