Commit df28c7d73b7 — authored by Adrian Lizarraga, committed via GitHub
[Quant tool] Improve performance of int4 weight quantization (#20935)

### Description
- Uses our own quantization functions instead of the ONNX reference implementation of QuantizeLinear when quantizing weights to int4.
- Uses a custom function that packs pairs of 4-bit elements into bytes.

### Motivation and Context
Running the quantization tool to create QDQ models with int4 weights could take up to 7x longer than necessary. This PR uses our own quantization and byte-packing utilities to improve performance.

#### Measurements
Model with ~5M parameters to quantize to int4:
- Current implementation: **84.5s**
- Replacing only the ONNX QuantizeLinear implementation: **50.3s** (1.68x speedup)
- This PR (replace ONNX Q impl + custom packing func): **13.5s** (6.26x speedup)

Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
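As a rough illustration of the two techniques the PR combines, the following is a minimal NumPy sketch (not the actual ONNX Runtime code): a vectorized affine quantization to the signed int4 range, and a packing step that stores two 4-bit values per byte. The function names, the low-nibble-first packing order, and the scalar-scale signature are assumptions made for this example.

```python
import numpy as np

def quantize_int4(weights: np.ndarray, scale: float, zero_point: int = 0) -> np.ndarray:
    # Hypothetical sketch: vectorized affine quantization to signed int4 [-8, 7].
    # Avoids a per-element Python loop by operating on whole arrays.
    q = np.rint(weights / scale) + zero_point
    return np.clip(q, -8, 7).astype(np.int8)

def pack_int4(values: np.ndarray) -> np.ndarray:
    # Hypothetical sketch: pack pairs of int4 values into bytes,
    # low nibble first (packing order is an assumption here).
    flat = values.ravel()
    if flat.size % 2:
        flat = np.append(flat, np.int8(0))  # pad an odd-length array
    low = (flat[0::2] & 0x0F).astype(np.uint8)
    high = (flat[1::2] & 0x0F).astype(np.uint8)
    return low | (high << 4)

# Example usage
w = np.array([1.0, -2.0, 3.5, 0.25], dtype=np.float32)
q = quantize_int4(w, scale=0.5)      # -> [2, -4, 7, 0]
packed = pack_int4(q)                # two bytes for four int4 values
```

Whole-array masking and shifting like this is the kind of change that yields the large speedups reported above, since the cost of looping in Python per element is eliminated.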