Commit 6817b013b9d by Jing Fang
[MLAS] Add q4 quantize and transpose kernels to support MatMulNBits QDQ fusion (#21054)

### Description
1. Added a kernel that quantizes the MatMul B tensor to q4 and stores the result in the same shape as the original tensor. Scales and zero points are computed as well, and the two share the same shape.
2. Added a kernel that transposes the q4 B tensor into the B tensor layout used by MatMulNBits. The scales and zero points are transposed as well.

Illustrative sketches of both kernels follow at the end of this description.

#### Benchmark
1024 x 4096 input, 64 quant block, 8 threads:
- quantize: 23035923 ns
- transpose: 718635 ns

1024 x 4095 input, 64 quant block, 8 threads:
- quantize: 26759319 ns
- transpose: 1279064 ns

### Motivation and Context
The MatMulNBits tool chain currently only supports converting a MatMul op directly to a MatMulNBits op. MatMulNBits is not a standard ONNX op, so the tool chain also needs to support converting MatMul to Q/DQ format; a later transform step then converts DQ + MatMul into MatMulNBits. The tensors stored in the DQ node are the quantized constants, and they are what ultimately gets stored in MatMulNBits.
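The following is a minimal, self-contained sketch of what the first kernel does conceptually: blockwise asymmetric 4-bit quantization of a row-major K x N B tensor, with the quantized values stored in the original K x N orientation (two nibbles per byte) and one scale/zero point per `block_size` rows of each column. The function names (`SetQ4`, `QuantizeBQ4`) and the exact packing layout are assumptions for illustration, not the MLAS API added by this PR.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Write the 4-bit value v at flat element index i (low nibble first).
// Hypothetical helper, not part of MLAS.
inline void SetQ4(uint8_t* packed, size_t i, uint8_t v) {
    uint8_t& b = packed[i / 2];
    b = (i % 2 == 0) ? ((b & 0xF0) | (v & 0x0F)) : ((b & 0x0F) | (v << 4));
}

// Quantize B (row-major, K x N) blockwise: one scale/zero point per
// block_size rows of each column. Scales and zero points share the same
// (k_blocks x N) shape, matching the PR description.
void QuantizeBQ4(const float* B, size_t K, size_t N, size_t block_size,
                 std::vector<uint8_t>& packed,
                 std::vector<float>& scales,
                 std::vector<uint8_t>& zero_points) {
    const size_t k_blocks = (K + block_size - 1) / block_size;
    packed.assign((K * N + 1) / 2, 0);
    scales.assign(k_blocks * N, 0.0f);
    zero_points.assign(k_blocks * N, 0);
    for (size_t n = 0; n < N; ++n) {
        for (size_t kb = 0; kb < k_blocks; ++kb) {
            const size_t k0 = kb * block_size;
            const size_t k1 = std::min(k0 + block_size, K);
            // Track min/max over the block; include 0 so that zero stays
            // exactly representable after quantization.
            float mn = 0.0f, mx = 0.0f;
            for (size_t k = k0; k < k1; ++k) {
                mn = std::min(mn, B[k * N + n]);
                mx = std::max(mx, B[k * N + n]);
            }
            const float scale = (mx - mn) / 15.0f;  // uint4 range: 0..15
            const float inv = scale != 0.0f ? 1.0f / scale : 0.0f;
            const uint8_t zp = static_cast<uint8_t>(
                std::clamp(std::nearbyint(-mn * inv), 0.0f, 15.0f));
            scales[kb * N + n] = scale;
            zero_points[kb * N + n] = zp;
            for (size_t k = k0; k < k1; ++k) {
                const float q = std::nearbyint(B[k * N + n] * inv) + zp;
                SetQ4(packed.data(), k * N + n,
                      static_cast<uint8_t>(std::clamp(q, 0.0f, 15.0f)));
            }
        }
    }
}
```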
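And a hedged sketch of the second kernel: repacking the q4 data from the original K x N orientation into an N-major orientation, with the (k_blocks x N) scales and zero points transposed to (N x k_blocks) alongside it. The exact MatMulNBits byte layout may differ from this (for clarity the sketch keeps one zero point per byte rather than nibble-packing them); `GetQ4`/`SetQ4`/`TransposeQ4` are illustrative names, not MLAS functions.

```cpp
#include <cstdint>
#include <vector>

// Read/write 4-bit values at a flat element index (low nibble first).
// Hypothetical helpers, not part of MLAS.
inline uint8_t GetQ4(const uint8_t* p, size_t i) {
    return (i % 2 == 0) ? (p[i / 2] & 0x0F) : (p[i / 2] >> 4);
}
inline void SetQ4(uint8_t* p, size_t i, uint8_t v) {
    uint8_t& b = p[i / 2];
    b = (i % 2 == 0) ? ((b & 0xF0) | (v & 0x0F)) : ((b & 0x0F) | (v << 4));
}

// Transpose packed q4 from flat index k * N + n to n * K + k, and the
// per-block scales/zero points from (k_blocks x N) to (N x k_blocks).
void TransposeQ4(const uint8_t* src_q4, const float* src_scales,
                 const uint8_t* src_zps, size_t K, size_t N, size_t k_blocks,
                 std::vector<uint8_t>& dst_q4, std::vector<float>& dst_scales,
                 std::vector<uint8_t>& dst_zps) {
    dst_q4.assign((K * N + 1) / 2, 0);
    dst_scales.assign(N * k_blocks, 0.0f);
    dst_zps.assign(N * k_blocks, 0);
    for (size_t k = 0; k < K; ++k)
        for (size_t n = 0; n < N; ++n)
            SetQ4(dst_q4.data(), n * K + k, GetQ4(src_q4, k * N + n));
    for (size_t kb = 0; kb < k_blocks; ++kb)
        for (size_t n = 0; n < N; ++n) {
            dst_scales[n * k_blocks + kb] = src_scales[kb * N + n];
            dst_zps[n * k_blocks + kb] = src_zps[kb * N + n];
        }
}
```

Separating quantization from repacking, as the PR does, keeps the Q/DQ graph representation (which wants the original tensor shape) independent of the kernel-friendly layout that MatMulNBits consumes; the benchmark above shows the transpose step is well over an order of magnitude cheaper than the quantize step.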