Commit 6817b013b9d by Jing Fang
[MLAS] Add q4 quantize and transpose kernels to support MatMulNBits QDQ fusion (#21054)

### Description
1. Added a kernel that quantizes the MatMul B tensor to q4 and stores the result in the same shape as the original tensor. Scales and zero points are computed as well, and the two share the same shape.
2. Added a kernel that transposes the q4 B tensor into the B tensor layout used by MatMulNBits. The scales and zero points are transposed as well.

Illustrative sketches of both kernels follow at the end of this description.

#### Benchmark
1024 x 4096 input, 64 quant block, 8 threads:
- quantize: 23035923 ns
- transpose: 718635 ns

1024 x 4095 input, 64 quant block, 8 threads:
- quantize: 26759319 ns
- transpose: 1279064 ns

### Motivation and Context
The MatMulNBits tool chain currently only supports converting a MatMul op directly to a MatMulNBits op. MatMulNBits is not a standard ONNX op, so the tool chain also needs to support converting MatMul to Q/DQ format; a later transform step then converts DQ + MatMul into MatMulNBits. The tensors stored in the DQ node are the quantized constants, and they are what ultimately gets stored in MatMulNBits.
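The following is a minimal, self-contained sketch of what the first kernel does conceptually: blockwise asymmetric 4-bit quantization of a row-major K x N B tensor, with the quantized values stored in the original K x N orientation (two nibbles per byte) and one scale/zero point per `block_size` rows of each column. The function names (`SetQ4`, `QuantizeBQ4`) and the exact packing layout are assumptions for illustration, not the MLAS API added by this PR.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Write the 4-bit value v at flat element index i (low nibble first).
// Hypothetical helper, not part of MLAS.
inline void SetQ4(uint8_t* packed, size_t i, uint8_t v) {
    uint8_t& b = packed[i / 2];
    b = (i % 2 == 0) ? ((b & 0xF0) | (v & 0x0F)) : ((b & 0x0F) | (v << 4));
}

// Quantize B (row-major, K x N) blockwise: one scale/zero point per
// block_size rows of each column. Scales and zero points share the same
// (k_blocks x N) shape, matching the PR description.
void QuantizeBQ4(const float* B, size_t K, size_t N, size_t block_size,
                 std::vector<uint8_t>& packed,
                 std::vector<float>& scales,
                 std::vector<uint8_t>& zero_points) {
    const size_t k_blocks = (K + block_size - 1) / block_size;
    packed.assign((K * N + 1) / 2, 0);
    scales.assign(k_blocks * N, 0.0f);
    zero_points.assign(k_blocks * N, 0);
    for (size_t n = 0; n < N; ++n) {
        for (size_t kb = 0; kb < k_blocks; ++kb) {
            const size_t k0 = kb * block_size;
            const size_t k1 = std::min(k0 + block_size, K);
            // Track min/max over the block; include 0 so that zero stays
            // exactly representable after quantization.
            float mn = 0.0f, mx = 0.0f;
            for (size_t k = k0; k < k1; ++k) {
                mn = std::min(mn, B[k * N + n]);
                mx = std::max(mx, B[k * N + n]);
            }
            const float scale = (mx - mn) / 15.0f;  // uint4 range: 0..15
            const float inv = scale != 0.0f ? 1.0f / scale : 0.0f;
            const uint8_t zp = static_cast<uint8_t>(
                std::clamp(std::nearbyint(-mn * inv), 0.0f, 15.0f));
            scales[kb * N + n] = scale;
            zero_points[kb * N + n] = zp;
            for (size_t k = k0; k < k1; ++k) {
                const float q = std::nearbyint(B[k * N + n] * inv) + zp;
                SetQ4(packed.data(), k * N + n,
                      static_cast<uint8_t>(std::clamp(q, 0.0f, 15.0f)));
            }
        }
    }
}
```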
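And a hedged sketch of the second kernel: repacking the q4 data from the original K x N orientation into an N-major orientation, with the (k_blocks x N) scales and zero points transposed to (N x k_blocks) alongside it. The exact MatMulNBits byte layout may differ from this (for clarity the sketch keeps one zero point per byte rather than nibble-packing them); `GetQ4`/`SetQ4`/`TransposeQ4` are illustrative names, not MLAS functions.

```cpp
#include <cstdint>
#include <vector>

// Read/write 4-bit values at a flat element index (low nibble first).
// Hypothetical helpers, not part of MLAS.
inline uint8_t GetQ4(const uint8_t* p, size_t i) {
    return (i % 2 == 0) ? (p[i / 2] & 0x0F) : (p[i / 2] >> 4);
}
inline void SetQ4(uint8_t* p, size_t i, uint8_t v) {
    uint8_t& b = p[i / 2];
    b = (i % 2 == 0) ? ((b & 0xF0) | (v & 0x0F)) : ((b & 0x0F) | (v << 4));
}

// Transpose packed q4 from flat index k * N + n to n * K + k, and the
// per-block scales/zero points from (k_blocks x N) to (N x k_blocks).
void TransposeQ4(const uint8_t* src_q4, const float* src_scales,
                 const uint8_t* src_zps, size_t K, size_t N, size_t k_blocks,
                 std::vector<uint8_t>& dst_q4, std::vector<float>& dst_scales,
                 std::vector<uint8_t>& dst_zps) {
    dst_q4.assign((K * N + 1) / 2, 0);
    dst_scales.assign(N * k_blocks, 0.0f);
    dst_zps.assign(N * k_blocks, 0);
    for (size_t k = 0; k < K; ++k)
        for (size_t n = 0; n < N; ++n)
            SetQ4(dst_q4.data(), n * K + k, GetQ4(src_q4, k * N + n));
    for (size_t kb = 0; kb < k_blocks; ++kb)
        for (size_t n = 0; n < N; ++n) {
            dst_scales[n * k_blocks + kb] = src_scales[kb * N + n];
            dst_zps[n * k_blocks + kb] = src_zps[kb * N + n];
        }
}
```

Separating quantization from repacking, as the PR does, keeps the Q/DQ graph representation (which wants the original tensor shape) independent of the kernel-friendly layout that MatMulNBits consumes; the benchmark above shows the transpose step is well over an order of magnitude cheaper than the quantize step.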