Commits


Jing Fang authored and GitHub committed 2d33ee91556
[ARM CPU] Enable FP16 kernels for GQA op (#23746)

### Description

- Enable hgemm and softmax fp16 kernels for GQA
- Add intra-loop parallelism to the RoPE fp16 kernel (a simplified sketch follows at the end of this message)

__Benchmarking models__

- float32: [phi-3 cpu accuracy level 0](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32)
- float16: [phi-3 gpu accuracy level 0](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/tree/main/cuda/cuda-int4-rtn-block-32)

Note:

- Both the fp32 and fp16 models share the same model structure and operator settings.
- GQA takes ~15% of the runtime.
- Prompt length 256, token generation length 512.

Linux (Ubuntu 24.04), Standard D16pls v5 (16 vCPUs, 32 GiB memory):

| | fp32 (tps) | old fp16 (tps) | new fp16 (tps) | new fp16 vs old fp16 | new fp16 vs fp32 |
|--|--|--|--|--|--|
| prompt processing | 31.22 | 44.24 | 46.29 | +4.6% | +48.25% |
| token generation | 4.75 | 7.2 | 7.95 | +10.39% | +67.43% |

### Motivation and Context

Speed up GQA on FP16.
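As a rough illustration of the intra-loop parallelism mentioned in the description, the sketch below partitions the RoPE token loop across worker threads so each worker rotates a contiguous slice of the sequence. This is a simplified stand-in, not the onnxruntime kernel: it uses scalar float math in place of the fp16/NEON path, `std::thread` in place of the ORT thread pool, and assumes an interleaved `[seq_len, head_dim]` layout for a single head; the helper names `rope_apply_token` and `rope_apply_parallel` are hypothetical.

```cpp
// Simplified sketch of RoPE with intra-loop parallelism over tokens.
// Assumptions (not the actual onnxruntime code): scalar float math stands in
// for the fp16/NEON kernel, std::thread stands in for the ORT thread pool,
// and the buffer holds one head laid out as [seq_len, head_dim].
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Rotate one token's head vector in-place using interleaved (x0, x1) pairs.
void rope_apply_token(float* x, std::size_t head_dim, std::size_t pos) {
  for (std::size_t i = 0; i < head_dim; i += 2) {
    const float theta = std::pow(10000.0f, -static_cast<float>(i) / head_dim);
    const float c = std::cos(pos * theta);
    const float s = std::sin(pos * theta);
    const float x0 = x[i];
    const float x1 = x[i + 1];
    x[i]     = x0 * c - x1 * s;
    x[i + 1] = x0 * s + x1 * c;
  }
}

// Split the token loop across threads: each worker handles a contiguous
// chunk of positions, which is the "intra-loop" parallelism idea.
void rope_apply_parallel(float* data, std::size_t seq_len, std::size_t head_dim,
                         std::size_t num_threads) {
  std::vector<std::thread> workers;
  const std::size_t chunk = (seq_len + num_threads - 1) / num_threads;
  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = t * chunk;
    const std::size_t end = std::min(seq_len, begin + chunk);
    if (begin >= end) break;
    workers.emplace_back([=] {
      for (std::size_t pos = begin; pos < end; ++pos) {
        rope_apply_token(data + pos * head_dim, head_dim, pos);
      }
    });
  }
  for (auto& w : workers) w.join();
}

int main() {
  const std::size_t seq_len = 256, head_dim = 64;
  std::vector<float> head(seq_len * head_dim, 1.0f);
  rope_apply_parallel(head.data(), seq_len, head_dim, 4);
  return 0;
}
```

Splitting over tokens keeps each worker's rotations independent, so no synchronization is needed inside the loop; the real kernel additionally vectorizes the per-pair rotation in fp16.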