Commits


Jing Fang authored and GitHub committed 2d33ee91556
[ARM CPU] Enable FP16 kernels for GQA op (#23746)

### Description

- Enable hgemm and softmax fp16 kernels for GQA
- Add intra-loop parallelism to the RoPE fp16 kernel (a simplified sketch follows at the end of this message)

__Benchmarking models__

- float32: [phi-3 cpu accuracy level 0](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32)
- float16: [phi-3 gpu accuracy level 0](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/tree/main/cuda/cuda-int4-rtn-block-32)

Note:

- Both the fp32 and fp16 models share the same model structure and operator settings.
- GQA takes ~15% of the runtime.
- Prompt length 256, token generation length 512.

Linux (Ubuntu 24.04), Standard D16pls v5 (16 vCPUs, 32 GiB memory):

| | fp32 (tps) | old fp16 (tps) | new fp16 (tps) | new fp16 vs old fp16 | new fp16 vs fp32 |
|--|--|--|--|--|--|
| prompt processing | 31.22 | 44.24 | 46.29 | +4.6% | +48.25% |
| token generation | 4.75 | 7.2 | 7.95 | +10.39% | +67.43% |

### Motivation and Context

Speed up GQA on FP16.
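As a rough illustration of the intra-loop parallelism mentioned in the description, the sketch below partitions the RoPE token loop across worker threads so each worker rotates a contiguous slice of the sequence. This is a simplified stand-in, not the onnxruntime kernel: it uses scalar float math in place of the fp16/NEON path, `std::thread` in place of the ORT thread pool, and assumes an interleaved `[seq_len, head_dim]` layout for a single head; the helper names `rope_apply_token` and `rope_apply_parallel` are hypothetical.

```cpp
// Simplified sketch of RoPE with intra-loop parallelism over tokens.
// Assumptions (not the actual onnxruntime code): scalar float math stands in
// for the fp16/NEON kernel, std::thread stands in for the ORT thread pool,
// and the buffer holds one head laid out as [seq_len, head_dim].
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Rotate one token's head vector in-place using interleaved (x0, x1) pairs.
void rope_apply_token(float* x, std::size_t head_dim, std::size_t pos) {
  for (std::size_t i = 0; i < head_dim; i += 2) {
    const float theta = std::pow(10000.0f, -static_cast<float>(i) / head_dim);
    const float c = std::cos(pos * theta);
    const float s = std::sin(pos * theta);
    const float x0 = x[i];
    const float x1 = x[i + 1];
    x[i]     = x0 * c - x1 * s;
    x[i + 1] = x0 * s + x1 * c;
  }
}

// Split the token loop across threads: each worker handles a contiguous
// chunk of positions, which is the "intra-loop" parallelism idea.
void rope_apply_parallel(float* data, std::size_t seq_len, std::size_t head_dim,
                         std::size_t num_threads) {
  std::vector<std::thread> workers;
  const std::size_t chunk = (seq_len + num_threads - 1) / num_threads;
  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = t * chunk;
    const std::size_t end = std::min(seq_len, begin + chunk);
    if (begin >= end) break;
    workers.emplace_back([=] {
      for (std::size_t pos = begin; pos < end; ++pos) {
        rope_apply_token(data + pos * head_dim, head_dim, pos);
      }
    });
  }
  for (auto& w : workers) w.join();
}

int main() {
  const std::size_t seq_len = 256, head_dim = 64;
  std::vector<float> head(seq_len * head_dim, 1.0f);
  rope_apply_parallel(head.data(), seq_len, head_dim, 4);
  return 0;
}
```

Splitting over tokens keeps each worker's rotations independent, so no synchronization is needed inside the loop; the real kernel additionally vectorizes the per-pair rotation in fp16.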