Commit 72a3bde3305 — authored by kunal-vaishnavi, committed via GitHub
Add GQA on CPU in LLaMA scripts (#20720)

### Description

This PR adds support for using GroupQueryAttention (GQA) in models that run on CPU.

### Motivation and Context

Previously, the LLaMA scripts supported creating models with GQA for CUDA only. With the recently added support for [GQA on CPU](https://github.com/microsoft/onnxruntime/pull/20299), models where `num_attention_heads != num_key_value_heads` can now use the GQA op and [run much faster on CPU](https://github.com/microsoft/onnxruntime/pull/20598).
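For illustration, here is a minimal sketch of how a `GroupQueryAttention` node can be constructed with the ONNX helper API. The tensor names, shapes, and head counts are hypothetical placeholders, not the exact ones produced by the LLaMA scripts; the op itself is an ONNX Runtime contrib op in the `com.microsoft` domain.

```python
from onnx import helper

# Hypothetical LLaMA-style dimensions: fewer KV heads than query heads,
# i.e. num_attention_heads != num_key_value_heads, which is the GQA case.
num_attention_heads = 32
num_key_value_heads = 8

# GroupQueryAttention is an ORT contrib op, so it must be created in the
# "com.microsoft" domain. All input/output names below are placeholders.
gqa_node = helper.make_node(
    "GroupQueryAttention",
    inputs=[
        "query",                  # (batch, seq_len, num_heads * head_size)
        "key",                    # (batch, seq_len, kv_num_heads * head_size)
        "value",                  # (batch, seq_len, kv_num_heads * head_size)
        "past_key",               # KV cache carried over from previous steps
        "past_value",
        "seqlens_k",              # per-batch sequence lengths for the KV cache
        "total_sequence_length",  # scalar max total sequence length
    ],
    outputs=["attn_output", "present_key", "present_value"],
    name="GroupQueryAttention_0",
    domain="com.microsoft",
    num_heads=num_attention_heads,
    kv_num_heads=num_key_value_heads,
)
```

Because the same node definition is valid for both execution providers, emitting this op lets a single exported model run with GQA on CPU or CUDA, with the provider chosen at session creation time.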