Commit 72a3bde3305 — authored by kunal-vaishnavi, committed via GitHub
Add GQA on CPU in LLaMA scripts (#20720)

### Description

This PR adds support for using GroupQueryAttention (GQA) in models that run on CPU.

### Motivation and Context

Previously, the LLaMA scripts supported creating models with GQA for CUDA only. With the recently added support for [GQA on CPU](https://github.com/microsoft/onnxruntime/pull/20299), models where `num_attention_heads != num_key_value_heads` can now use the GQA op and [run much faster on CPU](https://github.com/microsoft/onnxruntime/pull/20598).
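For illustration, here is a minimal sketch of how a `GroupQueryAttention` node can be constructed with the ONNX helper API. The tensor names, shapes, and head counts are hypothetical placeholders, not the exact ones produced by the LLaMA scripts; the op itself is an ONNX Runtime contrib op in the `com.microsoft` domain.

```python
from onnx import helper

# Hypothetical LLaMA-style dimensions: fewer KV heads than query heads,
# i.e. num_attention_heads != num_key_value_heads, which is the GQA case.
num_attention_heads = 32
num_key_value_heads = 8

# GroupQueryAttention is an ORT contrib op, so it must be created in the
# "com.microsoft" domain. All input/output names below are placeholders.
gqa_node = helper.make_node(
    "GroupQueryAttention",
    inputs=[
        "query",                  # (batch, seq_len, num_heads * head_size)
        "key",                    # (batch, seq_len, kv_num_heads * head_size)
        "value",                  # (batch, seq_len, kv_num_heads * head_size)
        "past_key",               # KV cache carried over from previous steps
        "past_value",
        "seqlens_k",              # per-batch sequence lengths for the KV cache
        "total_sequence_length",  # scalar max total sequence length
    ],
    outputs=["attn_output", "present_key", "present_value"],
    name="GroupQueryAttention_0",
    domain="com.microsoft",
    num_heads=num_attention_heads,
    kv_num_heads=num_key_value_heads,
)
```

Because the same node definition is valid for both execution providers, emitting this op lets a single exported model run with GQA on CPU or CUDA, with the provider chosen at session creation time.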