kunal-vaishnavi authored and GitHub committed 4ac98d6d656 on 14 Mar 2024

Update replacing MultiHeadAttention with GroupQueryAttention (#19882)

### Description
This PR updates how MultiHeadAttention (MHA) is replaced with
GroupQueryAttention (GQA). It is related to the changes in [this
PR](https://github.com/microsoft/onnxruntime/pull/18906).

### Motivation and Context
The updated replacement of MHA with GQA includes the following fusion
changes:
- Apply sliding window within GQA
- Fuse the rotary embeddings within GQA
- Fuse the 3 MatMuls into 1 packed MatMul if possible
- Fuse the 3 Adds into 1 packed Add if possible
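
The last two items rest on a simple identity: concatenating the Q, K and V weight matrices column-wise, and the three biases end to end, produces the same values as three separate MatMul and Add pairs. The sketch below illustrates only that arithmetic on toy data. It is not the fusion code from this PR, which rewrites ONNX graph nodes, and every name and shape in it is a hypothetical chosen for illustration.

```python
# Illustration only: plain-Python matrices (lists of rows), hypothetical names.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_bias(m, bias):
    """Add a bias vector to every row of a matrix."""
    return [[v + b for v, b in zip(row, bias)] for row in m]

def hconcat(*mats):
    """Concatenate matrices column-wise (all must have the same row count)."""
    return [sum((list(r) for r in rows), []) for rows in zip(*mats)]

# Toy input: sequence length 2, hidden size 2.
x = [[1, 2], [3, 4]]

# With GQA, K and V may be narrower than Q (fewer KV heads than query heads).
q_size, kv_size = 2, 1
w_q, b_q = [[1, 0], [0, 1]], [1, 1]
w_k, b_k = [[2], [1]], [0]
w_v, b_v = [[0], [3]], [-1]

# Unfused: 3 MatMuls followed by 3 Adds.
q = add_bias(matmul(x, w_q), b_q)
k = add_bias(matmul(x, w_k), b_k)
v = add_bias(matmul(x, w_v), b_v)

# Fused: 1 packed MatMul followed by 1 packed Add.
w_qkv = hconcat(w_q, w_k, w_v)
b_qkv = b_q + b_k + b_v
qkv = add_bias(matmul(x, w_qkv), b_qkv)

# Slicing the packed result recovers the three separate outputs.
q_packed = [row[:q_size] for row in qkv]
k_packed = [row[q_size:q_size + kv_size] for row in qkv]
v_packed = [row[q_size + kv_size:] for row in qkv]

assert (q_packed, k_packed, v_packed) == (q, k, v)
```

Because the packed output is just Q, K and V laid side by side, a consumer that knows the Q and KV widths can slice it without any extra computation.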