Commits


aciddelgado authored and GitHub committed 178f7caaebf
GQA Memory Efficient Kernel (#17920)

Implement the Cutlass Memory Efficient Attention kernel in the Group Query Attention operator.

### Motivation and Context

Before this change, the Group Query Attention operator was supported only by Flash Attention. While that is the most efficient kernel for the operation, it requires sm >= 80. The Cutlass Memory Efficient Attention kernel supports sm >= 53, allowing us to support a broader range of GPU hardware.
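The commit message implies a dispatch on the device's SM version: prefer Flash Attention where it is available (sm >= 80), and fall back to the memory-efficient kernel on older hardware (sm >= 53). Below is a minimal, hypothetical sketch of such a selection using the CUDA runtime API; the enum and the `SelectGqaKernel` helper are illustrative names, not ONNX Runtime's actual API.

```cpp
// Hypothetical sketch of kernel selection by compute capability.
// Names (GqaKernel, SelectGqaKernel) are illustrative, not ORT's API.
#include <cuda_runtime.h>
#include <cstdio>

enum class GqaKernel { FlashAttention, MemoryEfficientAttention, Unsupported };

// Pick an attention kernel from the device's SM version:
// Flash Attention needs sm >= 80; the Cutlass memory-efficient
// kernel extends support down to sm >= 53.
GqaKernel SelectGqaKernel(int device_id) {
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess) {
    return GqaKernel::Unsupported;
  }
  const int sm = prop.major * 10 + prop.minor;
  if (sm >= 80) return GqaKernel::FlashAttention;
  if (sm >= 53) return GqaKernel::MemoryEfficientAttention;
  return GqaKernel::Unsupported;
}

int main() {
  switch (SelectGqaKernel(/*device_id=*/0)) {
    case GqaKernel::FlashAttention:
      std::puts("Using Flash Attention (sm >= 80)");
      break;
    case GqaKernel::MemoryEfficientAttention:
      std::puts("Using Cutlass Memory Efficient Attention (sm >= 53)");
      break;
    default:
      std::puts("No supported GQA kernel for this device");
  }
  return 0;
}
```

Ordering the checks from the most capable kernel downward keeps the fastest path as the default while still covering older GPUs, which matches the motivation stated in the commit.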