Liqun Fu authored and GitHub committed af04b202baf
Rope embedding kernel to use avx2 (#23694)

### Description

Credit to [chethanpk](https://github.com/chethanpk), who provided RoPE Embedding in a patch; that patch is the first commit of this PR. I have confirmed the performance improvement from this code change. My analysis is based on phi-3-mini-4k-instruct-int4-int8-blklen32.

End-to-end benchmarks from onnxruntime-genai do not show a clear improvement. This is because GQA accounts for only a small portion of the whole model (<10%), and RoPE accounts for only a small portion of GQA (12%). The profiles below, with and without AVX2, show the cost of RoPE dropping from 82.42 to 18.86, so I still recommend merging this PR.

With AVX2 RoPE:
Name: GroupQueryAttention_rotary, Mean Duration: 18.86, Percentage: 3.16%

Plain C++ RoPE:
Name: GroupQueryAttention_rotary, Mean Duration: 82.42, Percentage: 12.20%

MLAS benchmark:

dim | interleaved | baseline | new
-|-|-|-
128 | false | 735 | 18.1
256 | false | 1470 | 31.7
512 | false | 2938 | 59.2
1024 | false | 5876 | 81.5
128 | true | 368 | 23.1
256 | true | 735 | 34.3
512 | true | 1470 | 62.0
1024 | true | 2937 | 125

---------

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>