Commits


Yufeng Li authored and GitHub committed 90d1f537cb3
optimize SLN with large dimension (#18138) ### Description <!-- Describe your changes. --> Optimize SkipLayerNorm for large dimension (>=2048) by handling 8 elements in one thread. It avoid the re-writing and re-loading sum of input, skip and bias to main memory. It reduces the latency of dimension 4096 with small batch size from ~18us to ~3.8us on A100. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->