Commits


Tianlei Wu authored and GitHub committed 83547d30672
[CUDA] Fix SkipLayerNorm vectorized kernel out-of-bounds read (#17943)

Fix a bug in https://github.com/microsoft/onnxruntime/pull/11803: when the hidden size is not exactly the same as the next size (for example, ld=320 in Stable Diffusion), the current vectorized kernel might read out of bounds, which might cause a CUDA failure.

This also resolves another issue: for the first and last sizes, the current macro produces dead code (some branches never run). The macro is changed to avoid those branches at the boundary sizes.

Performance tests with Stable Diffusion show that performance is on par before and after this fix.
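
The out-of-bounds read described above can be illustrated with a small host-side C++ sketch. It is not the kernel code from the commit. The size table `kSizes`, the helpers `NextSize` and `RowReadExtent`, and the per-thread bounds guard are all hypothetical, chosen only to show how rounding ld=320 up to a larger size makes a vectorized row read run past the end of the row. The sketch assumes each thread loads `ilp` contiguous elements and that ld is a multiple of `ilp`.

```cpp
// Hypothetical size table (illustrative values, not taken from the commit).
constexpr int kSizes[] = {32, 64, 128, 384, 768, 1024, 2048};

// Returns the smallest table entry >= ld, or -1 if ld exceeds every entry.
int NextSize(int ld) {
  for (int s : kSizes) {
    if (ld <= s) return s;
  }
  return -1;
}

// Simulates one row handled by NextSize(ld) / ilp threads, where thread t
// loads elements [t * ilp, t * ilp + ilp). Returns the exclusive upper bound
// of the element indices touched. A result greater than ld means the row
// read runs past the end of the row.
// Assumes ld is a multiple of ilp and ld fits in the table.
int RowReadExtent(int ld, int ilp, bool guarded) {
  const int next_size = NextSize(ld);
  const int threads = next_size / ilp;
  int extent = 0;
  for (int t = 0; t < threads; ++t) {
    const int begin = t * ilp;
    // Hypothetical guard: threads whose first element lies past ld do nothing.
    if (guarded && begin >= ld) continue;
    const int end = begin + ilp;
    if (end > extent) extent = end;
  }
  return extent;
}
```

Under these assumptions, `RowReadExtent(320, 4, false)` touches indices up to 384, which is 64 elements past the row, while the guarded variant stops at 320. When ld equals a table entry exactly (for example 384), both variants stay in bounds, which matches the observation that only hidden sizes falling between entries are affected.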