Commits

Vincent Wang authored and GitHub committed 2a11f29eaa7
[CUDA] Optimize BiasGelu/BiasGeluGrad Kernel (#16608)

The PR optimizes the BiasGelu/BiasGeluGrad CUDA kernels with three changes:

- Use Erf instead of Normcdf for half compute.
- Change the CUDA thread organization for the BiasGelu kernel instead of using the binary elementwise template.
- Add vectorized support.

Using BiasGelu(A[256, 128, 768] + B[768]) on a V100 as an example, the perf numbers below are in µs:

| Configuration | FW | BW |
| --- | --- | --- |
| Before change | 246.37 | 292.77 |
| Use Erf | 152.86 | 238.98 |
| All changes above | 132.45 | 199.14 |

For Hugging Face's bertweet-base model, the changes reduce the step time (FW+BW) from 324.71766 ms to 316.42552 ms, a 1.026x speedup.

The Erf change applies to half data only; evaluation shows that for float on CUDA, Normcdf is faster. Perf for BFloat16 and on AMD was not checked, so those paths are kept unchanged.
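For illustration, below is a minimal sketch of an erf-based, half2-vectorized BiasGelu kernel along the lines the commit describes. The kernel name, signature, launch shape, and the assumptions that the last dimension is even and that the bias length divides the element count are all hypothetical; this is not the actual onnxruntime implementation.

```cuda
#include <cuda_fp16.h>

// Hypothetical sketch: erf-based BiasGelu over half2 pairs.
// n is the number of half2 elements; bias_n is the bias length in
// half2 pairs and is assumed to divide n.
__global__ void BiasGeluHalf2(const half2* x, const half2* bias, half2* y,
                              int n, int bias_n) {
  const float kAlpha = 0.70710678f;  // 1/sqrt(2)
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  // The bias broadcasts over the last dimension of the input.
  half2 v = __hadd2(x[i], bias[i % bias_n]);
  // Compute GELU in float: gelu(t) = 0.5 * t * (1 + erf(t / sqrt(2))).
  float2 f = __half22float2(v);
  f.x = 0.5f * f.x * (1.0f + erff(f.x * kAlpha));
  f.y = 0.5f * f.y * (1.0f + erff(f.y * kAlpha));
  y[i] = __float22half2_rn(f);
}

// Example launch for A[256, 128, 768] + B[768], viewed as half2 pairs:
//   int n = 256 * 128 * 768 / 2, bias_n = 768 / 2;
//   BiasGeluHalf2<<<(n + 255) / 256, 256>>>(x, bias, y, n, bias_n);
```

Loading half2 pairs halves the number of memory transactions relative to scalar half access, and evaluating erf in float keeps the nonlinearity accurate while still reading and writing half data, which is consistent with the commit's note that the Erf path is used for half only.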