Commits

Vincent Wang authored and GitHub committed 2a11f29eaa7
[CUDA] Optimize BiasGelu/BiasGeluGrad Kernel (#16608)

The PR optimizes the BiasGelu/BiasGeluGrad CUDA kernels with three changes:

- Use Erf instead of Normcdf for half compute.
- Change the CUDA thread organization for the BiasGelu kernel instead of using the binary elementwise template.
- Add vectorized support.

Using BiasGelu(A[256, 128, 768] + B[768]) on a V100 as an example, the perf numbers below are in µs:

| Configuration | FW | BW |
| --- | --- | --- |
| Before change | 246.37 | 292.77 |
| Use Erf | 152.86 | 238.98 |
| All changes above | 132.45 | 199.14 |

For Hugging Face's bertweet-base model, the changes reduce the step time (FW+BW) from 324.71766 ms to 316.42552 ms, a 1.026x speedup.

The Erf change applies to half data only; evaluation shows that for float on CUDA, Normcdf is faster. Perf for BFloat16 and on AMD was not checked, so those paths are kept unchanged.
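For illustration, below is a minimal sketch of an erf-based, half2-vectorized BiasGelu kernel along the lines the commit describes. The kernel name, signature, launch shape, and the assumptions that the last dimension is even and that the bias length divides the element count are all hypothetical; this is not the actual onnxruntime implementation.

```cuda
#include <cuda_fp16.h>

// Hypothetical sketch: erf-based BiasGelu over half2 pairs.
// n is the number of half2 elements; bias_n is the bias length in
// half2 pairs and is assumed to divide n.
__global__ void BiasGeluHalf2(const half2* x, const half2* bias, half2* y,
                              int n, int bias_n) {
  const float kAlpha = 0.70710678f;  // 1/sqrt(2)
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  // The bias broadcasts over the last dimension of the input.
  half2 v = __hadd2(x[i], bias[i % bias_n]);
  // Compute GELU in float: gelu(t) = 0.5 * t * (1 + erf(t / sqrt(2))).
  float2 f = __half22float2(v);
  f.x = 0.5f * f.x * (1.0f + erff(f.x * kAlpha));
  f.y = 0.5f * f.y * (1.0f + erff(f.y * kAlpha));
  y[i] = __float22half2_rn(f);
}

// Example launch for A[256, 128, 768] + B[768], viewed as half2 pairs:
//   int n = 256 * 128 * 768 / 2, bias_n = 768 / 2;
//   BiasGeluHalf2<<<(n + 255) / 256, 256>>>(x, bias, y, n, bias_n);
```

Loading half2 pairs halves the number of memory transactions relative to scalar half access, and evaluating erf in float keeps the nonlinearity accurate while still reading and writing half data, which is consistent with the commit's note that the Erf path is used for half only.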