Commits


Ye Wang authored and GitHub committed 342a5bf2b75
Improve rpb cuda kernel (#14195)

### Description

Average latency (ms) of the float16 relative position bias CUDA kernel on V100:

Kernel\Seq_Len | 16 | 32 | 64 | 128 | 256 | 384 | 512 | 768 | 1024 | 2048 | 4096
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
Before | 0.0494 | 0.0654 | 0.1519 | 0.4322 | 1.1865 | 2.4091 | 4.3676 | 14.912 | 36.517 | 142.09 | 561.80
After | 0.0483 | 0.0651 | 0.1294 | 0.3858 | 1.1128 | 2.2988 | 3.8391 | 14.290 | 34.542 | 136.13 | 529.54

### Motivation and Context

Follow-up to this review comment: https://github.com/microsoft/onnxruntime/pull/14149/#discussion_r1063152021

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>