Commits


Yufeng Li authored and GitHub committed 8c5db7f9734
use legacy stream mode (#2076)

In ORT, there are only 3 CUDA streams: default, HtoD, and DtoH. Both HtoD and DtoH are non-blocking streams, so per-thread stream mode provides no benefit. I also tested in a multi-threaded environment, and legacy mode still performs better than per-thread mode there.

Below are the timings for a 3-layer BERT on a V100. Unit is ms.

batch size 1:
concurrency | c=1  | c=2  | c=4
legacy      | 0.54 | 1.17 | 2.68
per-thread  | 0.66 | 1.37 | 2.86

batch size 4:
concurrency | c=1  | c=2  | c=4
legacy      | 1.1  | 2.22 | 4.6
per-thread  | 1.21 | 2.44 | 4.98

batch size 64:
concurrency | c=1  | c=2   | c=4
legacy      | 8.09 | 16.13 | 32.37
per-thread  | 8.18 | 16.26 | 32.45
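
To put the numbers above in relative terms, the following Python sketch computes how much slower per-thread mode is than legacy mode at each concurrency level. The values are copied from the measurements above; the names `LATENCY_MS` and `per_thread_overhead_pct` are illustrative only and are not part of ORT. It assumes the reported values are latencies, so lower is better.

```python
# Timings in ms, copied from the commit message (3-layer BERT on V100).
# Layout: batch size -> stream mode -> timings at concurrency c=1, c=2, c=4.
# Assumption: the values are latencies, so a lower number is better.
LATENCY_MS = {
    1: {"legacy": [0.54, 1.17, 2.68], "per-thread": [0.66, 1.37, 2.86]},
    4: {"legacy": [1.1, 2.22, 4.6], "per-thread": [1.21, 2.44, 4.98]},
    64: {"legacy": [8.09, 16.13, 32.37], "per-thread": [8.18, 16.26, 32.45]},
}
CONCURRENCY = [1, 2, 4]


def per_thread_overhead_pct(batch_size):
    """Percent by which per-thread mode is slower than legacy, per concurrency level."""
    row = LATENCY_MS[batch_size]
    return [
        round((per_thread - legacy) / legacy * 100, 1)
        for legacy, per_thread in zip(row["legacy"], row["per-thread"])
    ]


if __name__ == "__main__":
    for batch_size in LATENCY_MS:
        overheads = per_thread_overhead_pct(batch_size)
        cells = ", ".join(
            f"c={c}: +{pct}%" for c, pct in zip(CONCURRENCY, overheads)
        )
        print(f"batch size {batch_size}: {cells}")
```

With these figures, the relative gap is largest for small batches (about 22% at batch size 1, c=1) and shrinks to roughly 1% or less at batch size 64, where kernel time dominates.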