Public / onnxruntime / 76dfe8108b4

Zhang Lei authored and GitHub committed 76dfe8108b4 12 Aug 2021

Optimize quantized LSTM (#8634)

* Optimize some LSTM gate computations. Remove unnecessary string constructions.

* Change GCC optimization flags for compute-bound logic in rnn_helpers.

* Improve QGemm for M=1.

* Some improvements for AVX512.

* Add a condition to limit GCC-related macros.

* Correct QGemm assembly for M=1 AVX2 optimization to pass mlas_test.

* Fix rnn_helper build issue for wasm.

* Improve the asm code according to feedback.

* Remove the customized vectorize and unroll options for GCC.
Use restrict on some functions to help GCC vectorize them correctly.
Rewrite clip_add_bias() so that GCC vectorizes it correctly.

* Better restrict semantics for merge_lstm_gates_to_memory() by adding in_place().
Add MSVC __restrict to the clip_add_bias() method so that it vectorizes correctly.

* Force a CI restart, as it was stuck on the onnxruntime-python-checks-ci-pipeline, which cannot be restarted.
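
The restrict-related items above can be illustrated with a minimal sketch. It is not the code from this commit: the macro name SKETCH_RESTRICT, the function name clip_add_bias_sketch, and its signature are assumptions made for illustration. The idea is that marking both pointers as non-aliasing and writing the clamp as a branch-free min/max gives the compiler a simple loop it can usually auto-vectorize.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical portability macro (an assumption, not the name used in onnxruntime).
// MSVC spells the qualifier __restrict; GCC and Clang accept __restrict__.
#if defined(_MSC_VER)
#define SKETCH_RESTRICT __restrict
#elif defined(__GNUC__) || defined(__clang__)
#define SKETCH_RESTRICT __restrict__
#else
#define SKETCH_RESTRICT
#endif

// Adds bias[i] to data[i] and clamps the sum to [-clip, clip], in place.
// The restrict qualifiers promise the compiler that bias and data do not
// overlap, so it does not have to reload bias after each store to data.
inline void clip_add_bias_sketch(float clip,
                                 const float* SKETCH_RESTRICT bias,
                                 float* SKETCH_RESTRICT data,
                                 std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    const float v = data[i] + bias[i];
    // Branch-free clamp: min/max typically map to SIMD min/max instructions.
    data[i] = std::min(std::max(v, -clip), clip);
  }
}
```

Whether a given compiler actually vectorizes this loop depends on the compiler version and optimization flags; with GCC, -O3 -fopt-info-vec reports which loops were vectorized.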