Commits


Commit 012b34dc4e6, authored by Tianlei Wu, committed via GitHub
Add --use_multi_head_attention in transformers fusion (#14198)

Add an option --use_multi_head_attention to fuse models with the MultiHeadAttention operator instead of the Attention operator, for testing purposes. Note that MultiHeadAttention can be used for both self-attention and cross-attention, while the Attention operator handles self-attention only. The Attention operator contains packed Q/K/V weights for the input projection, whereas that input-projection MatMul is left outside the fused MultiHeadAttention operator.
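
For reference, a minimal sketch of how this option might be enabled through the Python fusion API. The `use_multi_head_attention` attribute on `FusionOptions`, the file paths, and the head/hidden-size values are assumptions for illustration based on this commit's description, not a verified API reference.

```python
# Sketch (assumed API surface): enable MultiHeadAttention fusion
# instead of Attention fusion when optimizing a transformer model.
from onnxruntime.transformers.fusion_options import FusionOptions
from onnxruntime.transformers.optimizer import optimize_model

options = FusionOptions("bert")
# Assumed attribute corresponding to the new --use_multi_head_attention flag.
options.use_multi_head_attention = True

# Fuse the model; MultiHeadAttention covers self- and cross-attention,
# and the input-projection MatMul stays outside the fused operator.
model = optimize_model(
    "model.onnx",              # hypothetical input path
    model_type="bert",
    num_heads=12,              # example values
    hidden_size=768,
    optimization_options=options,
)
model.save_model_to_file("model_mha.onnx")
```

The equivalent command line would presumably be something like `python -m onnxruntime.transformers.optimizer --input model.onnx --output model_mha.onnx --model_type bert --use_multi_head_attention`, with paths and model type adjusted to the model under test.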