Commits


Tianlei Wu authored and GitHub committed 944bff0ad64
Support two stages onnx GPT-2 conversion (#14025)

### Description

Add support for ONNX conversion of GPT-2 in two stages:
* Stage 1 is the initial stage, which has an empty past state.
* Stage 2 has a non-empty past state, and sequence_length is 1.

Add a parameter --stage to specify the stage. For stage 1, we enable mask_index for Attention so that fused attention can be used in CUDA.

Other changes:
(1) use int32 inputs as default (otherwise, there is an error in inference)
(2) update gpt2_parity to include SkipLayerNormalization (see https://github.com/microsoft/onnxruntime/pull/13988) and EmbedLayerNormalization
(3) get all environment variables that might impact GPT-2 latency in benchmark_gpt2

### Motivation and Context

To test fused attention for the GPT-2 model for https://github.com/microsoft/onnxruntime/pull/13953.
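
As a rough illustration of how the two stages in this commit differ, the sketch below computes the dummy input shapes each stage would see. It is not the converter's code: the function name `gpt2_stage_input_shapes`, the input names (`input_ids`, `position_ids`, `attention_mask`, `past_{i}`), the past-state layout `(2, batch, num_heads, past_sequence_length, head_size)`, and the default model sizes are all assumptions made for the example. Only the stage rules (empty past for stage 1, sequence_length of 1 with non-empty past for stage 2) come from the description.

```python
from typing import Dict, Tuple


def gpt2_stage_input_shapes(
    stage: int,
    batch_size: int,
    sequence_length: int,
    past_sequence_length: int,
    num_layers: int = 12,   # assumed GPT-2 small sizes, for illustration only
    num_heads: int = 12,
    head_size: int = 64,
) -> Dict[str, Tuple[int, ...]]:
    """Return hypothetical input shapes for one of the two conversion stages."""
    if stage == 1:
        # Stage 1: initial stage, the past state is empty.
        past_sequence_length = 0
    elif stage == 2:
        # Stage 2: non-empty past state, one new token per step.
        if past_sequence_length < 1:
            raise ValueError("stage 2 requires a non-empty past state")
        sequence_length = 1
    else:
        raise ValueError("this sketch only models stage 1 and stage 2")

    total_sequence_length = past_sequence_length + sequence_length
    shapes: Dict[str, Tuple[int, ...]] = {
        # The commit switches these inputs to int32 by default;
        # only shapes are modeled here, not dtypes.
        "input_ids": (batch_size, sequence_length),
        "position_ids": (batch_size, sequence_length),
        "attention_mask": (batch_size, total_sequence_length),
    }
    for i in range(num_layers):
        # Assumed layout: key and value stacked along the first axis.
        shapes[f"past_{i}"] = (2, batch_size, num_heads, past_sequence_length, head_size)
    return shapes


stage1 = gpt2_stage_input_shapes(stage=1, batch_size=1, sequence_length=8, past_sequence_length=0)
stage2 = gpt2_stage_input_shapes(stage=2, batch_size=1, sequence_length=8, past_sequence_length=8)
```

In this sketch, stage 1 ignores any requested past length and stage 2 ignores the requested sequence length, which mirrors the constraint that each stage fixes one of the two dimensions.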