Tianlei Wu authored and GitHub committed 1866a9d8187
Use the lowest float for causal mask (#16369)

Always set the causal mask to the lowest float. Note that since Hugging Face transformers v4.21, GPT-2 uses the lowest half-precision value for FP16 and the lowest float for FP32: https://github.com/huggingface/transformers/blob/66fd3a8d626a32989f4569260db32785c6cbf42a/src/transformers/models/gpt2/modeling_gpt2.py#L199

We assume that most FP16 ONNX models are converted from FP32 models, so for consistency we decided to use the lowest float32 value for both half and float models.

The mask_filter_value attribute only applies to a raw attention mask (2D, 3D or 4D). For a 1D mask, a masked item has probability 0.0 after softmax, which is equivalent to using the lowest float as the filter value. In particular:

* For BERT models, when users use a 1D mask (required by FMHA), mask_filter_value is not applicable.
* For BERT or GPT-2, when a fused kernel is used, mask_filter_value has no impact.

### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/12843
https://github.com/microsoft/onnxruntime/issues/14363
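To illustrate the effect described above, here is a minimal NumPy sketch (not the actual ONNX Runtime kernel code) showing how filling the causal (future) positions of an attention-score matrix with the lowest float32 value drives their probabilities to exactly 0.0 after softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.standard_normal((seq_len, seq_len)).astype(np.float32)

# Causal mask: position i may only attend to positions <= i.
# Upper-triangular entries (k=1) are the "future" positions to mask out.
causal = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Fill masked positions with the lowest float32 value, as the commit does.
mask_filter_value = np.finfo(np.float32).min
masked_scores = np.where(causal, mask_filter_value, scores)

# Upcast before softmax so the large-magnitude fill value does not overflow
# during the max-subtraction step.
probs = softmax(masked_scores.astype(np.float64), axis=-1)

# All future positions now receive probability 0.0, matching the behavior
# of a 1D mask where masked items are simply dropped from the softmax.
```

This is why the lowest float behaves identically to a 1D key-padding mask: `exp(lowest_float - max_score)` underflows to zero, so the masked positions contribute nothing to the attention weights.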