Commits


kunal-vaishnavi authored and GitHub committed 44d8ad93b20
Whisper Timestamps and Temperature (#19509) ### Description This PR updates exporting and running the Whisper model with beam search by adding the following. - Adds temperature as a graph input to the exported model - Fixes the token ids by adding them as attributes to `WhisperBeamSearch` - Fixes the timestamps test cases so they pass now - Fixes a bug with invoking `torch.onnx.export` - Cleans up the Whisper scripts and groups the arguments in `convert_to_onnx.py` - Adds a `requirements.txt` file to specify package dependencies - Adds `whisper-large-v3` to list of pretrained models - Fixes a bug with missing cross-attention KV cache inputs in the decoder subgraph ### Motivation and Context - This is a follow-up to [this PR](https://github.com/microsoft/onnxruntime/pull/19188). - The incorrect token ids in the timestamps processor were first noticed during [this PR review](https://github.com/microsoft/onnxruntime/pull/17500#discussion_r1333520007). When they were originally added in [this PR](https://github.com/microsoft/onnxruntime/pull/15853), the offsets were previously constant across the Whisper model sizes. When comparing the new `whisper-large-v3` variant, the English-only variants (e.g. `whisper-tiny.en`), and the original variants (e.g. `whisper-tiny`), both the values and the offsets differ. Therefore, it is easier to set the token ids as attributes to `WhisperBeamSearch` when exporting to ensure the right values are used in the timestamps processor. - The Hugging Face API for returning timestamps and the expected outputs from the PyTorch model have both changed. - The fix for `torch.onnx.export` is a follow-up to [this PR review](https://github.com/microsoft/onnxruntime/pull/17179#issuecomment-1683001470). - The argument grouping is a follow-up to [this PR review](https://github.com/microsoft/onnxruntime/pull/17500#discussion_r1333521721). - Specific package versions are needed to run the Whisper scripts and the `requirements.txt` file ensures that these versions are installed. - The `whisper-large-v3` variant is released and should be in the list of official pretrained models. - After the changes from [this PR](https://github.com/microsoft/onnxruntime/pull/17316), the exported model is not loading in an ORT inference session because the cross-attention KV cache inputs are missing in the decoder subgraph.