kunal-vaishnavi authored and GitHub committed d1b85f5fb4f
Reduce LLaMA memory usage (#18181)

### Description
This PR reduces memory usage when exporting and benchmarking LLaMA.

### Motivation and Context
- Exporting: the PyTorch model is now deleted from memory as soon as the export succeeds, instead of being kept alive until after the ONNX model has also been converted to the desired precision (see the first sketch below).
- Benchmarking: in the ONNX model with GroupQueryAttention, the KV cache inputs now share the same GPU memory across both the prompt and token-generation benchmarks, rather than each benchmark allocating its own buffers (see the second sketch below).
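A minimal sketch of the export-side change, not the repository's actual export script: a toy `torch.nn.Linear` stands in for LLaMA, and the path and shapes are illustrative. The point is the ordering: the PyTorch weights are released immediately after `torch.onnx.export` returns, so the later precision-conversion step no longer has to coexist with the PyTorch copy in memory.

```python
import gc

import torch

# Toy stand-in for the LLaMA model; path and shapes are illustrative.
model = torch.nn.Linear(16, 16)
dummy_input = torch.randn(1, 16)

torch.onnx.export(model, (dummy_input,), "model.onnx")

# Free the PyTorch weights as soon as the export succeeds, rather than
# holding them through the ONNX precision-conversion step that follows.
del model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# ... convert model.onnx to the desired precision here, with the
# PyTorch copy of the weights already released ...
```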
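And a sketch of the benchmarking-side idea using ONNX Runtime I/O binding: the KV cache buffers are allocated on the GPU once as `OrtValue`s and the same bound buffers are reused for both benchmark phases. The model path, input names (`past_key_values.0.key`/`.value`), shapes, and run counts below are assumptions for illustration, not the benchmark script's actual values.

```python
import numpy as np
import onnxruntime as ort

# Illustrative KV cache shape: (batch, kv_heads, max_seq_len, head_dim).
shape = (1, 8, 2048, 128)
num_prompt_runs = num_token_runs = 10

sess = ort.InferenceSession("llama_gqa.onnx",
                            providers=["CUDAExecutionProvider"])

# Allocate the KV cache buffers on the GPU once.
past_key = ort.OrtValue.ortvalue_from_numpy(
    np.zeros(shape, dtype=np.float16), "cuda", 0)
past_value = ort.OrtValue.ortvalue_from_numpy(
    np.zeros(shape, dtype=np.float16), "cuda", 0)

binding = sess.io_binding()
binding.bind_ortvalue_input("past_key_values.0.key", past_key)
binding.bind_ortvalue_input("past_key_values.0.value", past_value)
# ... bind the remaining inputs and outputs here ...

# Both benchmark phases reuse the same bound GPU buffers, so the KV
# cache memory is allocated once instead of once per phase.
for _ in range(num_prompt_runs):   # prompt benchmark
    sess.run_with_iobinding(binding)
for _ in range(num_token_runs):    # token-generation benchmark
    sess.run_with_iobinding(binding)
```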