I fine-tuned a Transformer model without any memory issues.
But when I call model.generate(), CUDA runs out of memory.
This happens even for short prompts.
Training worked fine, so this is confusing.
This happens because Transformer models cache attention keys and values during autoregressive generation, so memory usage grows with every generated token.
During training, the sequence length is fixed and each forward pass is self-contained. During generation, the model keeps the key-value tensors for all previous tokens so it does not have to recompute them, which means memory grows at every decoding step. With long outputs or large batches, this can exceed what training required.
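To see why this adds up, here is a back-of-the-envelope estimate of KV-cache size; the model dimensions and batch size below are illustrative assumptions, not values from the question:

# Each generated token adds one key and one value vector per layer.
num_layers = 32      # decoder blocks (assumed)
hidden_size = 4096   # num_heads * head_dim (assumed)
batch_size = 8       # inference batch size (assumed)
dtype_bytes = 2      # fp16/bf16

bytes_per_token = 2 * num_layers * hidden_size * dtype_bytes * batch_size

for seq_len in (512, 2048, 8192):
    print(f"{seq_len} tokens -> ~{bytes_per_token * seq_len / 1024**3:.1f} GiB of KV cache")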
You can reduce memory pressure by disabling the key-value cache (which trades speed for memory) and by capping how many tokens are generated:
model.config.use_cache = False  # recompute attention each step instead of caching it
outputs = model.generate(input_ids, max_new_tokens=128)  # hard cap on generated tokens
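Depending on your transformers version, generate() also accepts these settings directly as generation arguments, which avoids mutating the model config; treat the exact keyword support as an assumption for your version:

outputs = model.generate(
    input_ids,
    max_new_tokens=128,  # limit output length
    use_cache=False,     # assumed to be forwarded to the generation config
)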
Also make sure inference runs in evaluation mode with gradients disabled, so no activations are kept for backpropagation:
model.eval()           # turn off dropout and other training-only behavior
with torch.no_grad():  # don't build the autograd graph
    outputs = model.generate(input_ids, max_new_tokens=128)
Using half-precision (model.half()) can also significantly reduce memory usage.
Common mistakes:
- Allowing unlimited generation length
- Forgetting torch.no_grad()
- Using training batch sizes during inference
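Putting it all together, here is a minimal inference sketch assuming a Hugging Face-style causal LM; model, tokenizer, and prompt stand in for your own fine-tuned model and inputs:

import torch

model = model.half().eval().to("cuda")  # fp16 weights, eval-mode behavior (no dropout)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # placeholder tokenizer/prompt

with torch.no_grad():  # no autograd graph, so no activation history is kept
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,  # bound generation length
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))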
The practical takeaway is that generation has a different memory profile than training: the key-value cache grows with every generated token, so inference can run out of memory even when fine-tuning did not.