I fine-tuned a Transformer model without any memory issues.
But when I call model.generate(), CUDA runs out of memory.
This happens even for short prompts.
Training worked fine, so this is confusing.
This happens because Transformer models cache attention keys and values during autoregressive generation, so memory usage grows with every generated token.
During training, the sequence length is fixed and each forward pass is self-contained. During generation, the model keeps the key-value tensors for all previous tokens so it does not have to recompute them, which means memory grows at every decoding step. With long outputs or large batches, this can exceed what training required.
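To see why this adds up, here is a back-of-the-envelope estimate of KV-cache size; the model dimensions and batch size below are illustrative assumptions, not values from the question:

# Each generated token adds one key and one value vector per layer.
num_layers = 32      # decoder blocks (assumed)
hidden_size = 4096   # num_heads * head_dim (assumed)
batch_size = 8       # inference batch size (assumed)
dtype_bytes = 2      # fp16/bf16

bytes_per_token = 2 * num_layers * hidden_size * dtype_bytes * batch_size

for seq_len in (512, 2048, 8192):
    print(f"{seq_len} tokens -> ~{bytes_per_token * seq_len / 1024**3:.1f} GiB of KV cache")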
You can reduce memory pressure by disabling the key-value cache (which trades speed for memory) and by capping how many tokens are generated:
model.config.use_cache = False  # recompute attention each step instead of caching it
outputs = model.generate(input_ids, max_new_tokens=128)  # hard cap on generated tokens
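Depending on your transformers version, generate() also accepts these settings directly as generation arguments, which avoids mutating the model config; treat the exact keyword support as an assumption for your version:

outputs = model.generate(
    input_ids,
    max_new_tokens=128,  # limit output length
    use_cache=False,     # assumed to be forwarded to the generation config
)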
Also make sure inference runs in evaluation mode with gradients disabled, so no activations are kept for backpropagation:
model.eval()           # turn off dropout and other training-only behavior
with torch.no_grad():  # don't build the autograd graph
    outputs = model.generate(input_ids, max_new_tokens=128)
Using half-precision (model.half()) can also significantly reduce memory usage.
Common mistakes:
- Allowing unlimited generation length
- Forgetting torch.no_grad()
- Using training batch sizes during inference
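Putting it all together, here is a minimal inference sketch assuming a Hugging Face-style causal LM; model, tokenizer, and prompt stand in for your own fine-tuned model and inputs:

import torch

model = model.half().eval().to("cuda")  # fp16 weights, eval-mode behavior (no dropout)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # placeholder tokenizer/prompt

with torch.no_grad():  # no autograd graph, so no activation history is kept
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,  # bound generation length
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))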
The practical takeaway is that generation has a different memory profile than training: the key-value cache grows with every generated token, so inference can run out of memory even when fine-tuning did not.