CUDA
This usually happens because memory accumulates across iterations instead of being freed.
The most common cause is storing computation graphs unintentionally, often by appending loss tensors or model outputs to a list without detaching them. Over time, GPU memory fills up regardless of batch size.
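For illustration, here is a minimal sketch of that pattern; the model, shapes, and loss are placeholders (and assume a CUDA device is available), not details taken from your code:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 1).cuda()     # placeholder model
criterion = nn.MSELoss()

losses = []
for step in range(1000):
    x = torch.randn(32, 128, device="cuda")
    y = torch.randn(32, 1, device="cuda")
    loss = criterion(model(x), y)

    # Leaks: keeps the whole computation graph of every iteration alive
    # losses.append(loss)

    # Safe: stores a plain Python float, so the graph can be freed
    losses.append(loss.item())
```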
Make sure you call optimizer.zero_grad() every iteration and avoid saving tensors that require gradients. If you need to log values, convert them to scalars with .item().

In transformer workloads, sequence length matters more than batch size. A batch of 2 with long sequences can exceed memory limits faster than a batch of 16 with shorter inputs.
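A minimal training-loop sketch showing both points; the model, optimizer, and random stand-in batches are assumptions, not your actual setup:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for step in range(1000):
    # Stand-in batch; in practice this comes from your DataLoader
    inputs = torch.randn(32, 128, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()                 # clear gradients from the previous step
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    if step % 100 == 0:
        # .item() returns a plain Python float, so no graph reference is kept
        print(f"step {step}: loss {loss.item():.4f}")
```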
Common mistakes:

- Forgetting torch.no_grad() during evaluation (see the sketch after this list)
- Logging full tensors instead of scalars
- Increasing max token length without adjusting batch size
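A minimal evaluation sketch for the first point; the model and the random stand-in batches are hypothetical, the important part is the torch.no_grad() context:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()   # placeholder model
model.eval()                        # disable dropout / batch-norm updates

correct, total = 0, 0
with torch.no_grad():               # no graphs are built, so activations are freed immediately
    for _ in range(50):             # stand-in for iterating over a validation loader
        inputs = torch.randn(32, 128, device="cuda")
        targets = torch.randint(0, 10, (32,), device="cuda")
        preds = model(inputs).argmax(dim=1)
        correct += (preds == targets).sum().item()   # .item() keeps the tally a scalar
        total += targets.size(0)

print(f"accuracy: {correct / total:.3f}")
```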
Monitoring GPU memory with a profiler will usually reveal the leak within a few iterations.
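If you don't want to attach a full profiler, PyTorch's built-in counters are often enough; a small helper like the hypothetical log_gpu_memory below can be dropped into the loop:

```python
import torch

# Call this at the end of each training iteration; a leak shows up as a steadily
# climbing "allocated" figure rather than a value that plateaus after warm-up.
def log_gpu_memory(step):
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"step {step}: allocated {allocated:.1f} MiB, reserved {reserved:.1f} MiB")

# torch.cuda.memory_summary() prints a more detailed breakdown if needed.
```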