I am training a convolutional neural network on a custom image dataset using PyTorch.
For the first few batches the loss looks normal, but suddenly it becomes NaN and never recovers.
There are no crashes or stack traces; the training metrics simply become meaningless.
I have tried restarting training but the same thing keeps happening every time.
Why does my CNN suddenly start giving NaN loss after a few training steps?
Abhimanyu Singh
This happens because invalid numerical values are entering the network, usually from broken data or unstable gradients.
In CNN pipelines, a single corrupted image, division by zero during normalization, or an aggressive learning rate can inject
inf or NaN values into the forward pass. Once that happens, every subsequent layer propagates the corruption and the loss becomes undefined.

Start by checking whether any batch contains bad values:
import torch

# Run right after loading each batch, where `images` is the batch tensor:
if torch.isnan(images).any() or torch.isinf(images).any():
    print("Invalid batch detected")
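If you want to locate the offending sample rather than just detect it mid-training, a one-off scan over the dataset works too. This is a minimal sketch; the `dataset` list here is a stand-in for your actual `torch.utils.data.Dataset`:

```python
import torch

# Stand-in dataset of image tensors; replace with your own Dataset.
dataset = [torch.rand(3, 8, 8) for _ in range(4)]
dataset[2][0, 0, 0] = float("nan")  # simulate one corrupted image

# Collect indices of any sample containing NaN or inf values.
bad = [i for i, img in enumerate(dataset)
       if torch.isnan(img).any() or torch.isinf(img).any()]
print(bad)  # → [2]
```

Once you know the indices, you can inspect or drop those files instead of guessing which batch poisoned the run.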
Make sure images are converted to floats and normalized only once, for example by dividing by 255 or using mean–std normalization. If the data is clean, reduce the learning rate and apply gradient clipping:
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
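Placement matters: clipping must happen after `backward()` has populated the gradients and before `optimizer.step()` consumes them. A minimal sketch of a single training step, with a toy model and dummy batch standing in for your pipeline:

```python
import torch
import torch.nn as nn

# Toy stand-ins for your actual model, optimizer, and data.
model = nn.Conv2d(3, 8, kernel_size=3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

images = torch.rand(4, 3, 16, 16)    # dummy batch already in [0, 1]
targets = torch.rand(4, 8, 16, 16)

optimizer.zero_grad()
loss = criterion(model(images), targets)
loss.backward()
# Clip AFTER backward() and BEFORE step(), so the optimizer
# applies the rescaled gradients, never the raw ones.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

`clip_grad_norm_` rescales all gradients together so their total norm never exceeds `max_norm`, which caps the size of any single parameter update.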
Mixed-precision training can also cause this: float16 has a much narrower dynamic range than float32, so large activations or gradients can overflow to inf. Temporarily disable AMP if you are using it, and re-enable it only once the run is stable.
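One convenient way to do that is to gate autocast behind a single flag, so you can toggle AMP on and off without rewriting the loop. A sketch, assuming the `torch.autocast` context manager; the `use_amp` flag and toy model are illustrative:

```python
import torch
import torch.nn as nn

use_amp = False  # set True again once NaNs are ruled out
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
images = torch.rand(2, 3, 16, 16, device=device)

# With enabled=False, autocast is a no-op and everything
# runs in full float32 precision.
with torch.autocast(device_type=device, enabled=use_amp):
    out = model(images)

print(out.dtype)  # → torch.float32 when use_amp is False
```

If the NaNs disappear with AMP off, keep full precision for the unstable layers (e.g. the loss computation) or use a `GradScaler` so small gradients do not underflow in float16.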