Why does my RNN produce very unstable predictions for longer sequences?
This happens because standard RNNs suffer from vanishing and exploding gradients on long sequences.
As the sequence grows, important signals either fade out or blow up, making learning unstable. That is why LSTM and GRU were created.
Switch to LSTM or GRU layers and use gradient clipping:
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
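For context, here is a minimal sketch of how these pieces fit together in one training step (the model, data, and hyperparameters are hypothetical stand-ins):
import torch
import torch.nn as nn

# Toy setup: batch of 4 sequences, length 50, 10 features each.
lstm = nn.LSTM(input_size=10, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params)

x = torch.randn(4, 50, 10)
y = torch.randn(4, 1)

optimizer.zero_grad()
out, _ = lstm(x)                       # out: (batch, seq_len, hidden)
pred = head(out[:, -1])                # predict from the last time step
loss = nn.functional.mse_loss(pred, y)
loss.backward()
# Clip gradients before the optimizer step to prevent explosions.
torch.nn.utils.clip_grad_norm_(params, 1.0)
optimizer.step()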
Common mistakes:
Using vanilla RNNs for long text
Not clipping gradients
Feeding very long sequences without truncation
The practical takeaway is that plain RNNs are not designed for long-term memory.
Why does my CNN predict only one class no matter what image I give it?
This happens when the model has collapsed to predicting the most dominant class in the dataset.
If one class appears much more often than others, the CNN can minimize loss simply by always predicting it. This gives decent training accuracy but useless predictions.
Check your class distribution. If it is skewed, use class weighting or balanced sampling:
loss = nn.CrossEntropyLoss(weight=class_weights)
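To ground that, here is a minimal sketch of building inverse-frequency class weights from integer labels (the label tensor below is just an example, and it assumes every class appears at least once):
import torch
import torch.nn as nn

# Example labels for a 3-class problem; class 0 dominates.
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 2])

# Weight each class by the inverse of its frequency so that
# rare classes contribute more to the loss.
counts = torch.bincount(labels, minlength=3).float()
class_weights = counts.sum() / (len(counts) * counts)

loss = nn.CrossEntropyLoss(weight=class_weights)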
Also verify that your labels are correctly aligned with your images.
Common mistakes:
Highly imbalanced datasets
Images shuffled independently of their labels
Incorrect label encoding
The practical takeaway is that class imbalance silently trains your CNN to cheat.
Why does my image classifier have very high training accuracy but terrible test accuracy?
This happens because the model is overfitting to the training data.
The network is learning specific pixel patterns instead of general features, so it performs well only on images it has already seen.
You need to increase generalization by adding data augmentation, dropout, and regularization:
transforms.RandomHorizontalFlip()
transforms.RandomRotation(10)
Also reduce model complexity or add weight decay in the optimizer.
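A minimal sketch combining augmentation and weight decay (the model here is a stand-in, not your architecture):
import torch
import torch.nn as nn
from torchvision import transforms

# Hypothetical augmentation pipeline, applied to the training set only.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
])

# Stand-in model with dropout; replace with your own CNN.
model = nn.Sequential(nn.Flatten(), nn.Dropout(0.5), nn.Linear(3 * 32 * 32, 10))

# weight_decay applies L2 regularization on every update.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)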
Common mistakes:
Training on small datasets
Using too many layers
Not shuffling data
The practical takeaway is that high training accuracy without test accuracy means your CNN is memorizing, not understanding.
Why does my Transformer run out of GPU memory only during text generation?
This happens because Transformer models store attention history during generation, which makes memory usage grow with every generated token.
During training, the sequence length is fixed. During generation, the model keeps cached key-value tensors for all previous tokens, so memory usage increases at each step. This can easily exceed what training required.
If memory is the bottleneck, you can disable the key-value cache (trading generation speed for memory) and limit generation length:
model.config.use_cache = False
outputs = model.generate(input_ids, max_new_tokens=128)
Also make sure inference runs in evaluation mode with gradients disabled:
model.eval()
with torch.no_grad():
    outputs = model.generate(input_ids, max_new_tokens=128)
Using half-precision (model.half()) can also significantly reduce memory usage.
Common mistakes:
Allowing unlimited generation length
Forgetting torch.no_grad()
Using training batch sizes during inference
The practical takeaway is that Transformers consume more memory while generating than while training.
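Putting these pieces together, a minimal generation sketch (the gpt2 checkpoint is only an example; drop the .cuda() calls on CPU):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").half().cuda()
model.eval()

input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids.cuda()
with torch.no_grad():
    # Bounding the generation length keeps the KV cache from growing unchecked.
    outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))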
Why does my classifier become unstable after fine-tuning on new data?
This happens because of catastrophic forgetting. When fine-tuned on new data, neural networks overwrite weights that were important for earlier knowledge.
Without constraints, gradient updates push the model to fit the new data at the cost of old patterns. This is especially common when the new dataset is small or biased.
Using lower learning rates, freezing early layers, or mixing old and new data during training reduces this problem.
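A minimal sketch of two of those fixes, assuming a recent torchvision and a ResNet-18 backbone (swap in your own model):
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze everything except the final classification layer ("fc" in ResNet).
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

# Fine-tune the remaining parameters with a deliberately low learning rate.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)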
Why does my training crash when I increase sequence length in Transformers?
This happens because Transformer memory grows quadratically with sequence length. Attention layers store interactions between all token pairs.
Long sequences rapidly exceed GPU memory, even if batch size stays the same.
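A back-of-the-envelope estimate of attention-score memory makes the quadratic growth concrete (the batch size, head count, layer count, and fp16 storage are assumptions, not measurements):
# Memory for the attention score matrices alone:
# batch * heads * layers * seq_len^2 values.
batch, heads, layers, bytes_per_value = 8, 12, 12, 2  # fp16
for seq_len in (512, 2048):
    total_bytes = batch * heads * layers * seq_len ** 2 * bytes_per_value
    print(f"seq_len={seq_len}: {total_bytes / 1e9:.1f} GB")
Quadrupling the sequence length multiplies this term by sixteen.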
The practical takeaway is that Transformers are limited by attention scaling, not just model size.