Home/AI & Machine Learning/Page 3
Lost your password? Please enter your email address. You will receive a link and will create a new password via email.
Please briefly explain why you feel this question should be reported.
Please briefly explain why you feel this answer should be reported.
Please briefly explain why you feel this user should be reported.
Why does my fine-tuned LLM perform worse than the base model?
This happens when fine-tuning introduces noise or bias that overwrites useful pretrained knowledge. The most frequent cause is low-quality or inconsistent fine-tuning data. If your dataset is small, poorly labeled, or stylistically narrow, the model may over-specialize and lose general reasoning abiRead more
This happens when fine-tuning introduces noise or bias that overwrites useful pretrained knowledge.
The most frequent cause is low-quality or inconsistent fine-tuning data. If your dataset is small, poorly labeled, or stylistically narrow, the model may over-specialize and lose general reasoning ability.
Another common issue is using an aggressive learning rate. Large updates can destroy pretrained representations in just a few steps.
To fix this, reduce the learning rate significantly and limit the number of trainable parameters using techniques like LoRA or partial layer freezing. Always evaluate against a held-out baseline prompt set to detect regression early.
Common mistakes:
Fine-tuning on fewer than a few thousand high-quality samples
Not validating against base model outputs
Training for too many epochs
Fine-tuning should nudge behavior, not replace core knowledge.
See lessWhy does my retrained model perform worse on old data?
This is a classic case of catastrophic forgetting. When retraining only on recent data, the model adapts to new patterns while losing performance on older distributions. This is common in incremental learning setups. To fix it, mix a representative sample of historical data into retraining or use reRead more
This is a classic case of catastrophic forgetting.
When retraining only on recent data, the model adapts to new patterns while losing performance on older distributions. This is common in incremental learning setups.
To fix it, mix a representative sample of historical data into retraining or use rehearsal techniques. Regularization toward previous weights can also help.
Common mistakes:
Training only on the latest data window
Assuming more recent data is always better
Dropping legacy edge cases
Retraining should expand knowledge, not replace it.
See lessWhat causes NaN losses during model training?
NaNs usually come from invalid numerical operations. Common sources include division by zero, log of zero, exploding gradients, or invalid input values. In deep models, this often appears after a few unstable updates. Start by enabling gradient clipping and lowering the learning rate. Then check youRead more
NaNs usually come from invalid numerical operations.
Common sources include division by zero, log of zero, exploding gradients, or invalid input values. In deep models, this often appears after a few unstable updates.
Start by enabling gradient clipping and lowering the learning rate. Then check your input data for NaNs or infinities before it enters the model.
If using mixed precision, confirm loss scaling is enabled correctly.
Common mistakes:
Normalizing with zero variance features
Ignoring data validation
Training with unchecked custom loss functions
NaNs are symptoms—fix the instability, not the symptom.
See lessWhy does my model pass offline tests but fail A/B experiments?
Offline metrics often fail to capture real user behavior. In production, user interactions introduce feedback loops, latency constraints, and distribution shifts that static datasets don’t reflect. A model may optimize for offline accuracy but degrade user experience. Instrument live metrics and anaRead more
Offline metrics often fail to capture real user behavior.
In production, user interactions introduce feedback loops, latency constraints, and distribution shifts that static datasets don’t reflect. A model may optimize for offline accuracy but degrade user experience.
Instrument live metrics and analyze segment-level performance. Often the failure is localized to specific cohorts or edge cases.
Common mistakes:
Relying on a single offline metric
Ignoring latency and timeouts
Deploying without gradual rollout
Offline success is necessary but never sufficient.
See lessHow can prompt length cause unexpected truncation?
LLMs have strict context length limits. If system messages, instructions, and user input exceed this limit, earlier tokens are dropped silently. This often removes critical instructions. Always calculate token usage explicitly and reserve space for the response. Truncate user input, not system prompRead more
LLMs have strict context length limits.
If system messages, instructions, and user input exceed this limit, earlier tokens are dropped silently. This often removes critical instructions.
Always calculate token usage explicitly and reserve space for the response. Truncate user input, not system prompts.
Common mistakes:
Assuming character count equals token count
Appending logs or history blindly
Ignoring model-specific context limits
Context budgeting is essential for reliable prompting.
See lessWhy does my inference latency increase after model optimization?
Some optimizations improve throughput but hurt single-request latency. Batching, quantization, or graph compilation can introduce overhead that only pays off at scale. In low-traffic scenarios, this overhead dominates. Profile latency at realistic request rates and choose optimizations accordingly.Read more
Some optimizations improve throughput but hurt single-request latency.
Batching, quantization, or graph compilation can introduce overhead that only pays off at scale. In low-traffic scenarios, this overhead dominates. Profile latency at realistic request rates and choose optimizations accordingly.
Common mistakes:
Optimizing without workload profiling
Using batch inference for real-time APIs
Ignoring cold-start costs
Optimize for your actual deployment context.
See lessHow do I detect silent label leakage during training?
Label leakage occurs when future or target information sneaks into input features. This often happens through timestamp misuse, aggregated features, or improperly joined datasets. The model appears highly accurate but fails in production. Audit features for causal validity and simulate prediction usRead more
Label leakage occurs when future or target information sneaks into input features.
This often happens through timestamp misuse, aggregated features, or improperly joined datasets. The model appears highly accurate but fails in production. Audit features for causal validity and simulate prediction using only information available at inference time.
Common mistakes:
Using post-event aggregates
Joining tables without time constraints
Trusting unusually high validation scores
If performance seems too good, investigate.
See lessWhy does my model’s accuracy fluctuate wildly between training runs?
Non-determinism is the usual culprit. Random initialization, data shuffling, parallelism, and GPU kernels all introduce variance. Without controlled seeds, results will differ. Set seeds across libraries and disable non-deterministic operations where possible. Expect some variance, but large swingsRead more
Non-determinism is the usual culprit.
Random initialization, data shuffling, parallelism, and GPU kernels all introduce variance. Without controlled seeds, results will differ.
Set seeds across libraries and disable non-deterministic operations where possible. Expect some variance, but large swings indicate instability.
Common mistakes:
Setting only one random seed
Comparing single-run results
Ignoring hardware differences
Reproducibility requires deliberate configuration
See lessWhat causes “CUDA out of memory” errors even with a small batch size?
This usually happens because memory is being accumulated across iterations rather than freed correctly. The most common cause is storing computation graphs unintentionally, often by appending loss tensors or model outputs to a list without detaching them. Over time, GPU memory fills up regardless ofRead more
This usually happens because memory is being accumulated across iterations rather than freed correctly.
The most common cause is storing computation graphs unintentionally, often by appending loss tensors or model outputs to a list without detaching them. Over time, GPU memory fills up regardless of batch size.
Make sure you call
optimizer.zero_grad()every iteration and avoid saving tensors that require gradients. If you need to log values, convert them to scalars using.item().In transformer workloads, sequence length matters more than batch size. A batch of 2 with long sequences can exceed memory limits faster than a batch of 16 with shorter inputs.
Common mistakes:
Forgetting
torch.no_grad()during evaluationLogging full tensors instead of scalars
Increasing max token length without adjusting batch size
Monitoring GPU memory with a profiler will usually reveal the leak within a few iterations.
See lessWhy does my deployed model slowly become biased toward one class over time?
This usually happens when feedback loops in production reinforce certain predictions more than others. In many real systems, model outputs influence the data collected next. If one class is shown or acted upon more often, future training data becomes skewed toward that class. Over time, the model apRead more
This usually happens when feedback loops in production reinforce certain predictions more than others.
In many real systems, model outputs influence the data collected next. If one class is shown or acted upon more often, future training data becomes skewed toward that class. Over time, the model appears to “prefer” it, even if the original distribution was balanced.
To fix this, monitor class distributions in both predictions and incoming labels. Introduce sampling or reweighting during retraining so minority classes remain represented. In some systems, delaying or decoupling feedback from training helps break the loop.
Common mistakes:
Assuming bias only comes from training data. Retraining on production data without auditing it or monitoring accuracy but not class balance
Models don’t just learn from data — they learn from the systems around them.
See less