I rerun the same experiment multiple times, and the metrics fluctuate even with identical settings. This makes comparisons unreliable, and I'm not sure what to trust.
This is often caused by uncontrolled randomness in the pipeline. Randomness enters through data splits, shuffling, model weight initialization, and even parallel execution order, and each of these sources is governed by a seed. If those seeds aren't fixed consistently, results will vary from run to run.
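To see the effect, here is a minimal sketch (assuming NumPy and Python's standard random module are among the sources of randomness in your pipeline): run it twice without fixing a seed and the outputs differ, which is exactly what happens to your data splits and initial weights.

```python
import random

import numpy as np

# Without a fixed seed, every run shuffles differently, so any
# train/validation split derived from this order also changes.
indices = np.arange(10)
np.random.shuffle(indices)
print("split order:", indices)

# Same story for the standard library's RNG.
print("sampled rows:", random.sample(range(100), 5))
```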
Set seeds for all relevant libraries and document them as part of the experiment. Also check whether data ordering or sampling changes between runs. In distributed environments, nondeterminism can still occur due to hardware or parallelism, so expect small variations.
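A seed-setting helper might look like the sketch below, assuming a PyTorch-based pipeline (the function name set_seed and the value 42 are illustrative; swap in the equivalent calls for your framework):

```python
import random

import numpy as np
import torch  # assumption: PyTorch is the framework in use

def set_seed(seed: int = 42) -> None:
    """Fix the seed for every library that draws random numbers in this pipeline."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic GPU kernels; some operations still have no
    # deterministic implementation, so small run-to-run drift can remain.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)  # record this value alongside the rest of the experiment's settings
```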
Common mistakes include:
- setting a seed in only one library,
- assuming deterministic behavior by default, and
- comparing runs across different environments (see the sketch below for one way to catch this).
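The last mistake becomes easy to detect if each run records its seed and environment next to its metrics. A small sketch, assuming NumPy is installed (the helper name experiment_metadata is hypothetical):

```python
import json
import platform
import random
import sys

import numpy as np

def experiment_metadata(seed: int) -> dict:
    """Collect the settings that must match before two runs are comparable."""
    return {
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "numpy": np.__version__,
    }

seed = 42
random.seed(seed)
np.random.seed(seed)

# Store this next to the metrics; if any field differs between two runs,
# treat the comparison as cross-environment rather than a true repeat.
print(json.dumps(experiment_metadata(seed), indent=2))
```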
The takeaway is that reproducibility requires intentional control, not assumptions.