My model uses both image and text inputs.
It works well when both are provided, but if either modality is missing, the outputs become random or broken.
Real-world data is often incomplete.
This happens because the model was never trained to handle missing modalities. During training, it learned to rely on both image and text features simultaneously, so removing one breaks the learned representations.
Neural networks do not automatically know how to compensate for missing data. If every training example contains all inputs, the model assumes they will always be present and builds internal dependencies around them.
To fix this, you must train the model with masked or dropped modalities so it learns to fall back on whatever information is available. This is standard practice in robust multimodal systems.
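A minimal sketch of modality dropout during training, assuming (hypothetically) that your image and text encoders each produce a `(batch, dim)` feature matrix before fusion. Per sample, one modality's features are zeroed with some probability (never both), so the fusion layers learn to produce sensible outputs from either modality alone:

```python
import numpy as np

def modality_dropout(image_feats, text_feats, p_drop=0.3, rng=None):
    """Randomly zero out one modality per sample during training.

    image_feats, text_feats: (batch, dim) feature arrays.
    With probability p_drop a sample loses EITHER its image OR its
    text features (never both), simulating missing-modality inputs.
    """
    rng = rng or np.random.default_rng()
    batch = image_feats.shape[0]
    img, txt = image_feats.copy(), text_feats.copy()
    drop = rng.random(batch) < p_drop    # which samples lose a modality
    pick_img = rng.random(batch) < 0.5   # True -> drop image, False -> drop text
    img[drop & pick_img] = 0.0
    txt[drop & ~pick_img] = 0.0
    return img, txt
```

Apply this only at training time (like ordinary dropout) and evaluate on deliberately incomplete batches so you can verify the fallback behavior. The rate `p_drop` and the 50/50 choice of which modality to drop are illustrative defaults, not prescribed values.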
Common mistakes:
Training only on complete data
No modality dropout
Assuming fusion layers are adaptive
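On the last point: a plain concatenation fusion has no notion of "absent", so feeding zeros at inference silently shifts its input distribution. One common alternative (sketched here with hypothetical names, using random vectors where a real model would use trained parameters) is to substitute a learned placeholder embedding for the missing modality:

```python
import numpy as np

class FusionWithPlaceholders:
    """Concatenation fusion that substitutes a placeholder vector for a
    missing modality instead of zeros. In a real model the placeholders
    are trainable parameters learned jointly with the network; random
    vectors here are stand-ins for illustration only.
    """
    def __init__(self, dim, rng=None):
        rng = rng or np.random.default_rng()
        self.img_placeholder = rng.normal(size=dim)
        self.txt_placeholder = rng.normal(size=dim)

    def __call__(self, image_feats=None, text_feats=None):
        img = self.img_placeholder if image_feats is None else image_feats
        txt = self.txt_placeholder if text_feats is None else text_feats
        return np.concatenate([img, txt])
```

The design point is that "missing" becomes an explicit, learnable signal rather than an out-of-distribution zero vector, which is why this pairs naturally with modality-dropout training.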
The practical takeaway is that multimodal robustness must be trained explicitly.