Short sequences run fine, but longer sequences crash with GPU out-of-memory errors. No code changes were made; only the input sequence length increased.
This happens because Transformer attention memory grows quadratically with sequence length: self-attention computes and stores an interaction score for every pair of tokens, so the score matrix alone has shape seq_len × seq_len per head. Longer sequences therefore exhaust GPU memory quickly, even when the batch size and the model itself stay the same.
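To see the quadratic growth concretely, here is a rough back-of-the-envelope sketch. The batch size, head count, and fp16 storage are assumed values for illustration, not taken from your setup; it estimates only the attention score matrix for one layer.

```python
def attention_score_bytes(seq_len, batch=8, heads=12, bytes_per_elem=2):
    # Scores form a (batch, heads, seq_len, seq_len) tensor: quadratic in seq_len.
    return batch * heads * seq_len * seq_len * bytes_per_elem

for n in (512, 2048, 8192):
    print(f"seq_len={n:5d} -> ~{attention_score_bytes(n) / 2**30:.2f} GiB per layer")
```

With these assumed settings, 512 tokens needs about 0.05 GiB of scores per layer, but 8,192 tokens already needs roughly 12 GiB per layer, before counting activations, weights, or other layers.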
The practical takeaway is that usable context length is limited by attention's quadratic scaling, not just by model size: quadrupling the sequence length multiplies attention memory by sixteen.