Short sequences run fine, but longer sequences crash with GPU out-of-memory errors. No code changes were made; only the input sequence length increased.
This happens because Transformer attention memory grows quadratically with sequence length: self-attention computes and stores an interaction score for every pair of tokens, so the score matrix alone has shape seq_len × seq_len per head. Longer sequences therefore exhaust GPU memory quickly, even when the batch size and the model itself stay the same.
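To see the quadratic growth concretely, here is a rough back-of-the-envelope sketch. The batch size, head count, and fp16 storage are assumed values for illustration, not taken from your setup; it estimates only the attention score matrix for one layer.

```python
def attention_score_bytes(seq_len, batch=8, heads=12, bytes_per_elem=2):
    # Scores form a (batch, heads, seq_len, seq_len) tensor: quadratic in seq_len.
    return batch * heads * seq_len * seq_len * bytes_per_elem

for n in (512, 2048, 8192):
    print(f"seq_len={n:5d} -> ~{attention_score_bytes(n) / 2**30:.2f} GiB per layer")
```

With these assumed settings, 512 tokens needs about 0.05 GiB of scores per layer, but 8,192 tokens already needs roughly 12 GiB per layer, before counting activations, weights, or other layers.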
The practical takeaway is that usable context length is limited by attention's quadratic scaling, not just by model size: quadrupling the sequence length multiplies attention memory by sixteen.