If you’ve spent any time building real software with LLMs, you’ve probably had the same thought I’ve had:
“If the model just had more context, this would all be easier.”
And you’re not wrong.
But the uncomfortable truth is that “just make the context window bigger” runs into some very real scaling walls: some mathematical, some hardware-related, and some that are simply product economics. Context windows will keep improving, but straight-line growth to multi-million tokens for everyone, all the time, is difficult and often impractical. This article walks through some of the math behind why we can’t expect context windows to keep doubling forever. It’s also worth noting that the cost shows up in two places: attention at inference time and training. We’ll look at both below.
State-of-the-Art Context Window Sizes (January 2026)
Before we talk about limits, let’s anchor on what “big” looks like right now.
ChatGPT (GPT-5.2) — context differs by mode/tier
OpenAI publishes ChatGPT-specific limits by mode:
- GPT-5.2 Instant (Fast):
  - Free: 16K
  - Plus/Business: 32K
  - Pro/Enterprise: 128K
- GPT-5.2 Thinking:
  - Paid tiers: 196K
Source: OpenAI Help Center
OpenAI API (GPT-5.2) — larger than ChatGPT UI limits
In the API docs, GPT-5.2 is listed with:
- 400,000 token context window (API)
Source: OpenAI API model docs
Claude Sonnet 4.5 — up to 1M tokens (with availability caveats)
Anthropic’s Claude API docs state:
- Claude Sonnet 4 / 4.5: 1M token context window
- Currently in beta and limited to specific usage tiers / orgs
Source: Anthropic Claude API docs
Claude Opus 4.5 — 200K tokens (as documented)
Anthropic’s “What’s new in Claude 4.5” documentation lists Opus 4.5 as:
- 200K token context window
Source: Anthropic Claude API docs
So yes: we’re already in a world where 200K–400K is “normal for premium,” and 1M exists (sometimes).
Now… why isn’t everyone just shipping 5M or 10M next?
The core scaling problem: self-attention wants to look at everything
Most frontier LLMs are still based on the Transformer Architecture:
“In deep learning, the transformer is an artificial neural network architecture based on the multi-head attention mechanism, in which text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table.”
The Transformer’s superpower is also its curse:
- For a sequence of n tokens, self-attention compares tokens pairwise
- Every token is paired with every other token
- That creates an attention pattern that scales roughly like n²
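To make that concrete, here’s a toy, single-head attention pass in NumPy (random vectors, no batching, masking, or multi-head machinery) just to show where the n×n term comes from:

```python
import numpy as np

n, d = 1_024, 64                      # sequence length, per-head dimension
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))       # queries: one row per token
K = rng.standard_normal((n, d))       # keys:    one row per token
V = rng.standard_normal((n, d))       # values:  one row per token

# Every token's query is scored against every token's key:
scores = Q @ K.T / np.sqrt(d)         # shape (n, n)  <- the n² term
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
out = weights @ V                     # shape (n, d): one mixed vector per token

print(scores.shape)                   # (1024, 1024): n² pairwise comparisons
```

Optimized kernels (FlashAttention and friends) avoid materializing the full n×n matrix in memory, but they still perform on the order of n² pairwise work.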
A 2022 paper on arXiv, On the Computational Complexity of Self-Attention, notes that “Several approaches have been proposed to speed up self-attention mechanisms to achieve sub-quadratic running time; however, the large majority of these works are not accompanied by rigorous error guarantees.” The authors then “prove that the time complexity of self-attention is necessarily quadratic in the input length…” (a result conditional on a standard complexity-theoretic assumption).
Lacking “rigorous error guarantees” means there is no proven bound on how inaccurate the results can be. So, yes, you can go faster, but the results may be wrong, and we can’t say with any certainty how often that might happen or how wrong they might be.
What “quadratic” means in developer terms
If you double your context length, you don’t pay 2×.
You pay about 4× for the attention work.
So going from:
- 200K → 400K is ~4× attention cost
- 400K → 800K is another ~4× (~16× 200K)
- 800K → 1.6M is another ~4× (~64× 200K)
That’s the part that feels exponential in practice—even though the math is quadratic.
Here’s a tiny “napkin math” snippet:
Let attention cost ∝ n²
If n = 200,000 tokens:
  n² = 40,000,000,000 (40 billion)
If n = 1,000,000 tokens (5× longer):
  n² = 1,000,000,000,000 (1 trillion), i.e. 25× the 200K cost
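The same arithmetic as a runnable sketch (the cost units are arbitrary; only the ratios matter):

```python
# Napkin math for quadratic attention cost.
# The absolute numbers are meaningless; the ratios are the point.

def attention_cost(n_tokens: int) -> int:
    """Relative cost of full self-attention over n_tokens (proportional to n²)."""
    return n_tokens * n_tokens

baseline = attention_cost(200_000)

for n in (200_000, 400_000, 800_000, 1_000_000, 1_600_000):
    ratio = attention_cost(n) / baseline
    print(f"{n:>9,} tokens -> {attention_cost(n):.2e} units ({ratio:.0f}x the 200K cost)")

#   200,000 tokens -> 4.00e+10 units (1x the 200K cost)
#   400,000 tokens -> 1.60e+11 units (4x the 200K cost)
#   800,000 tokens -> 6.40e+11 units (16x the 200K cost)
# 1,000,000 tokens -> 1.00e+12 units (25x the 200K cost)
# 1,600,000 tokens -> 2.56e+12 units (64x the 200K cost)
```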
That’s the scaling pressure AI companies are fighting.
Large Context Sizes Impact Training as Well
A common misconception is:
“Long context is just a bigger buffer at inference time.”
This is, unfortunately, incorrect.
Long context affects:
Inference (runtime)
- More compute for attention
- More memory pressure
- Larger key-value (KV) caches
- Lower batching efficiency (and batching is where your throughput comes from)
- Higher latency and cost per request
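To make the KV-cache line item concrete, here’s a rough sizing sketch. The model shape below (80 layers, 8 KV heads, 128-dim heads, fp16 cache) is a hypothetical dense model, not any particular vendor’s architecture:

```python
def kv_cache_bytes(n_tokens: int,
                   n_layers: int = 80,        # hypothetical model shape, not a real vendor spec
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:   # fp16/bf16 cache entries
    """Rough KV-cache size for ONE request: a key and a value per head, per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value   # 2 = key + value
    return n_tokens * per_token

for n in (32_000, 200_000, 1_000_000):
    print(f"{n:>9,} tokens -> ~{kv_cache_bytes(n) / 2**30:.0f} GiB of KV cache per request")

#    32,000 tokens -> ~10 GiB of KV cache per request
#   200,000 tokens -> ~61 GiB of KV cache per request
# 1,000,000 tokens -> ~305 GiB of KV cache per request
```

Multiply that by the number of concurrent requests you’d like to batch together, and the batching-efficiency bullet above explains itself.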
Training
Training is worse because you’re also doing backprop, storing activations, etc. If you want models that truly understand long sequences (and not just “kinda tolerate them”), you generally need:
- Long-context training data
- Long-context training runs (expensive)
- Long-context evaluation harnesses (also expensive)
And yes, some models can be extended beyond their original training length using positional scaling techniques—but that doesn’t magically make long-context cheap or perfect.
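For example, position interpolation, one published positional-scaling technique, rescales positions so a longer sequence maps back into the positional range the model saw during training. A minimal sketch of the idea (NumPy, rotary-style angles only, not a full attention implementation):

```python
import numpy as np

def rope_angles(positions: np.ndarray, head_dim: int = 128, base: float = 10_000.0) -> np.ndarray:
    """Rotary-embedding rotation angles for each (position, frequency) pair."""
    freqs = base ** (-np.arange(0, head_dim, 2) / head_dim)    # shape (head_dim/2,)
    return np.outer(positions, freqs)                          # shape (n, head_dim/2)

trained_len, target_len = 8_192, 32_768
positions = np.arange(target_len)

# Naive extension: positions past trained_len produce angles the model never saw in training.
naive = rope_angles(positions)

# Position interpolation: squeeze all positions back into the trained range.
scaled = rope_angles(positions * (trained_len / target_len))

print(naive[:, 0].max(), scaled[:, 0].max())   # ~32767 vs ~8192 for the fastest frequency
```

Note that this does nothing to reduce the quadratic attention cost; it only makes longer positions look familiar to the model, and quality at the extended length still generally requires long-context fine-tuning.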
Eventually You Hit a Hardware Wall
Even if you had infinite GPU memory, you still run into:
- Memory bandwidth limits
- Kernel efficiency limits (attention kernels can only be optimized so far)
- Interconnect limits (multi-GPU attention isn’t free)
- Fragmentation and caching behavior (real systems get messy)
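To put a rough number on the bandwidth point: every decoded token has to stream the entire KV cache (on top of the model weights) through GPU memory. Reusing the hypothetical cache sizes from the earlier sketch and an assumed ~3,300 GiB/s of memory bandwidth (ballpark for a current high-end accelerator):

```python
def decode_ms_per_token(kv_cache_gib: float, bandwidth_gib_per_s: float = 3_300.0) -> float:
    """Lower bound on per-token decode latency from KV-cache reads alone (ignores weights, compute)."""
    return kv_cache_gib / bandwidth_gib_per_s * 1_000.0

for cache_gib in (10, 61, 305):        # KV-cache sizes from the earlier sketch
    print(f"~{cache_gib} GiB cache -> at least {decode_ms_per_token(cache_gib):.0f} ms/token in KV reads")

# ~10 GiB cache -> at least 3 ms/token in KV reads
# ~61 GiB cache -> at least 18 ms/token in KV reads
# ~305 GiB cache -> at least 92 ms/token in KV reads
```

That’s a floor for a single request, before weight reads, compute, or anything else competing for the same bandwidth.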
In practice, “bigger context” often means:
- Lower concurrency
- Higher tail latency
- More expensive infrastructure
- More engineering complexity
- More failure modes
That’s why you’ll see tiering, so that pricing can reflect some of these realities:
- bigger context in the API than in the consumer UI
- bigger context reserved for higher usage tiers
- “beta” flags on the truly huge windows
It’s not because vendors hate you; it’s because the bill shows up somewhere.
Bigger context doesn’t automatically mean better outcomes
There’s another uncomfortable truth:
A huge context window can become a huge junk drawer.
A model can’t treat every token as equally important. If you dump:
- logs
- specs
- partial code
- old requirements
- stale decisions
- contradictory notes
you will often get a worse result than if you gave it less but better-structured context.
This is the “software architecture” version of long-context prompting:
- The model is your runtime.
- Your prompt is your dependency graph.
- If you don’t prune and design it, you’re building a big ball of mud.
It’s a signal-to-noise ratio problem. A smaller context window that only contains relevant information is going to deliver better results than a large context with a lot of irrelevant details that must be analyzed before ultimately being discarded (or, in many cases, not discarded when they should have been).
Why “just keep growing context” becomes impractical
Here are the scaling constraints that tend to dominate as you push beyond today’s typical ranges:
1) Cost grows faster than value
Early context growth is high ROI:
- 4K → 32K is a game-changer
- 32K → 200K is very useful for certain workflows
- 200K → 1M is great for specific tasks
But beyond that, many tasks benefit more from:
- retrieval
- summarization
- working memory
- tool use
than from raw context expansion.
2) Training economics get brutal
Even if inference is “manageable,” training long-context models at scale is on another level in terms of cost and complexity. You’re trading a lot of compute to unlock a capability that many users won’t use on most requests.
3) Latency and throughput suffer
Large shared systems need to support many concurrent requests and users. With larger context windows, you can end up with a slower overall system that doesn’t scale as well as a smaller one would have on the same hardware. The result:
- one slow request can starve the whole system of resources
- infrequent, unpredictable lags frustrate users and ruin the UX
- concurrency, which is where the economics come from, takes the hit
Huge windows can make relatively simple queries (that should return results quickly) “pay” for the capacity that’s in place to support the worst-case (most complex) queries.
4) Organization and relevance become the actual bottleneck
At large scales, the hard problem becomes:
- “What should the model pay attention to?”
not:
- “Can we fit it all?”
The more scalable path: memory systems, not bigger buffers
Instead of treating context as an ever-growing input, the industry is increasingly leaning toward:
- retrieval (Retrieval-Augmented Generation, RAG)
- hierarchical summaries
- tool-backed memory
- file- and repo-aware indexing
- agent workflows that stage context deliberately (the Ralph Wiggum loop, et al.)
This is the difference between:
- keeping everything in RAM forever
vs
- building a filesystem with indexes, caches, and policies
In other words: context windows aren’t going away—but the winning systems won’t rely on them alone.
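As a deliberately tiny sketch of the retrieval piece: embed the corpus once, embed the query, keep only the top-scoring chunks, and build a bounded prompt from those. The toy_embed function below is a stand-in, not a real embedding model; a production system would use an embedding API plus a vector index:

```python
import numpy as np

def toy_embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in for a real embedding model: hashed bag-of-words, unit-normalized."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank pre-chunked documents by similarity to the query; keep only the top k."""
    chunk_vecs = np.stack([toy_embed(c) for c in chunks])
    scores = chunk_vecs @ toy_embed(query)           # cosine similarity (unit vectors)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(goal: str, query: str, chunks: list[str]) -> str:
    """Bounded prompt: the goal plus only the retrieved slices, never the whole corpus."""
    context = "\n\n".join(retrieve(query, chunks))
    return f"Goal: {goal}\n\nRelevant context:\n{context}\n\nQuestion: {query}"
```

The important property isn’t the similarity function; it’s that the prompt size is bounded by k, not by the size of the corpus.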
A practical example: “monolith prompt” vs “modular prompt”
If you’ve ever pasted a giant codebase into a model and gotten back nonsense, you’ve learned this the hard way.
The “monolith prompt”
- paste everything
- ask a vague question
- hope the model finds what matters
The “modular prompt”
- provide a narrow goal
- provide only the relevant files
- provide constraints and acceptance criteria
- ask for an incremental plan + tests
Here’s a structure that consistently performs better than “here’s my whole repo, now go do X”:
Goal:
- Add feature X without breaking Y
Constraints:
- Must keep endpoint contracts
- Must not change database schema
- Must keep <performance target>
Relevant files:
- FooController.cs
- FooService.cs
- FooRepository.cs
- Existing tests: FooServiceTests.cs
Ask:
1) Propose plan (phases/steps + risks)
2) Provide minimal diff
3) Provide tests
4) Explain tradeoffs
This is how you keep long context from becoming expensive noise. And yes, in some cases, you may be able to have one agent provide the above prompt iteratively to subagents, providing a means to take bigger bites of the apple with less human interaction. Just be aware this can go off the rails and can also consume a LOT of tokens.
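If one agent is feeding this template to subagents, it helps to treat the prompt as a small data structure instead of hand-edited text. A minimal sketch whose field names simply mirror the template above:

```python
from dataclasses import dataclass, field

@dataclass
class ModularPrompt:
    goal: str
    constraints: list[str] = field(default_factory=list)
    relevant_files: list[str] = field(default_factory=list)
    asks: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the Goal / Constraints / Relevant files / Ask layout shown above."""
        return "\n\n".join([
            "Goal:\n- " + self.goal,
            "Constraints:\n" + "\n".join(f"- {c}" for c in self.constraints),
            "Relevant files:\n" + "\n".join(f"- {p}" for p in self.relevant_files),
            "Ask:\n" + "\n".join(f"{i}) {a}" for i, a in enumerate(self.asks, start=1)),
        ])

print(ModularPrompt(
    goal="Add feature X without breaking Y",
    constraints=["Must keep endpoint contracts", "Must not change database schema"],
    relevant_files=["FooController.cs", "FooService.cs", "FooRepository.cs", "FooServiceTests.cs"],
    asks=["Propose plan (phases/steps + risks)", "Provide minimal diff", "Provide tests", "Explain tradeoffs"],
).render())
```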
Help Out Future You: build for “bounded context” instead of “infinite context”
If you’re building LLM features into real systems, assume context windows will improve—but don’t bet your architecture on infinite growth.
Instead:
- Store data in systems that are queryable (search + metadata)
- Build a retrieval layer (possibly via MCP or similar) that can explain what it pulled and why
- Summarize aggressively with lossy but useful artifacts:
- Decisions
- Constraints
- Interfaces
- Invariants
- Treat “prompt assembly” like dependency injection:
- Select only what you need
- Avoid/minimize global state
- Keep things composable
- Remember to start new context windows for new tasks
- Often forgotten in IDE chat sessions!
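A concrete habit that falls out of all of this: treat the context window as a budget, not a bucket. Here’s a rough sketch using the common “roughly four characters per token” heuristic for English text; anything that matters should count tokens with the real tokenizer:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude estimate (~4 chars/token for English); use the real tokenizer when it matters."""
    return int(len(text) / chars_per_token)

def assemble_within_budget(sections: list[tuple[int, str]], max_tokens: int = 32_000) -> str:
    """Add sections in priority order (lower number = more important); stop at the budget."""
    kept, used = [], 0
    for _, section in sorted(sections, key=lambda s: s[0]):
        cost = estimate_tokens(section)
        if used + cost > max_tokens:
            break            # summarize or retrieve on demand instead of cramming it in
        kept.append(section)
        used += cost
    return "\n\n".join(kept)

context = assemble_within_budget([
    (0, "Goal and constraints..."),
    (1, "Interfaces and invariants..."),
    (2, "Relevant file excerpts..."),
    (3, "Full commit history..."),   # lowest priority: the first thing to get dropped
])
```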
You’ll get:
- Better answers
- Lower cost
- Less latency
- Fewer “model drowned in context” failures
- Fewer “summarizing conversation” messages (which indicate you’re seeing context compaction)
And you won’t be blocked when context growth hits its practical ceiling.
Bottom line
Context windows will keep growing in some form—but raw context length is already hitting diminishing returns relative to:
- quadratic attention costs
- training economics
- product latency/throughput constraints
- the human (and model) problem of relevance and organization
The future looks less like:
“We fit the whole world in the prompt”
…and more like:
“We give the model the right slice of the world, on demand.”
Which, honestly, is how we build good software anyway.
References
- Transformer (deep learning architecture) — Wikipedia
- On the Computational Complexity of Self-Attention (arXiv)
- GPT-5 in ChatGPT — OpenAI Help Center (context windows by tier/mode)
- GPT-5.2 model docs — OpenAI API (400K context)
- Claude context windows — Anthropic Claude API docs (Sonnet 4/4.5 1M beta)
- What’s new in Claude 4.5 — Anthropic Claude API docs (Opus 4.5 200K)
- Retrieval Augmented Generation (RAG)
- Hierarchical Summarization
- Various Memory Systems
- Your repo has secrets: indexing tells AI where they are
- Ralph Wiggum loop, et al
- Big Ball of Mud
- Backprop Explainer
- Context Compaction


