November 3, 2025
AI Claude Development Intermediate

Understanding Context Windows: How to Work Within Token Limits

You're halfway through solving a complex problem. You've got five documents open, a long conversation history, and a detailed codebase context loaded. Then it hits you: "I'm running low on tokens." Sound familiar?

The context window is one of those invisible constraints that suddenly becomes very visible when you bump into it. You can't just keep dumping everything into the conversation and expect Claude to magically remember and synthesize it all. There's a hard limit to what fits in the window at once, and understanding how to work within that limit, rather than fighting against it, changes how effectively you can use Claude for complex work.

This guide walks you through context windows, token budgets, and the practical strategies that experienced Claude users rely on when they're juggling massive projects.

Table of Contents
  1. What Is a Context Window, Really?
  2. Why Context Limits Matter (Beyond the Technical Reason)
  3. Understanding Token Budgets
  4. The /compact Command: Your Context Cleanup Tool
  5. Retrieval-Augmented Generation (RAG): Bringing Data In On-Demand
  6. Prompt Caching: The Token Multiplier
  7. Long Conversations: When to Compact vs. When to Start Fresh
  8. Projects: Persistent Context Without Conversation Bloat
  9. Real-World Context Management Playbook
  10. The Hidden Performance Wins
  11. Decision Tree: What's Your Next Move?
  12. Summary: Working Within Your Limits

What Is a Context Window, Really?

Let's start with the basics. A context window is the total amount of text Claude can "see" at once. It's measured in tokens, which are roughly equivalent to 4 characters or about 0.75 words. Think of it as your working memory. Everything you feed Claude plus everything Claude generates has to fit within this window.
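If you want a feel for these numbers in code, a quick back-of-the-envelope estimator based on the ~4-characters-per-token rule looks like this. (This is a rough heuristic for English text only; for exact counts, the Anthropic API also exposes a token-counting endpoint on the Messages API.)

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for typical English text.
    For exact counts, use the Anthropic API's token-counting endpoint
    (client.messages.count_tokens)."""
    return max(1, len(text) // 4)

# A 400-word passage is roughly 2,000 characters, so about 500 tokens
sample = "word " * 400
print(estimate_tokens(sample))  # ≈ 500
```

Heuristics like this drift for code, non-English text, and heavy punctuation, but they're plenty for deciding whether a document will fit.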

Claude comes in several flavors:

  • Claude Sonnet 4: 200,000 token context (≈150,000 words or ~500 pages), with a 1,000,000 token window available in beta
  • Claude Opus 4.5: 200,000 token context
  • Claude Enterprise plan: 500,000 token context
  • Batch API: same context windows, with async processing at roughly half the price

Here's what that looks like in practical terms: 200,000 tokens is roughly the equivalent of a 500-page book. That sounds like a lot until you realize you're not just fitting one document in there. You're fitting:

  • Your entire conversation history
  • The system prompt
  • Multiple documents you're referencing
  • Code files
  • Configuration examples
  • Your current prompt

All competing for the same limited space.

Why Context Limits Matter (Beyond the Technical Reason)

Yes, there's a hard maximum. You literally cannot exceed your context window; the API returns an error if you try. But there's something more subtle happening: quality degrades as context gets longer.

This is the "lost in the middle" problem. Research shows that when you pack a context window to the brim, Claude's ability to retrieve information from the middle of the context suffers. The beginning and end stay sharp, but that important detail you buried in the third document? Increasingly likely to be overlooked.

Think about attention like a resource. When the context is lean and focused, Claude allocates full attention to understanding your problem. When the context is bloated, attention gets divided. You get responses that are technically within the bounds of what's possible, but not optimally thoughtful.

There's also a cost consideration. Every token you send to Claude costs money. Every token Claude generates costs money. Bloated contexts mean higher API bills, slower response times, and more latency for real-time interactions.

So the goal isn't just "fit everything in." The goal is "fit the right things in, in the right way."

Understanding Token Budgets

Before we talk about strategies, let's get concrete about your actual token budget.

Assuming you're working with a 200,000 token context window, here's how a typical interaction breaks down:

Total Context Window: 200,000 tokens

System Prompt:          ~1,500 tokens (fixed)
Conversation History:   ~50,000 tokens (grows with each turn)
Your Current Prompt:    ~2,000 tokens
Supporting Documents:   ~80,000 tokens
Code Samples:           ~30,000 tokens
Claude's Response:      ~15,000 tokens (variable, but budget it)
Buffer/Safety Margin:   ~21,500 tokens (never fill 100%)

= 200,000 tokens consumed

Notice the buffer? You never want to hit the hard ceiling. That last 10-15% should stay empty. When you get to 85-90% capacity, that's when you need to act.
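The budget table above can be tracked mechanically. Here's a small sketch; the component names and the 85% threshold follow the discussion in this section and are conventions, not anything the API enforces:

```python
CONTEXT_LIMIT = 200_000  # standard Claude context window

def context_usage(components: dict[str, int]) -> tuple[int, float]:
    """Return (tokens used, fraction of the window consumed)."""
    used = sum(components.values())
    return used, used / CONTEXT_LIMIT

# Token estimates mirroring the budget table above
budget = {
    "system_prompt": 1_500,
    "history": 50_000,
    "current_prompt": 2_000,
    "documents": 80_000,
    "code": 30_000,
    "response_budget": 15_000,
}

used, frac = context_usage(budget)
if frac > 0.85:
    print(f"At {frac:.0%} capacity: time to /compact or trim context")
else:
    print(f"{used:,} tokens used ({frac:.0%}), headroom remaining")
```

Note that this budget lands at about 89% of the window, which is exactly the zone where you should be acting, not adding more.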

The /compact Command: Your Context Cleanup Tool

When you're working in Claude Code, the /compact command is your best friend for context management. Here's what it does:

The command analyzes your conversation history and creates a condensed summary that captures the essential information while dumping the verbose back-and-forth. Instead of 40,000 tokens of conversation, you might compress it down to 8,000 tokens of pure facts and decisions.

When to use /compact:

  • You've had 15+ back-and-forth exchanges on the same topic
  • Your conversation history exceeds 40,000 tokens
  • You're moving to a new phase of work (finished design, starting implementation)
  • You notice Claude's responses getting shorter or less thoughtful
  • You're about to load a bunch of new context but don't want to start fresh

Here's what a /compact session looks like:

Human: /compact

Claude: Analyzing conversation history...

COMPRESSION SUMMARY
===================
Topic: API Design for User Authentication System
Key Decisions Made:
- Using JWT tokens with 24-hour expiration
- OAuth2 flow for third-party integrations
- Bcrypt for password hashing (12 rounds)
- Rate limiting: 100 requests/minute per IP

Open Questions:
- Database choice (PostgreSQL vs MongoDB)
- Refresh token rotation strategy

Files Referenced:
- /auth/models.ts
- /auth/middleware.ts

Original tokens: 47,200
Compressed tokens: 8,900
Space freed: 38,300 tokens (81% reduction)

The magic here is that you've freed up 38,000 tokens to use on new work. You can now load fresh context without hitting your ceiling.

Pro tip: Don't wait until you're at 95% capacity to compact. Run it preemptively when you hit 70-75%. You'll maintain better response quality and avoid the panic of running out of space mid-project.
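If you're orchestrating conversations through the API rather than Claude Code, you can approximate what /compact does yourself: ask the model to distill the history, then replace the old turns with the summary. A minimal sketch follows; the prompt wording and the summarizer-as-callable design are illustrative assumptions, not the actual /compact implementation:

```python
def build_transcript(messages: list[dict]) -> str:
    """Flatten conversation turns into a plain-text transcript."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

def compact_history(messages: list[dict], summarize) -> list[dict]:
    """Replace verbose turns with a single summary message.
    `summarize` is any callable taking a transcript and returning a
    summary string; in real use, a call to the Messages API."""
    summary = summarize(build_transcript(messages))
    return [{"role": "user",
             "content": "Summary of prior conversation:\n" + summary}]

def claude_summarizer(transcript: str) -> str:
    """A real summarizer: one API call asking for decisions and open
    questions, mirroring the compression summary shown above."""
    from anthropic import Anthropic  # lazy import; requires an API key
    resp = Anthropic().messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": "Condense this conversation into key decisions, "
                              "open questions, and files referenced:\n\n" + transcript}],
    )
    return resp.content[0].text
```

Usage: `new_history = compact_history(old_history, claude_summarizer)`. Separating the summarizer out also makes the compaction logic trivially testable with a stub.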

Retrieval-Augmented Generation (RAG): Bringing Data In On-Demand

RAG is a fancy term for a simple idea: instead of dumping all your data into the context window at once, you retrieve only the relevant pieces when you need them.

Here's the problem RAG solves: You have a 50,000-line codebase. You can't fit it all in context. So you'd normally have to cherry-pick files to include. But what if you miss the crucial function defined three files over? What if you don't know it exists in the first place?

RAG says: "Let's have a smart system that finds the relevant code/docs/data when you ask for it."

In practice, it works like this:

  1. Indexing phase: Your knowledge base (docs, code, API specs) gets broken into chunks and encoded into vectors
  2. Query phase: When you ask Claude a question, the system finds the most semantically similar chunks
  3. Retrieval phase: Only those relevant chunks get injected into Claude's context
  4. Generation phase: Claude answers your question using just what it needs

Example: You're asking Claude to optimize a database query. Instead of loading your entire schema, Claude retrieves:

  • The specific table definition
  • Two related indexes
  • Three historical queries on that table
  • One documentation page about your query optimizer

Total: 4,000 tokens instead of 40,000. Claude gets focused, relevant context.

Setting up basic RAG yourself:

If you're working with your own documents, you can DIY a simple version:

python
from anthropic import Anthropic
 
client = Anthropic()
 
# Your knowledge base (simplified)
knowledge_base = {
    "auth": "JWT tokens expire after 24 hours...",
    "database": "PostgreSQL version 14 running with...",
    "api": "REST endpoints return JSON with status codes..."
}
 
def retrieve_relevant_docs(query, kb):
    """Find docs matching query intent"""
    query_lower = query.lower()
    relevant = []
 
    for topic, content in kb.items():
        if any(word in query_lower for word in topic.split()):
            relevant.append(content)
 
    return "\n".join(relevant)
 
def answer_with_rag(user_query):
    """Answer using only relevant context"""
    relevant_context = retrieve_relevant_docs(user_query, knowledge_base)
 
    system_prompt = f"""You are a helpful assistant. Use only the provided context.
Context:
{relevant_context}"""
 
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=system_prompt,
        messages=[
            {"role": "user", "content": user_query}
        ]
    )
 
    return response.content[0].text
 
# Usage
answer = answer_with_rag("How do authentication tokens work?")
print(answer)

This is a toy example, but the pattern is real. Production RAG systems use vector databases like Pinecone or Weaviate, but the principle is identical: retrieve, then generate.
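To take one step from the keyword matching above toward the embedding-based retrieval those production systems use, the retrieve-then-generate shape stays the same; only the similarity function changes. Here's a toy sketch using bag-of-words cosine similarity as a stand-in for real embeddings (the `embed` function is deliberately crude; a production system would call an embedding model):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts.
    A real system would call an embedding model instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "JWT tokens expire after 24 hours and are signed with HS256",
    "PostgreSQL 14 is the primary datastore for user records",
    "Rate limiting allows 100 requests per minute per IP",
]
print(retrieve("how long do JWT tokens last?", chunks, k=1))
```

Swap `embed` for a real embedding call and `chunks` for a vector database query, and you have the skeleton of a production pipeline.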

Prompt Caching: The Token Multiplier

Here's one of Claude's best-kept secrets: prompt caching. The idea: mark the static parts of your prompt as cacheable, pay a one-time write cost, and then pay only a fraction of the price for those tokens on subsequent calls.

There are two caching tiers:

5-minute cache (the default)

  • Opt in by adding a cache_control marker to a block of your prompt
  • Minimum cacheable prefix: 1,024 tokens on Sonnet and Opus models
  • Cache writes cost about 25% more than normal input tokens
  • Cache reads cost about 10% of the normal rate, and each read refreshes the 5-minute lifetime

1-hour cache (extended TTL)

  • A longer-lived option for workloads with gaps between calls
  • Cache writes cost more (about double the base input rate)
  • Reads still cost about 10% of the standard rate

Here's where this gets powerful. Say you're building a chatbot for your product documentation (2,000 tokens of docs). Without caching, every API call pays full price for those 2,000 tokens. With caching, the second call onwards bills them at roughly 10% of the normal rate.

Scenario: 10 API calls, 2,000 tokens of static docs, ~100 new tokens per call

Without caching:
(2,000 + 100) × 10 = 21,000 input tokens at full price

With 5-minute cache:
Cache write (first call): 2,000 × 1.25 = 2,500 token-equivalents
Cache reads (9 calls):    2,000 × 0.10 × 9 = 1,800 token-equivalents
New tokens (all calls):   100 × 10 = 1,000
Total ≈ 5,300 token-equivalents
Cost reduction: ~75%

How to use caching in practice:

python
from anthropic import Anthropic
 
client = Anthropic()
 
# Your static context (docs, examples, guidelines)
static_context = """
Product Documentation:
- Feature A works by...
- Common pitfalls...
- API responses always include...
[... much more content ...]
"""
 
def chat_with_cached_context(conversation_history):
    """Reuse cached docs across multiple conversations"""
 
    system_blocks = [
        {
            "type": "text",
            "text": "You are a product support assistant."
        },
        {
            "type": "text",
            "text": static_context,
            "cache_control": {"type": "ephemeral"}
        }
    ]
 
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=system_blocks,
        messages=conversation_history
    )
 
    return response

The cache_control parameter tells the API: "Please cache this part of the system prompt." Calls made within the cache lifetime will reuse that context at roughly 10% of the normal input price.
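To confirm caching is actually kicking in, inspect the usage block on the API response: the Messages API reports cache_creation_input_tokens and cache_read_input_tokens when cache_control is in play. A small helper to interpret it (the helper itself is illustrative; the field names are the ones the API returns):

```python
def describe_cache_usage(usage) -> str:
    """Interpret the usage block of a Messages API response.
    cache_read_input_tokens > 0 means the cached prefix was reused;
    cache_creation_input_tokens > 0 means this call wrote the cache."""
    if getattr(usage, "cache_read_input_tokens", 0):
        return (f"cache hit: {usage.cache_read_input_tokens} tokens "
                f"read at ~10% of normal cost")
    if getattr(usage, "cache_creation_input_tokens", 0):
        return (f"cache write: {usage.cache_creation_input_tokens} "
                f"tokens stored for reuse")
    return "no caching in effect for this call"

# In real use, after calling chat_with_cached_context:
# print(describe_cache_usage(response.usage))
```

Watching these two numbers during development catches the common mistake of changing the "static" prefix between calls, which silently invalidates the cache.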

Long Conversations: When to Compact vs. When to Start Fresh

Here's the decision tree that experienced Claude users follow:

Use /compact if:

  • You're still working on the same problem
  • You need to preserve key decisions and context
  • You've made real progress but your context is getting bloated
  • You want to keep the conversation history but reduce its size
  • You're transitioning between major phases (design → code)

Start a fresh conversation if:

  • You're moving to an entirely different topic
  • The old context is becoming a distraction
  • You need a clean mental model without accumulated history
  • You're hitting diminishing returns with compact
  • More than 2 hours have passed since you last worked on it

Here's the nuance: starting fresh feels clean, but you lose the context of what you've already decided. /compact gives you the best of both. You keep your decisions but free up tokens.

Pro move: Keep a working document (or use Claude Projects) where you track key decisions, architecture diagrams, and open questions. This becomes your external memory. When you start fresh or compact, this document is your lifeline.

Projects: Persistent Context Without Conversation Bloat

If you're using Claude.ai with Projects, you've got another option: persistent reference material that lives with the project instead of inside any single conversation.

A Project is like a folder that "remembers" your work across conversations. You can:

  • Upload reference documents once
  • They stay available to future conversations
  • Each conversation starts fresh, but with project-level context always present
  • No conversation history bloat

This is perfect for:

  • Long-running development efforts
  • Building against a large codebase
  • Ongoing research projects
  • Client work where you need consistent context across weeks

Example Project structure for an API development effort:

ProjectName: E-commerce API v2
├── Specification (pinned)
│   ├── API_spec_v2.0.md
│   ├── Data_models.json
│   └── Authentication_flow.md
├── Reference Code
│   ├── Legacy_v1_implementation/
│   └── Third_party_integrations/
├── Documentation
│   ├── Setup_guide.md
│   └── Deployment_checklist.md

Now when you start a new conversation in this project, all these documents are available to Claude without you re-pasting them each time. You ask a question, Claude pulls what it needs from the project knowledge. Clean. Efficient.

Real-World Context Management Playbook

Let's put this all together with actual scenarios:

Scenario 1: Building a moderate-sized feature (5-10 files, 1-2 weeks)

Week 1:

  • Load architecture docs (5,000 tokens)
  • Load 3 key files (15,000 tokens)
  • Start implementing (conversation grows)
  • At day 3, compact when hitting 70% capacity
  • Continue development

Week 2:

  • Use Project to keep docs persistent
  • Start fresh conversation with Project context
  • Load 2 different files you now need
  • Reference decisions from previous week's documentation

Scenario 2: Iterating on a design (lots of back-and-forth, quick turns)

  • Use 5-minute prompt caching with your static design docs
  • Have multiple parallel conversations for different aspects
  • Use /compact more aggressively (every 20-30 exchanges)
  • Keep a decision log outside Claude to track what you've settled on

Scenario 3: Processing large documents (research, analysis)

  • Use RAG to index your documents
  • Ask specific questions that retrieve only relevant sections
  • Build a research summary as you go (external document)
  • Each conversation is fresh, pulling only what's needed

The Hidden Performance Wins

Here's what most people don't realize: optimizing your context isn't just about saving tokens. It's about better answers.

When Claude has less context to manage, it can think more deeply about each piece. Your questions get more thoughtful responses. Reasoning is clearer. Mistakes decrease.

There's a sweet spot for context size:

  • Too little: Claude lacks information to solve your problem
  • Too much: Claude's attention gets fragmented
  • Just right: Focused, relevant context that lets Claude shine

That sweet spot is usually 30-60% of your available context for most problems. You're leaving plenty of room for Claude's reasoning and output.

Decision Tree: What's Your Next Move?

flowchart TD
    Q1{"Context window<br/>at 75%+ capacity?"}
    Q2{"Still the same<br/>problem/project?"}
    Q3{"About to load<br/>new context?"}
    Q4{"Need conversation<br/>history?"}
 
    A1["Run /compact"]
    A2["Start fresh conversation"]
    A3["Compact first"]
    A4["Fresh conversation is fine"]
    A5["Keep going, you're fine"]
 
    Q1 -->|YES| Q2
    Q1 -->|NO| Q3
    Q2 -->|YES| A1
    Q2 -->|NO| A2
    Q3 -->|YES| Q4
    Q3 -->|NO| A5
    Q4 -->|YES| A3
    Q4 -->|NO| A4
 
    style Q1 fill:#f59e0b,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style Q2 fill:#ec4899,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style Q3 fill:#3b82f6,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style Q4 fill:#8b5cf6,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style A1 fill:#22c55e,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style A2 fill:#22c55e,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style A3 fill:#22c55e,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style A4 fill:#22c55e,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style A5 fill:#22c55e,stroke:#0f172a,stroke-width:2px,color:#0f172a
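If you prefer the same logic in code form, the flowchart reduces to a few conditionals. A sketch (the inputs mirror the four decision nodes; the 75% threshold and action names come straight from the diagram):

```python
def next_move(capacity: float, same_project: bool,
              loading_new_context: bool, need_history: bool) -> str:
    """Encode the decision tree above as a function.
    capacity is the fraction of the context window in use."""
    if capacity >= 0.75:
        # Q1 yes -> Q2: still the same problem/project?
        return "run /compact" if same_project else "start fresh conversation"
    if loading_new_context:
        # Q1 no, Q3 yes -> Q4: need the conversation history?
        return "compact first" if need_history else "fresh conversation is fine"
    return "keep going"

print(next_move(0.8, True, False, False))   # near capacity, same project
```

This is handy if you're scripting context management around the API and want the same heuristics applied automatically.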

Summary: Working Within Your Limits

Context windows aren't constraints to fight against. They're part of the design. Your job is to work with them:

  1. Know your budget: 200K tokens for standard Claude, but don't use all of it
  2. Monitor capacity: Compact at 70-75%, not at 90%
  3. Use RAG for big knowledge bases: Retrieve, don't load everything upfront
  4. Leverage caching: Free tokens are the best tokens
  5. Keep decisions external: Document what you decide outside Claude
  6. Use Projects for long-term work: Persistent context without conversation bloat
  7. Start fresh when context is no longer helping: Sometimes a clean slate is better

The experienced Claude users aren't the ones desperately trying to squeeze everything into context. They're the ones who know how to organize information so Claude always has exactly what it needs, no more, no less.

That's the difference between struggling with limits and working confidently within them.
