Context Compression lets you manage your agent's context while it runs, helping the agent stay within its context window and avoid rate limits or degraded response quality. Think of it like a research assistant who reads lengthy reports and gives you the key bullet points instead of the full documents.

The Problem: Verbose Tool Results

Without compression, tools that return large responses quickly consume your context window:
Component        Cumulative Token Count    Notes
System Prompt    1,200 tokens
User Message     1,300 tokens
LLM Response     1,500 tokens
Tool Call 1      2,500 tokens
Tool Call 2      5,700 tokens              2,500 + 3,200 new
Tool Call 3      8,500 tokens              5,700 + 2,800 new
Tool Call 4      12,000 tokens             8,500 + 3,500 new
This quickly becomes expensive and hits context limits during complex workflows.

The Solution: Automatic Compression

Context compression summarizes tool results once a threshold of tool calls is reached:
Tool Call 1: 2,500 tokens
Tool Call 2: 5,700 tokens
Tool Call 3: 8,500 tokens
[Compression triggered]
Tool Call 4: 1,300 tokens (800 compressed + 500 new)
Benefits:
  • Dramatically reduced token costs
  • Stay within context window limits
  • Preserve critical facts and data
  • Fully automatic once enabled

How It Works

Context compression follows a simple pattern:
1. Enable Compression
   Set compress_tool_results=True on your agent or team. This comes with a default threshold of 3 tool calls. The system monitors tool call results as they come in.

2. Threshold Reached
   Once the threshold is reached, compression is triggered and each uncompressed tool call result is individually summarized.

3. Intelligent Summarization
   The compression model preserves key facts (numbers, dates, entities, URLs) while removing boilerplate, redundancy, and filler text.

4. The LLM Loop Continues
   The compressed tool results are used in subsequent LLM executions, reducing token usage and extending the life of your context window.
When using arun on an Agent or Team, compression is handled asynchronously and the uncompressed tool call results are summarized concurrently.
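For example, here is a minimal sketch of async execution. It assumes arun returns a run output object with a content attribute, as in recent Agno releases:
import asyncio

from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools.duckduckgo import DuckDuckGoTools

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    tools=[DuckDuckGoTools()],
    compress_tool_results=True,
)

async def main():
    # With arun, tool results past the threshold are summarized concurrently
    response = await agent.arun("Research each of the following topics: AI, Crypto, Web3, and Blockchain")
    print(response.content)

asyncio.run(main())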

Enable Compression

Turn on compress_tool_results=True to automatically compress tool results. This comes with a default threshold of 3 tool calls. For example:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools.duckduckgo import DuckDuckGoTools

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    tools=[DuckDuckGoTools()],
    compress_tool_results=True,
)

agent.print_response("Research each of the following topics: AI, Crypto, Web3, and Blockchain")
You can also enable compress_tool_results=True on individual team members to compress their tool results independently.
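For instance, here is a minimal sketch of per-member compression, assuming the Team class from agno.team with a members parameter. Only the researcher member's tool results are compressed:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.team import Team
from agno.tools.duckduckgo import DuckDuckGoTools

# Only this member's tool results are compressed
researcher = Agent(
    name="Researcher",
    model=OpenAIChat(id="gpt-4o"),
    tools=[DuckDuckGoTools()],
    compress_tool_results=True,
)

team = Team(
    model=OpenAIChat(id="gpt-4o"),
    members=[researcher],
)

team.print_response("Research recent developments in AI, Crypto, Web3, and Blockchain")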

Custom Compression

Provide a CompressionManager to customize the compression behavior:
from agno.agent import Agent
from agno.compression.manager import CompressionManager
from agno.models.openai import OpenAIChat
from agno.tools.duckduckgo import DuckDuckGoTools

compression_manager = CompressionManager(
    model=OpenAIChat(id="gpt-4o-mini"),  # Use a faster model for compression
    compress_tool_results_limit=2,  # Compress after 2 tool calls (default: 3)
    compress_tool_call_instructions="Your custom compression prompt here...",
)

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    tools=[DuckDuckGoTools()],
    compression_manager=compression_manager,
)

agent.print_response("Find recent funding rounds for AI startups")
Use a faster, cheaper model like gpt-4o-mini for compression to reduce latency and cost while using a more capable model as your Agent’s main model.

When to Use Context Compression

Perfect for:
  • Agents with tools that return verbose results (web search, APIs)
  • Multi-step workflows with many tool calls
  • Long-running sessions where context accumulates
  • Production systems where cost matters

Developer Resources