Learn how to evaluate your Agno Agents and Teams across four dimensions: accuracy (correctness against an expected output), agent as judge (custom quality criteria), performance (runtime and memory), and reliability (tool calls and error handling).

Evaluation Dimensions

Accuracy

Measures the correctness of the Agent’s response against an expected output, using LLM-as-a-judge methodology.

Agent as Judge

Evaluate custom quality criteria using LLM-as-a-judge with scoring.

Performance

The runtime performance of the Agent, including latency and memory footprint.

Reliability

The reliability of the Agent’s response, including tool calls and error handling.

Quick Start

Here’s a simple example of running an accuracy evaluation:
quick_eval.py
from typing import Optional
from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval, AccuracyResult
from agno.models.openai import OpenAIResponses
from agno.tools.calculator import CalculatorTools

# Create an evaluation
evaluation = AccuracyEval(
    model=OpenAIResponses(id="gpt-5.2"),
    agent=Agent(model=OpenAIResponses(id="gpt-5.2"), tools=[CalculatorTools()]),
    input="What is 10*5 then to the power of 2? do it step by step",
    expected_output="2500",
    additional_guidelines="Agent output should include the steps and the final answer.",
)

# Run the evaluation
result: Optional[AccuracyResult] = evaluation.run(print_results=True)
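Because `run` returns an `Optional[AccuracyResult]`, guard before asserting on it. A small helper like the following can gate a CI run on the score; `avg_score` (the judge's mean score, assumed to be on a 0-10 scale) is an assumption about `AccuracyResult`'s fields, so check the Accuracy Evals guide for the actual attributes.

```python
def check_accuracy(result, threshold: float = 8.0) -> bool:
    """Return True when the eval ran and the mean judge score meets the bar.

    `result` is the Optional[AccuracyResult] returned by evaluation.run();
    `avg_score` is an assumed field name for the mean judge score (0-10).
    """
    return result is not None and result.avg_score >= threshold

# Usage after the Quick Start above:
# assert check_accuracy(result), "Accuracy eval below threshold"
```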

Best Practices

  • Start Simple: Begin with basic accuracy tests before progressing to complex performance and reliability evaluations
  • Use Multiple Test Cases: Don’t rely on a single test case; build comprehensive test suites that cover edge cases
  • Track Over Time: Monitor your eval metrics continuously as you iterate on your agents
  • Combine Dimensions: Evaluate across all four dimensions for a holistic view of agent quality

Guides

Dive deeper into each evaluation dimension:
  1. Accuracy Evals - Learn LLM-as-a-judge techniques and multiple test case strategies
  2. Agent as Judge Evals - Define custom quality criteria with flexible scoring strategies
  3. Performance Evals - Measure latency, memory usage, and compare different configurations
  4. Reliability Evals - Test tool calls, error handling, and rate limiting behavior