Learn how to evaluate your Agno Agents and Teams across four dimensions: accuracy (correctness against an expected output), agent as judge (custom quality criteria), performance (runtime and memory), and reliability (tool calls and error handling).

Evaluation Dimensions

Accuracy

Measures the correctness of the Agent’s response against an expected output, using LLM-as-a-judge methodology.

Agent as Judge

Evaluate custom quality criteria using LLM-as-a-judge with scoring.

Performance

The runtime performance of the Agent, including latency and memory footprint.

Reliability

The reliability of the Agent’s response, including tool calls and error handling.

Quick Start

Here’s a simple example of running an accuracy evaluation:
quick_eval.py
from typing import Optional
from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval, AccuracyResult
from agno.models.openai import OpenAIResponses
from agno.tools.calculator import CalculatorTools

# Create an evaluation
evaluation = AccuracyEval(
    model=OpenAIResponses(id="gpt-5.2"),
    agent=Agent(model=OpenAIResponses(id="gpt-5.2"), tools=[CalculatorTools()]),
    input="What is 10*5 then to the power of 2? do it step by step",
    expected_output="2500",
    additional_guidelines="Agent output should include the steps and the final answer.",
)

# Run the evaluation
result: Optional[AccuracyResult] = evaluation.run(print_results=True)
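Because `run` returns an `Optional[AccuracyResult]`, guard before asserting on it. A small helper like the following can gate a CI run on the score; `avg_score` (the judge's mean score, assumed to be on a 0-10 scale) is an assumption about `AccuracyResult`'s fields, so check the Accuracy Evals guide for the actual attributes.

```python
def check_accuracy(result, threshold: float = 8.0) -> bool:
    """Return True when the eval ran and the mean judge score meets the bar.

    `result` is the Optional[AccuracyResult] returned by evaluation.run();
    `avg_score` is an assumed field name for the mean judge score (0-10).
    """
    return result is not None and result.avg_score >= threshold

# Usage after the Quick Start above:
# assert check_accuracy(result), "Accuracy eval below threshold"
```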

Best Practices

  • Start Simple: Begin with basic accuracy tests before progressing to complex performance and reliability evaluations
  • Use Multiple Test Cases: Don’t rely on a single test case; build comprehensive test suites that cover edge cases
  • Track Over Time: Monitor your eval metrics continuously as you iterate on your agents
  • Combine Dimensions: Evaluate across all four dimensions for a holistic view of agent quality

Guides

Dive deeper into each evaluation dimension:
  1. Accuracy Evals - Learn LLM-as-a-judge techniques and multiple test case strategies
  2. Agent as Judge Evals - Define custom quality criteria with flexible scoring strategies
  3. Performance Evals - Measure latency, memory usage, and compare different configurations
  4. Reliability Evals - Test tool calls, error handling, and rate limiting behavior