Every labeler on the other pages takes text. To label other modalities, change the input argument and the model. The schema and the output_schema pattern stay the same.
from typing import Literal
from agno.agent import Agent
from agno.media import Image
from agno.models.google import Gemini
from pydantic import BaseModel, Field
class Classification(BaseModel):
    label: Literal["dog", "cat", "bird", "fish", "other"] = Field(
        ..., description="What kind of animal is in the image"
    )

agent = Agent(
    model=Gemini(id="gemini-3.5-flash"),
    instructions="You classify images by animal type.",
    output_schema=Classification,
)
url = "https://upload.wikimedia.org/wikipedia/commons/4/4d/Cat_November_2010-1a.jpg"
result = agent.run("Classify this image.", images=[Image(url=url)]).content
# Classification(label='cat')
| Modality | Import | Argument | Model in the cookbook |
|---|---|---|---|
| Image | from agno.media import Image | images=[Image(url=...)] | Gemini(id="gemini-3.5-flash") |
| Audio | from agno.media import Audio | audio=[Audio(content=...)] | Gemini(id="gemini-3.5-flash") |
| Video | from agno.media import Video | videos=[Video(content=..., format="mp4")] | Gemini(id="gemini-3.5-flash") |
| PDF | from agno.media import File | files=[File(url=...)] | Gemini(id="gemini-3.5-flash") |
Image and File accept a url. Audio and Video take raw bytes via content; fetch them first.
import requests
from agno.media import Audio
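# Fetch the clip and fail early on an HTTP error instead of sending an error page as audio.
response = requests.get("https://example.com/clip.mp3", timeout=30)
response.raise_for_status()
audio_bytes = response.content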
agent.run("Transcribe this.", audio=[Audio(content=audio_bytes)])
Bounding boxes
For region detection, return normalized coordinates so the result is resolution-independent.
from pydantic import BaseModel, Field
class BoundingBox(BaseModel):
    label: str = Field(..., description="What the box contains")
    x: float = Field(..., ge=0.0, le=1.0, description="Top-left x in [0, 1]")
    y: float = Field(..., ge=0.0, le=1.0, description="Top-left y in [0, 1]")
    width: float = Field(..., ge=0.0, le=1.0, description="Width in [0, 1]")
    height: float = Field(..., ge=0.0, le=1.0, description="Height in [0, 1]")
The per-field description on x, y, width, and height is load-bearing. Without it, and without the [0, 1] convention spelled out in the instructions, models return degenerate boxes (all-zero or whole-image). Spell out the coordinate system in both places.
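Once a box comes back, you usually need pixel coordinates and a way to catch the degenerate cases described above. The sketch below is one possible post-processing step, not part of Agno or the cookbook. `NormalizedBox`, `to_pixels`, and `is_degenerate` are hypothetical names; `NormalizedBox` is a stdlib stand-in that mirrors the `x`, `y`, `width`, and `height` fields of `BoundingBox`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NormalizedBox:
    """Hypothetical stand-in for the BoundingBox fields; all values are fractions in [0, 1]."""
    x: float
    y: float
    width: float
    height: float


def to_pixels(box: NormalizedBox, image_width: int, image_height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) in pixels, clipped to the image bounds."""
    left = round(box.x * image_width)
    top = round(box.y * image_height)
    # A box that starts near the edge can overshoot 1.0, so clip before scaling.
    right = round(min(box.x + box.width, 1.0) * image_width)
    bottom = round(min(box.y + box.height, 1.0) * image_height)
    return left, top, right, bottom


def is_degenerate(box: NormalizedBox, eps: float = 1e-6) -> bool:
    """Flag zero-area boxes (which includes all-zero) and boxes that cover the whole image."""
    zero_area = box.width <= eps or box.height <= eps
    whole_image = (
        box.x <= eps
        and box.y <= eps
        and box.width >= 1.0 - eps
        and box.height >= 1.0 - eps
    )
    return zero_area or whole_image


box = NormalizedBox(x=0.25, y=0.5, width=0.5, height=0.25)
print(to_pixels(box, image_width=640, image_height=480))  # (160, 240, 480, 360)
print(is_degenerate(box))  # False
```

A whole-image box is sometimes a correct answer (a close-up, for example), so treat `is_degenerate` as a signal to review the result rather than a hard rejection.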
Transcription and diarization
Audio extraction covers transcription, speaker diarization, and timestamped segments. Each is a schema change, not a different API.
| Output | Schema shape |
|---|---|
| Flat transcript | { text: str } |
| Speaker turns | { turns: List[{ speaker, text }] } |
| Timestamped segments | { segments: List[{ start_seconds, end_seconds, text }] } |
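The three shapes in the table could be written as Pydantic models along these lines. This is a sketch: the field names come from the table, while the class names, the field types, and the descriptions are assumptions made for illustration.

```python
from typing import List

from pydantic import BaseModel, Field


class Transcript(BaseModel):
    """Flat transcript: { text }."""
    text: str = Field(..., description="Full transcript of the audio")


class Turn(BaseModel):
    speaker: str = Field(..., description="Speaker label, for example 'Speaker 1'")
    text: str = Field(..., description="What the speaker said in this turn")


class SpeakerTurns(BaseModel):
    """Speaker turns: { turns: [{ speaker, text }] }."""
    turns: List[Turn]


class Segment(BaseModel):
    start_seconds: float = Field(..., ge=0.0, description="Segment start, in seconds from the start of the audio")
    end_seconds: float = Field(..., ge=0.0, description="Segment end, in seconds from the start of the audio")
    text: str = Field(..., description="Words spoken during this segment")


class TimestampedTranscript(BaseModel):
    """Timestamped segments: { segments: [{ start_seconds, end_seconds, text }] }."""
    segments: List[Segment]


# Validate a hand-written payload to confirm the nesting matches the table.
sample = SpeakerTurns(turns=[{"speaker": "Speaker 1", "text": "Hello."}])
print(sample.turns[0].speaker)  # Speaker 1
```

Following the pattern used for images, any one of these would be passed as `output_schema` on the agent, with the audio supplied through the `audio` argument. As with bounding boxes, stating the unit (seconds) in the field descriptions is likely to matter.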
Model choice
gemini-3.5-flash handles text, image, audio, video, and PDF natively, so the cookbook uses it across every modality. Each cookbook README notes alternatives if you want to swap.
Next steps
| Task | Guide |
|---|---|
| Define the output schema | Structured extraction |
| Assign labels to media | Classification |
| Review media labels | Quality pipeline |