Readers - Agno

Readers transform raw content into Document objects that can be chunked, embedded, and stored in your knowledge base. Each reader handles a specific format (PDF, CSV, Markdown, etc.) and extracts text and metadata.

from agno.knowledge.reader.pdf_reader import PDFReader

reader = PDFReader(chunk=True, chunk_size=5000)
documents = reader.read("company_handbook.pdf")

How Readers Work

Parse: Read the raw content using format-specific logic
Extract: Pull out text and metadata (page numbers, authors, etc.)
Chunk: Split large content into smaller pieces (if enabled)
Return: Provide a list of Document objects ready for embedding

# Output structure
Document(
    content="The extracted text...",
    id="unique_id",
    name="document_name",
    meta_data={"page": 1, "source": "handbook.pdf"},
)
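
The four steps and the output structure above can be sketched in plain Python. This is an illustration only, not Agno's implementation: `SketchDocument` and `read_text` are hypothetical names, and the fixed-size character split is an assumed stand-in for whatever chunking a real reader applies.

```python
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a Document; not Agno's class."""
    content: str
    id: str
    name: str
    meta_data: dict = field(default_factory=dict)


def read_text(raw: str, name: str, chunk: bool = True, chunk_size: int = 5000) -> list[SketchDocument]:
    # Parse + extract: a real reader would apply format-specific logic here.
    text = raw.strip()
    if not text:
        return []  # nothing usable was extracted

    # Chunk: split into fixed-size character windows if enabled (assumed strategy).
    if chunk:
        pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    else:
        pieces = [text]

    # Return: one document per piece, with metadata recording where it came from.
    return [
        SketchDocument(content=piece, id=f"{name}_{n}", name=name,
                       meta_data={"chunk": n, "source": name})
        for n, piece in enumerate(pieces, start=1)
    ]


docs = read_text("a" * 12, name="handbook", chunk_size=5)
print([d.content for d in docs])
```

With a 12-character input and `chunk_size=5`, the sketch produces three documents, the last one holding the two leftover characters.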

Supported Readers

Reader	Description
`PDFReader`	Extract text from PDF files
`DoclingReader`	Process multiple formats via Docling
`TextReader`	Plain text files
`MarkdownReader`	Markdown files
`CSVReader`	CSV files (rows become documents)
`FieldLabeledCSVReader`	CSV rows as field-labeled text
`JSONReader`	JSON files
`PPTXReader`	PowerPoint presentations
`ArxivReader`	Academic papers from arXiv
`WikipediaReader`	Wikipedia articles
`YouTubeReader`	YouTube transcripts
`WebsiteReader`	Crawl websites recursively
`WebSearchReader`	Web search results
`FirecrawlReader`	Web scraping via Firecrawl API
`LLMsTxtReader`	Read `llms.txt` files
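
To make the idea of "CSV rows as field-labeled text" concrete, here is a minimal sketch using only the standard library. The `field: value` line layout and the helper name `rows_as_labeled_text` are assumptions for illustration; the exact output format of `FieldLabeledCSVReader` is not specified on this page.

```python
import csv
import io


def rows_as_labeled_text(csv_text: str) -> list[str]:
    """Turn each CSV row into a block of 'field: value' lines (assumed layout)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["\n".join(f"{key}: {value}" for key, value in row.items()) for row in reader]


sample = "name,role\nAda,Engineer\nLin,Designer\n"
labeled = rows_as_labeled_text(sample)
print(labeled[0])
```

Labeling each value with its column name keeps a row readable on its own once it has been separated from the header.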

Using Readers with Knowledge

Pass a reader to knowledge.insert() to override automatic format detection:

from agno.knowledge.knowledge import Knowledge
from agno.knowledge.reader.pdf_reader import PDFReader

# vector_db is assumed to be an already configured vector database instance
knowledge = Knowledge(vector_db=vector_db)

# Use custom reader configuration
reader = PDFReader(chunk_size=3000, split_on_pages=True)
knowledge.insert(path="documents/", reader=reader)

Auto-Selection

Agno automatically selects the right reader based on file extension or URL:

from agno.knowledge.reader.reader_factory import ReaderFactory

# By file extension
reader = ReaderFactory.get_reader_for_extension(".pdf")  # PDFReader
reader = ReaderFactory.get_reader_for_extension(".csv")  # CSVReader

# By URL
reader = ReaderFactory.get_reader_for_url("https://youtube.com/watch?v=...")  # YouTubeReader

When using knowledge.insert(), this happens automatically.
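
The selection logic can be pictured as a lookup keyed on hostname or file extension. The sketch below is not how `ReaderFactory` is implemented: `pick_reader`, the two lookup tables, and the fallback choices are all hypothetical, and it returns reader names as strings rather than reader instances.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical lookup tables. The reader names come from the table of
# supported readers; the mappings themselves are assumptions.
EXTENSION_READERS = {
    ".pdf": "PDFReader",
    ".csv": "CSVReader",
    ".md": "MarkdownReader",
    ".json": "JSONReader",
}
HOST_READERS = {"youtube.com": "YouTubeReader", "arxiv.org": "ArxivReader"}


def pick_reader(source: str, default: str = "TextReader") -> str:
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # URLs: try the hostname first, then the extension of the URL path.
        host = (parsed.hostname or "").removeprefix("www.")
        if host in HOST_READERS:
            return HOST_READERS[host]
        ext = PurePosixPath(parsed.path).suffix.lower()
        return EXTENSION_READERS.get(ext, "WebsiteReader")
    # Local paths: dispatch on the (case-insensitive) file extension.
    ext = PurePosixPath(source).suffix.lower()
    return EXTENSION_READERS.get(ext, default)


print(pick_reader("report.PDF"))
print(pick_reader("https://www.youtube.com/watch?v=abc"))
```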

Configuration

Chunking

from agno.knowledge.reader.pdf_reader import PDFReader

reader = PDFReader(
    chunk=True,           # Enable chunking (default: True)
    chunk_size=5000,      # Characters per chunk
)
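
Because `chunk_size` is measured in characters, a rough chunk count follows from the length of the extracted text. The helper below is a back-of-the-envelope estimate that assumes plain, non-overlapping windows; `estimate_chunks` is a hypothetical name, and strategies that overlap or split on semantic boundaries will give different counts.

```python
import math


def estimate_chunks(num_chars: int, chunk_size: int = 5000) -> int:
    """Estimate how many non-overlapping chunks a text of num_chars produces."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return math.ceil(num_chars / chunk_size)


# A 12,000-character document with the chunk size used above.
print(estimate_chunks(12_000, chunk_size=5000))
```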

Format-Specific Options

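# Module paths for CSVReader and TextReader are assumed to mirror pdf_reader
from agno.knowledge.reader.csv_reader import CSVReader
from agno.knowledge.reader.pdf_reader import PDFReader
from agno.knowledge.reader.text_reader import TextReader

# PDF with encryption and OCR
reader = PDFReader(
    password="secret",
    read_images=True,     # OCR for images
    split_on_pages=True,  # One document per page
)

# CSV with custom encoding
reader = CSVReader(
    encoding="latin-1",
)

# Text with encoding override
reader = TextReader(
    encoding="utf-8",
)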

Runtime Options

Override settings when calling read():

documents = reader.read(
    "file.pdf",
    name="custom_document_name",  # Override default naming
    password="runtime_password",  # Password at read time
)

Async Processing

All readers provide async_read(), so I/O-bound reads can run concurrently instead of one after another:

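import asyncio

async def main():
    # Single file
    documents = await reader.async_read("file.pdf")

    # Batch processing: files is a list of paths;
    # gather returns one list of documents per file, in input order
    tasks = [reader.async_read(file) for file in files]
    all_documents = await asyncio.gather(*tasks)

asyncio.run(main())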
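
When a batch is large, it can help to cap how many reads run at once and to flatten the per-file results into a single list. The sketch below shows one way to do that with `asyncio.Semaphore`. `FakeReader` is a hypothetical stand-in that only mimics the `async_read` call shape, and `read_all` and `max_concurrency` are names introduced for this example.

```python
import asyncio


class FakeReader:
    """Hypothetical stand-in exposing an async_read coroutine."""

    async def async_read(self, path: str) -> list[str]:
        await asyncio.sleep(0)  # simulate I/O
        # Mimic the documented failure mode: an empty list for unreadable input.
        return [] if path.endswith(".bad") else [f"{path}#1", f"{path}#2"]


async def read_all(reader, files, max_concurrency: int = 4) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def read_one(path: str) -> list[str]:
        async with semaphore:  # at most max_concurrency reads in flight
            return await reader.async_read(path)

    # gather preserves input order and yields one list per file.
    per_file = await asyncio.gather(*(read_one(f) for f in files))
    # Flatten the list of lists into a single list of documents.
    return [doc for docs in per_file for doc in docs]


all_documents = asyncio.run(read_all(FakeReader(), ["a.pdf", "b.bad", "c.pdf"]))
print(all_documents)
```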

Custom Chunking Strategy

Override the default chunking behavior:

from agno.knowledge.chunking.semantic_chunking import SemanticChunking

reader = PDFReader(
    chunk=True,
    chunking_strategy=SemanticChunking(),
)

See Chunking for available strategies.

Restricting URL Fetches

By default, a URL-fetching reader will fetch any URL passed to it. Use allowed_hosts to restrict the reader to a fixed hostname allowlist. URLs outside the list are skipped and return no documents. Matching is case-insensitive and applies to the whole hostname, so list every subdomain you want to permit.

from agno.knowledge.reader.website_reader import WebsiteReader
reader = WebsiteReader(allowed_hosts=["docs.agno.com"])

WebsiteReader, WebSearchReader, and LLMsTxtReader also re-check the allowlist on each redirect, so an allowed host can’t redirect to a blocked one. FirecrawlReader and DoclingReader validate the initial URL only.
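
The matching rule described above (case-insensitive, whole hostname, re-checked on every redirect) can be sketched with the standard library. This is an illustration of the rule, not Agno's code: `is_allowed` and `chain_is_allowed` are hypothetical helpers, and ignoring the port is an assumption of this sketch.

```python
from urllib.parse import urlparse


def is_allowed(url: str, allowed_hosts: list[str]) -> bool:
    """Return True only if the URL's full hostname is on the allowlist."""
    host = urlparse(url).hostname  # already lowercased; port is dropped
    if host is None:
        return False
    # Exact match on the whole hostname: subdomains are not implied.
    return host in {h.lower() for h in allowed_hosts}


def chain_is_allowed(urls: list[str], allowed_hosts: list[str]) -> bool:
    """Check every hop of a redirect chain, not only the first URL."""
    return all(is_allowed(u, allowed_hosts) for u in urls)


allowed = ["docs.agno.com"]
print(is_allowed("https://DOCS.agno.com/readers", allowed))
print(is_allowed("https://api.docs.agno.com/", allowed))
print(chain_is_allowed(["https://docs.agno.com/a", "https://example.org/b"], allowed))
```

The second call shows why every permitted subdomain has to be listed explicitly, and the third shows an allowed host redirecting to a host outside the list.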

Error Handling

Readers return an empty list when processing fails. Check logs for debugging information:

documents = reader.read("corrupted.pdf")
if not documents:
    print("Failed to read file, check logs for details")
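
When reading many files, the same empty-list check can be used to separate successes from failures. The helper below is a sketch: `read_many` is a hypothetical name, and `fake_read` stands in for a reader's `read` method. An empty list can also mean the file had no extractable text, so the "failed" list is a prompt to check the logs, not a diagnosis.

```python
from typing import Callable


def read_many(read: Callable[[str], list], paths: list[str]) -> tuple[dict, list[str]]:
    """Split paths into those that produced documents and those that did not."""
    loaded: dict[str, list] = {}
    failed: list[str] = []
    for path in paths:
        documents = read(path)
        if documents:
            loaded[path] = documents
        else:
            failed.append(path)  # empty result: check logs for the cause
    return loaded, failed


def fake_read(path: str) -> list[str]:
    # Hypothetical reader: pretends files named "corrupted*" cannot be parsed.
    return [] if path.startswith("corrupted") else [f"text from {path}"]


loaded, failed = read_many(fake_read, ["handbook.pdf", "corrupted.pdf"])
print(f"{len(loaded)} read, {len(failed)} failed: {failed}")
```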

Next Steps

PDF Reader: Extract text from PDFs
Website Reader: Crawl and index websites
Chunking: Control how content is split
Vector DB: Store processed documents
