Document objects that can be embedded, chunked, and stored in vector databases.
What are Readers?
A Reader is a specialized component that knows how to parse and extract content from specific data sources or file formats. Think of readers as translators that convert different content formats into a standardized format that Agno can work with. Every piece of content that enters your knowledge base must pass through a reader first. The reader’s job is to:
- Parse the raw content from its original format
- Extract the meaningful text and metadata
- Structure the content into Document objects
- Apply chunking strategies to break large content into manageable pieces
How Readers Work
All readers inherit from the base Reader class and follow a consistent pattern.
The Reading Process
When a reader processes content, it follows these steps:
- Content Ingestion: The reader receives raw content (file, URL, text, etc.)
- Parsing: Extract text and metadata using format-specific logic
- Document Creation: Convert parsed content into Document objects
- Chunking: Apply chunking strategies to break content into smaller pieces
- Return: Provide a list of processed documents ready for embedding
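The steps above can be sketched in plain Python. This is a minimal illustration, not Agno's implementation: `SketchDocument`, `SketchReader`, `PlainTextSketchReader`, and every parameter name are hypothetical stand-ins, and the chunking step assumes simple fixed-size character windows.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a Document; not Agno's actual class."""
    content: str
    name: str = ""
    meta_data: Dict[str, Any] = field(default_factory=dict)


class SketchReader:
    """Hypothetical base class: subclasses supply only format-specific parsing."""

    def __init__(self, chunk: bool = True, chunk_size: int = 100) -> None:
        self.chunk = chunk
        self.chunk_size = chunk_size

    def parse(self, raw: bytes) -> str:
        raise NotImplementedError

    def read(self, raw: bytes, name: str) -> List[SketchDocument]:
        # 1. Content ingestion: the raw content arrives as bytes.
        # 2. Parsing: format-specific extraction, delegated to the subclass.
        text = self.parse(raw)
        # 3. Document creation: wrap the parsed text and basic metadata.
        doc = SketchDocument(content=text, name=name, meta_data={"size": len(text)})
        if not self.chunk:
            return [doc]
        # 4. Chunking: fixed-size character windows (an assumed strategy).
        pieces = [
            text[i:i + self.chunk_size]
            for i in range(0, len(text), self.chunk_size)
        ]
        # 5. Return: one document per chunk, ready for embedding.
        return [
            SketchDocument(content=piece, name=name, meta_data={"chunk": n})
            for n, piece in enumerate(pieces, start=1)
        ]


class PlainTextSketchReader(SketchReader):
    """Format-specific subclass: decodes UTF-8 text."""

    def parse(self, raw: bytes) -> str:
        return raw.decode("utf-8")


docs = PlainTextSketchReader(chunk_size=10).read(
    b"hello world, this is text", name="note.txt"
)
```

Keeping the format-specific work in `parse` means every subclass shares the same ingestion, chunking, and return logic, which is one way to get the consistent pattern described above.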
Content Types and Specialization
Each reader specializes in handling specific content types, which allows it to:
- Use format-specific parsing libraries
- Extract relevant metadata
- Handle format-specific challenges (encryption, encoding, etc.)
- Optimize processing for that content type
Reader Configuration
Readers are highly configurable to meet different processing needs.
Chunking Control
Content Processing Options
Encoding Control
For text-based readers, you can override the file encoding.
Metadata and Naming
The Document Output
Readers convert raw content into Document objects with a consistent structure.
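The exact fields are not listed in this section, so the following is only a rough sketch of what such a structure might look like. `ExampleDocument` and all of its field names (`content`, `id`, `name`, `meta_data`, `embedding`) are assumptions made for illustration, not Agno's confirmed definition.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ExampleDocument:
    """Hypothetical document shape; field names are assumptions."""
    content: str                                   # the extracted text
    id: Optional[str] = None                       # optional unique identifier
    name: Optional[str] = None                     # human-readable source name
    meta_data: Dict[str, Any] = field(default_factory=dict)
    embedding: Optional[List[float]] = None        # filled in later by an embedder


doc = ExampleDocument(
    content="Readers turn raw files into documents.",
    id="guide-1",
    name="guide.md",
    meta_data={"source": "guide.md", "chunk": 1},
)

# A plain dict is convenient for storing the document alongside its vector.
record = asdict(doc)
```

The point of a single shape like this is that embedders, chunkers, and vector databases can all work with reader output without knowing which format it came from.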
Chunking Integration
One of the most important features of readers is their integration with chunking strategies.
Automatic Chunking
When chunk=True, readers automatically apply chunking strategies to break large documents into smaller, more manageable pieces.
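To make the idea concrete, here is a minimal sketch of one possible strategy: fixed-size chunks with a small overlap between neighbours. The function `chunk_text` and its `chunk_size` and `overlap` parameters are hypothetical and are not taken from Agno; real strategies may split on sentences, sections, or semantic boundaries instead.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 20, overlap: int = 5) -> List[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far each new chunk advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=2)
# chunks == ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears intact in at least one chunk.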
Chunking Strategy Support
Different readers support different chunking strategies based on their content type.
Reader Factory and Auto-Selection
Agno provides intelligent reader selection through the ReaderFactory.
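The general idea behind such a factory can be sketched as a lookup from file extension to reader, with instances cached for reuse. Everything below is hypothetical: `SketchReaderFactory`, `for_path`, and the two toy reader classes only illustrate the pattern and do not reflect the real ReaderFactory interface.

```python
from pathlib import Path
from typing import Dict, Type


class TextSketchReader:
    """Toy reader used as the default for unrecognized extensions."""

    def read(self, path: str) -> str:
        return f"text reader handling {path}"


class CsvSketchReader:
    """Toy reader selected for .csv files."""

    def read(self, path: str) -> str:
        return f"csv reader handling {path}"


class SketchReaderFactory:
    """Hypothetical factory: picks a reader by extension and caches instances."""

    _reader_types: Dict[str, Type] = {
        ".txt": TextSketchReader,
        ".csv": CsvSketchReader,
    }

    def __init__(self) -> None:
        self._cache: Dict[Type, object] = {}

    def for_path(self, path: str):
        ext = Path(path).suffix.lower()
        # Unknown extensions fall back to the plain-text reader.
        reader_type = self._reader_types.get(ext, TextSketchReader)
        # Reuse one instance per reader type instead of rebuilding it per file.
        if reader_type not in self._cache:
            self._cache[reader_type] = reader_type()
        return self._cache[reader_type]


factory = SketchReaderFactory()
csv_reader = factory.for_path("sales.csv")
same_reader = factory.for_path("REPORT.CSV")
fallback = factory.for_path("notes.unknown")
```

Caching by reader type means a batch of many files of the same format shares a single reader instance.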
Supported Readers
The following readers are currently supported:

Reader Name | Description |
---|---|
ArxivReader | Fetches and processes academic papers from arXiv |
CSVReader | Parses CSV files and converts rows to documents |
FieldLabeledCSVReader | Converts CSV rows to field-labeled text documents |
FirecrawlReader | Uses Firecrawl API to scrape and crawl web content |
JSONReader | Processes JSON files and converts them into documents |
MarkdownReader | Reads and parses Markdown files |
PDFReader | Reads and extracts text from PDF files |
TextReader | Handles plain text files |
WebsiteReader | Crawls entire websites following links recursively |
WebSearchReader | Searches and reads web search results |
WikipediaReader | Searches and reads Wikipedia articles |
YouTubeReader | Extracts transcripts and metadata from YouTube videos |
Async Processing
All readers support asynchronous processing for better performance.
Usage in Knowledge
Readers integrate seamlessly with Agno Knowledge.
Best Practices
Choose the Right Reader
- Use specialized readers for better extraction quality
- Consider format-specific features (PDF encryption, CSV delimiters, etc.)
Configure Chunking Appropriately
- Smaller chunks for precise retrieval
- Larger chunks for maintaining context
- Use semantic chunking for structured documents
Optimize for Performance
- Use async readers for I/O-heavy operations
- Batch process multiple files when possible
- Cache readers through ReaderFactory when processing many files
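As a rough illustration of the async and batching advice, the sketch below reads several sources concurrently with `asyncio.gather`. The coroutines `read_one` and `read_many` are hypothetical, and `asyncio.sleep` stands in for real file or network I/O; no Agno API is used here.

```python
import asyncio
from typing import List


async def read_one(name: str, delay: float = 0.01) -> List[str]:
    """Pretend to read one source; the sleep stands in for I/O latency."""
    await asyncio.sleep(delay)
    return [f"{name}: chunk 1", f"{name}: chunk 2"]


async def read_many(names: List[str]) -> List[str]:
    """Read all sources concurrently and flatten the per-source results."""
    # gather returns results in the same order as the inputs.
    results = await asyncio.gather(*(read_one(name) for name in names))
    return [doc for docs in results for doc in docs]


all_docs = asyncio.run(read_many(["a.txt", "b.txt", "c.txt"]))
```

Because the waits overlap, the batch takes roughly as long as the slowest single read rather than the sum of all reads.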
Handle Errors Gracefully
- Readers return empty lists for failed processing
- Check reader logs for debugging information
- Provide fallback readers for unknown formats
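These points can be combined into a small fallback chain. The sketch assumes each reader is a plain function that returns an empty list on failure and logs the reason; `json_reader`, `text_reader`, and `read_with_fallback` are hypothetical names used only for illustration.

```python
import json
import logging
from typing import Callable, List

logger = logging.getLogger("reader_sketch")


def json_reader(raw: str) -> List[str]:
    """Return one document per JSON item, or an empty list if parsing fails."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        logger.warning("JSON parsing failed: %s", exc)
        return []
    items = data if isinstance(data, list) else [data]
    return [json.dumps(item) for item in items]


def text_reader(raw: str) -> List[str]:
    """Fallback reader: treat any non-blank input as a single document."""
    return [raw] if raw.strip() else []


def read_with_fallback(
    raw: str, readers: List[Callable[[str], List[str]]]
) -> List[str]:
    """Try each reader in order and return the first non-empty result."""
    for reader in readers:
        docs = reader(raw)
        if docs:
            return docs
    return []


parsed = read_with_fallback("[1, 2]", [json_reader, text_reader])
recovered = read_with_fallback("not json", [json_reader, text_reader])
```

Since a failed read yields an empty list rather than an exception, callers should check for an empty result explicitly before assuming content was ingested.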