- smartscraper: Extract structured data using natural language prompts
- markdownify: Convert web pages to markdown format
- searchscraper: Search the web and extract information
- crawl: Crawl websites with structured data extraction
- scrape: Get raw HTML content from websites (NEW!)
  - Complete HTML source code
  - Raw content for further processing
  - HTML structure analysis
  - Content that needs to be parsed differently
Prerequisites
The following examples require the scrapegraph-py library (for example, `pip install -U scrapegraph-py`).
Set the SGAI_API_KEY environment variable to your ScrapeGraph API key.
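The lookup order described for the `api_key` parameter (explicit argument first, then `SGAI_API_KEY`) can be sketched as a small helper that fails early when no key is available. `resolve_sgai_api_key` is a hypothetical name used for illustration; it is not part of the toolkit.

```python
import os
from typing import Mapping, Optional


def resolve_sgai_api_key(
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the explicit key if given, else SGAI_API_KEY from the environment."""
    # `env` is injectable so the lookup can be exercised without touching os.environ.
    source = os.environ if env is None else env
    key = api_key or source.get("SGAI_API_KEY")
    if not key:
        raise RuntimeError("Set SGAI_API_KEY or pass api_key explicitly.")
    return key
```

Checking the key up front surfaces a missing credential before the agent makes its first tool call.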
Example
The agent in cookbook/tools/scrapegraph_tools.py extracts structured data from a website using the smartscraper tool.
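A minimal sketch of such an agent is shown here, not the cookbook file itself. It assumes the import paths `agno.agent.Agent` and `agno.tools.scrapegraph.ScrapeGraphTools`, and uses the `enable_smartscraper` flag from the Toolkit Params table; the prompt and URL are hypothetical.

```python
def build_smartscraper_agent():
    """Build an agent with only the smartscraper function enabled."""
    # Imported lazily so this module loads even when the packages are missing.
    # Assumed import paths; adjust them if your installed version differs.
    from agno.agent import Agent
    from agno.tools.scrapegraph import ScrapeGraphTools

    toolkit = ScrapeGraphTools(enable_smartscraper=True)
    return Agent(tools=[toolkit], markdown=True)


# Hypothetical prompt and URL, for illustration only.
SMARTSCRAPER_PROMPT = (
    "Use smartscraper on https://example.com/ and extract "
    "the page title and every link with its anchor text."
)

# Example call (needs agno, scrapegraph-py, SGAI_API_KEY, and a model provider key):
# build_smartscraper_agent().print_response(SMARTSCRAPER_PROMPT)
```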
Raw HTML Scraping
Get complete HTML content from websites for custom processing (see cookbook/tools/scrapegraph_tools.py).
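As a rough illustration of "custom processing", the sketch below pairs an agent that has only the scrape function enabled with a standard-library tag counter for structure analysis. The import paths are assumed as in the agno package layout, and `TagCounter`, `summarize_html`, and `build_scrape_agent` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags to get a rough picture of a page's structure."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def summarize_html(html: str) -> Counter:
    """Return how many times each tag opens in the given HTML string."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def build_scrape_agent():
    """Build an agent restricted to the raw-HTML scrape function."""
    # Assumed import paths; imported lazily so the helpers above work standalone.
    from agno.agent import Agent
    from agno.tools.scrapegraph import ScrapeGraphTools

    toolkit = ScrapeGraphTools(enable_smartscraper=False, enable_scrape=True)
    return Agent(tools=[toolkit], markdown=True)


# Local demonstration on a literal HTML string (no network access needed).
sample = "<html><body><h1>Title</h1><p>a</p><p>b</p></body></html>"
print(summarize_html(sample).most_common(3))
```

The parsing step is deliberately separate from the agent, so the same helper can be applied to any HTML string the scrape function returns.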
All Functions with JavaScript Rendering
Enable all ScrapeGraph functions with heavy JavaScript support (see cookbook/tools/scrapegraph_tools.py).
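One way this configuration might look, using the `all` and `render_heavy_js` parameters listed under Toolkit Params. The import paths are assumptions, and `FULL_JS_OPTIONS` and `build_full_js_agent` are hypothetical names.

```python
# Keyword arguments taken from the Toolkit Params table.
FULL_JS_OPTIONS = {
    "all": True,              # enable every available function
    "render_heavy_js": True,  # render JavaScript-heavy pages (SPAs, dynamic content)
}


def build_full_js_agent():
    """Build an agent with every ScrapeGraph function and heavy JS rendering."""
    # Assumed import paths; imported lazily so the options dict is usable alone.
    from agno.agent import Agent
    from agno.tools.scrapegraph import ScrapeGraphTools

    return Agent(tools=[ScrapeGraphTools(**FULL_JS_OPTIONS)], markdown=True)


# Example call (needs agno, scrapegraph-py, SGAI_API_KEY, and a model provider key):
# build_full_js_agent().print_response("Scrape https://example.com/ and summarize it.")
```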
Toolkit Params
| Parameter | Type | Default | Description |
|---|---|---|---|
| `api_key` | `Optional[str]` | `None` | ScrapeGraph API key. If not provided, uses SGAI_API_KEY environment variable. |
| `enable_smartscraper` | `bool` | `True` | Enable the smartscraper function for LLM-powered data extraction. |
| `enable_markdownify` | `bool` | `False` | Enable the markdownify function for webpage to markdown conversion. |
| `enable_crawl` | `bool` | `False` | Enable the crawl function for website crawling and data extraction. |
| `enable_searchscraper` | `bool` | `False` | Enable the searchscraper function for web search and information extraction. |
| `enable_agentic_crawler` | `bool` | `False` | Enable the agentic_crawler function for automated browser actions and AI extraction. |
| `enable_scrape` | `bool` | `False` | Enable the scrape function for retrieving raw HTML content from websites. |
| `render_heavy_js` | `bool` | `False` | Enable heavy JavaScript rendering for all scraping functions. Useful for SPAs and dynamic content. |
| `all` | `bool` | `False` | Enable all available functions. When True, all enable flags are ignored. |
Toolkit Functions
| Function | Description |
|---|---|
| `smartscraper` | Extract structured data from a webpage using LLM and natural language prompt. Parameters: url (str), prompt (str). |
| `markdownify` | Convert a webpage to markdown format. Parameters: url (str). |
| `crawl` | Crawl a website and extract structured data. Parameters: url (str), prompt (str), data_schema (dict), cache_website (bool), depth (int), max_pages (int), same_domain_only (bool), batch_size (int). |
| `searchscraper` | Search the web and extract information. Parameters: user_prompt (str). |
| `agentic_crawler` | Perform automated browser actions with optional AI extraction. Parameters: url (str), steps (List[str]), use_session (bool), user_prompt (Optional[str]), output_schema (Optional[dict]), ai_extraction (bool). |
| `scrape` | Get raw HTML content from a website. Useful for complete source code retrieval and custom processing. Parameters: website_url (str), headers (Optional[dict]). |
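To make the longer crawl parameter list concrete, the sketch below assembles keyword arguments named after the parameters in the table. It assumes `data_schema` accepts a JSON-Schema-style dict; the schema, the values, and the helper name `build_crawl_arguments` are hypothetical and are not documented defaults.

```python
from typing import Any, Dict


def build_crawl_arguments(url: str, prompt: str) -> Dict[str, Any]:
    """Assemble keyword arguments named after the crawl parameters in the table."""
    # Hypothetical schema: assumes data_schema takes a JSON-Schema-style dict.
    data_schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "summary": {"type": "string"},
        },
        "required": ["title"],
    }
    # Illustrative values only; they are not the library's documented defaults.
    return {
        "url": url,
        "prompt": prompt,
        "data_schema": data_schema,
        "cache_website": True,
        "depth": 2,
        "max_pages": 5,
        "same_domain_only": True,
        "batch_size": 1,
    }


args = build_crawl_arguments(
    "https://example.com", "Extract each page's title and summary."
)
print(sorted(args))

# Possible direct call, assuming toolkit functions are exposed as methods:
# from agno.tools.scrapegraph import ScrapeGraphTools
# result = ScrapeGraphTools(enable_crawl=True).crawl(**args)
```

In normal use the agent fills in these arguments itself from the prompt; building them by hand is mainly useful for checking a schema before handing it to a crawl.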