Cerebras Inference provides high-speed, low-latency AI model inference powered by Cerebras Wafer-Scale Engines and CS-3 systems. Agno integrates directly with the Cerebras Python SDK, allowing you to use state-of-the-art Llama models with a simple interface.

Prerequisites

To use Cerebras with Agno, you need to:
  1. Install the required packages:
    pip install cerebras-cloud-sdk
    
  2. Set your API key: The Cerebras SDK expects your API key to be available as an environment variable:
    export CEREBRAS_API_KEY=your_api_key_here
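If you prefer not to rely on the environment variable, the api_key parameter (documented under Parameters below) lets you pass the key directly when constructing the model. A minimal sketch, using a placeholder key:

from agno.models.cerebras import Cerebras

# Pass the key explicitly; "your_api_key_here" is a placeholder.
# If api_key is omitted, the CEREBRAS_API_KEY environment variable is used.
model = Cerebras(
    id="llama-4-scout-17b-16e-instruct",
    api_key="your_api_key_here",
)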
    

Basic Usage

Here’s how to use a Cerebras model with Agno:
from agno.agent import Agent
from agno.models.cerebras import Cerebras

agent = Agent(
    model=Cerebras(id="llama-4-scout-17b-16e-instruct"),
    markdown=True,
)

# Print the response in the terminal
agent.print_response("write a two sentence horror story")
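If you want the output streamed to the terminal as it is generated, print_response also accepts a stream flag:

# Stream the response to the terminal token by token
agent.print_response("write a two sentence horror story", stream=True)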

Supported Models

Cerebras currently supports the following models (see docs for the latest list):
| Model Name | Model ID | Parameters | Knowledge Cutoff |
|---|---|---|---|
| Llama 4 Scout | llama-4-scout-17b-16e-instruct | 109 billion | August 2024 |
| Llama 3.1 8B | llama3.1-8b | 8 billion | March 2023 |
| Llama 3.3 70B | llama-3.3-70b | 70 billion | December 2023 |
| DeepSeek R1 Distill Llama 70B* | deepseek-r1-distill-llama-70b | 70 billion | December 2023 |
* DeepSeek R1 Distill Llama 70B is available in private preview.
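To use one of these models, pass its model ID to the id parameter. For example, switching the basic usage example to Llama 3.3 70B:

from agno.agent import Agent
from agno.models.cerebras import Cerebras

# Select Llama 3.3 70B by its model ID from the table above
agent = Agent(
    model=Cerebras(id="llama-3.3-70b"),
    markdown=True,
)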

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| id | str | "llama-4-scout-17b-16e-instruct" | The id of the Cerebras model to use |
| name | str | "Cerebras" | The name of the model |
| provider | str | "Cerebras" | The provider of the model |
| parallel_tool_calls | Optional[bool] | None | Whether to run tool calls in parallel (automatically set to False for llama-4-scout) |
| max_completion_tokens | Optional[int] | None | Maximum number of completion tokens to generate |
| repetition_penalty | Optional[float] | None | Penalty for repeating tokens (higher values reduce repetition) |
| temperature | Optional[float] | None | Controls randomness in the model's output (0.0 to 2.0) |
| top_p | Optional[float] | None | Controls diversity via nucleus sampling (0.0 to 1.0) |
| top_k | Optional[int] | None | Controls diversity via top-k sampling |
| extra_headers | Optional[Any] | None | Additional headers to include in requests |
| extra_query | Optional[Any] | None | Additional query parameters to include in requests |
| extra_body | Optional[Any] | None | Additional body parameters to include in requests |
| request_params | Optional[Dict[str, Any]] | None | Additional parameters to include in the request |
| api_key | Optional[str] | None | The API key for authenticating with Cerebras (defaults to the CEREBRAS_API_KEY env var) |
| base_url | Optional[Union[str, httpx.URL]] | None | The base URL for the Cerebras API |
| timeout | Optional[float] | None | Request timeout in seconds |
| max_retries | Optional[int] | None | Maximum number of retries for failed requests |
| default_headers | Optional[Any] | None | Default headers to include in all requests |
| default_query | Optional[Any] | None | Default query parameters to include in all requests |
| http_client | Optional[httpx.Client] | None | HTTP client instance for making requests |
| client_params | Optional[Dict[str, Any]] | None | Additional parameters for client configuration |
| client | Optional[CerebrasClient] | None | A pre-configured instance of the Cerebras client |
| async_client | Optional[AsyncCerebrasClient] | None | A pre-configured instance of the async Cerebras client |
Cerebras is a subclass of the Model class and has access to the same params.
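As a rough sketch, here is a model configured with a few of these parameters (the values are illustrative, not recommendations):

from agno.agent import Agent
from agno.models.cerebras import Cerebras

# Illustrative values only; any parameter left as None falls back to the API default
model = Cerebras(
    id="llama-4-scout-17b-16e-instruct",
    temperature=0.7,
    top_p=0.9,
    max_completion_tokens=1024,
)

agent = Agent(model=model, markdown=True)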

Structured Outputs

The Cerebras model supports structured outputs using JSON schema:
from agno.agent import Agent
from agno.models.cerebras import Cerebras
from pydantic import BaseModel
from typing import List

class MovieScript(BaseModel):
    setting: str
    characters: List[str]
    plot: str

agent = Agent(
    model=Cerebras(id="llama-4-scout-17b-16e-instruct"),
    response_model=MovieScript,
)
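Running the agent then yields the parsed object as the response content. A minimal sketch, assuming the usual Agent.run() interface and an illustrative prompt:

# run() returns a response object; with a response model set,
# its content is a MovieScript instance rather than raw text.
run = agent.run("Write a movie script outline set in an abandoned lighthouse.")
print(run.content)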

Resources

SDK Examples

  • View more examples here.