vLLM is a fast and easy-to-use library for LLM inference and serving, designed for high throughput and memory efficiency.

Prerequisites

Install vLLM and start serving a model:
install vLLM
pip install vllm
start vLLM server
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --dtype float16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
This starts the vLLM server with an OpenAI-compatible API. The --enable-auto-tool-choice and --tool-call-parser hermes flags let the served model emit OpenAI-style tool calls, which the tools example below relies on.
The default vLLM server URL is http://localhost:8000/.
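To confirm the server is reachable before wiring it into an agent, you can query the OpenAI-compatible /v1/models endpoint. A minimal sketch, assuming the server started above is running locally and the requests package is installed:

check_server.py
import requests

# List the models the local vLLM server is currently serving
response = requests.get("http://localhost:8000/v1/models")
response.raise_for_status()
for model in response.json()["data"]:
    print(model["id"])

With the serve command above, this should print Qwen/Qwen2.5-7B-Instruct.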

Example

Basic Agent
from agno.agent import Agent
from agno.models.vllm import VLLM

agent = Agent(
    model=VLLM(
        id="meta-llama/Llama-3.1-8B-Instruct",
        base_url="http://localhost:8000/",
    ),
    markdown=True
)

agent.print_response("Share a 2 sentence horror story.")
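For longer answers you can stream the response as it is generated rather than waiting for the full completion. A minimal variation of the example above, assuming the same local server:

basic_agent_stream.py
from agno.agent import Agent
from agno.models.vllm import VLLM

agent = Agent(
    model=VLLM(
        id="Qwen/Qwen2.5-7B-Instruct",
        base_url="http://localhost:8000/",
    ),
    markdown=True,
)

# Stream the response token by token instead of printing it all at once
agent.print_response("Share a 2 sentence horror story.", stream=True)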

Advanced Usage

With Tools

vLLM models work seamlessly with Agno tools:
with_tools.py
from agno.agent import Agent
from agno.models.vllm import VLLM
from agno.tools.duckduckgo import DuckDuckGoTools

agent = Agent(
    model=VLLM(id="Qwen/Qwen2.5-7B-Instruct"),
    tools=[DuckDuckGoTools()],
    markdown=True
)

agent.print_response("What's the latest news about AI?")
View more examples here.
For the full list of supported models, see the vLLM documentation.

Params

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| `id` | `str` | `"microsoft/DialoGPT-medium"` | The id of the model to use with vLLM |
| `name` | `str` | `"vLLM"` | The name of the model |
| `provider` | `str` | `"vLLM"` | The provider of the model |
| `api_key` | `Optional[str]` | `None` | The API key (usually not needed for local vLLM) |
| `base_url` | `str` | `"http://localhost:8000/v1"` | The base URL for the vLLM server |
VLLM is a subclass of the Model class and has access to the same params.
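For example, to point an agent at a vLLM server running on another machine (the hostname below is a placeholder), the params above can be passed explicitly:

remote_vllm.py
from agno.agent import Agent
from agno.models.vllm import VLLM

agent = Agent(
    model=VLLM(
        id="Qwen/Qwen2.5-7B-Instruct",        # must match the model the server is serving
        base_url="http://vllm-host:8000/v1",  # placeholder URL for a remote vLLM server
        api_key=None,                         # usually not needed for self-hosted vLLM
    ),
    markdown=True,
)

agent.print_response("Summarize what vLLM does in one sentence.")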