vLLM is a fast, easy-to-use library for LLM inference and serving, designed for high throughput and memory efficiency.
## Prerequisites

Install vLLM (`pip install vllm`), then start serving a model:
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --dtype float16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
```
This spins up the vLLM server with an OpenAI-compatible API. The `--enable-auto-tool-choice` and `--tool-call-parser` flags are only needed if you want tool calling (see Advanced Usage below).

By default the server listens at `http://localhost:8000`, with the OpenAI-compatible endpoints under `/v1`.
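If you want to confirm the server is reachable before wiring it into Agno, a minimal check with the official `openai` Python client (an extra dependency, not required by Agno) might look like this:

```python
from openai import OpenAI

# Point the official OpenAI client at the local vLLM server's /v1 endpoint.
# vLLM does not require an API key by default, but the client expects one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# List the served models; the id should match the model passed to `vllm serve`.
print([model.id for model in client.models.list().data])
```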
## Example

### Basic Agent
```python
from agno.agent import Agent
from agno.models.vllm import VLLM

agent = Agent(
    model=VLLM(
        # Must match the model passed to `vllm serve`
        id="Qwen/Qwen2.5-7B-Instruct",
        base_url="http://localhost:8000/v1",
    ),
    markdown=True,
)
agent.print_response("Share a 2 sentence horror story.")
```
## Advanced Usage

vLLM models work seamlessly with Agno tools:
```python
from agno.agent import Agent
from agno.models.vllm import VLLM
from agno.tools.duckduckgo import DuckDuckGoTools

agent = Agent(
    model=VLLM(id="Qwen/Qwen2.5-7B-Instruct"),
    tools=[DuckDuckGoTools()],
    markdown=True,
)
agent.print_response("What's the latest news about AI?")
```
For the full list of supported models, see the vLLM documentation.
## Params

| Parameter | Type | Default | Description |
|---|---|---|---|
| `id` | `str` | `"microsoft/DialoGPT-medium"` | The id of the model to use with vLLM |
| `name` | `str` | `"vLLM"` | The name of the model |
| `provider` | `str` | `"vLLM"` | The provider of the model |
| `api_key` | `Optional[str]` | `None` | The API key (usually not needed for local vLLM) |
| `base_url` | `str` | `"http://localhost:8000/v1"` | The base URL for the vLLM server |
`VLLM` is a subclass of the `Model` class and has access to the same params.
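For example, pointing an agent at a remote vLLM deployment only requires overriding the params above. This is a minimal sketch in which the URL and API key are placeholders for your own deployment:

```python
from agno.agent import Agent
from agno.models.vllm import VLLM

# Placeholder URL and key for a hypothetical remote vLLM deployment.
model = VLLM(
    id="Qwen/Qwen2.5-7B-Instruct",
    base_url="https://vllm.example.com/v1",  # OpenAI-compatible endpoint
    api_key="your-api-key",                  # only if your deployment enforces one
)

agent = Agent(model=model, markdown=True)
agent.print_response("Hello from a remote vLLM server.")
```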