vLLM is a fast, easy-to-use library for LLM inference and serving, designed for high throughput and memory efficiency.
## Prerequisites

Install vLLM (`pip install vllm`), then start serving a model:
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --dtype float16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
```
This spins up the vLLM server with an OpenAI-compatible API. The `--enable-auto-tool-choice` and `--tool-call-parser` flags are only needed if you want tool calling (see Advanced Usage below).

By default the server listens at `http://localhost:8000`, with the OpenAI-compatible endpoints under `/v1`.
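If you want to confirm the server is reachable before wiring it into Agno, a minimal check with the official `openai` Python client (an extra dependency, not required by Agno) might look like this:

```python
from openai import OpenAI

# Point the official OpenAI client at the local vLLM server's /v1 endpoint.
# vLLM does not require an API key by default, but the client expects one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# List the served models; the id should match the model passed to `vllm serve`.
print([model.id for model in client.models.list().data])
```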
## Example

### Basic Agent
```python
from agno.agent import Agent
from agno.models.vllm import VLLM

agent = Agent(
    model=VLLM(
        # Must match the model passed to `vllm serve`
        id="Qwen/Qwen2.5-7B-Instruct",
        base_url="http://localhost:8000/v1",
    ),
    markdown=True,
)
agent.print_response("Share a 2 sentence horror story.")
```
## Advanced Usage

vLLM models work seamlessly with Agno tools:
```python
from agno.agent import Agent
from agno.models.vllm import VLLM
from agno.tools.duckduckgo import DuckDuckGoTools

agent = Agent(
    model=VLLM(id="Qwen/Qwen2.5-7B-Instruct"),
    tools=[DuckDuckGoTools()],
    markdown=True,
)
agent.print_response("What's the latest news about AI?")
```
For the full list of supported models, see the vLLM documentation.
## Params

| Parameter | Type | Default | Description |
|---|---|---|---|
| `id` | `str` | `"microsoft/DialoGPT-medium"` | The id of the model to use with vLLM |
| `name` | `str` | `"vLLM"` | The name of the model |
| `provider` | `str` | `"vLLM"` | The provider of the model |
| `api_key` | `Optional[str]` | `None` | The API key (usually not needed for local vLLM) |
| `base_url` | `str` | `"http://localhost:8000/v1"` | The base URL for the vLLM server |
`VLLM` is a subclass of the `Model` class and has access to the same params.
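For example, pointing an agent at a remote vLLM deployment only requires overriding the params above. This is a minimal sketch in which the URL and API key are placeholders for your own deployment:

```python
from agno.agent import Agent
from agno.models.vllm import VLLM

# Placeholder URL and key for a hypothetical remote vLLM deployment.
model = VLLM(
    id="Qwen/Qwen2.5-7B-Instruct",
    base_url="https://vllm.example.com/v1",  # OpenAI-compatible endpoint
    api_key="your-api-key",                  # only if your deployment enforces one
)

agent = Agent(model=model, markdown=True)
agent.print_response("Hello from a remote vLLM server.")
```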