The latest vLLM release in 2026 is v0.21.0
v0.21.0 was tagged on May 15, 2026 on the vllm-project/vllm GitHub releases page and pushed to PyPI the same day. It supersedes v0.20.2 (May 10, 2026). If you came here looking for the literal version number, that is it. The rest of this page covers what v0.21.0 actually carries, whether any of it matters to you, and the two-line block in the Fazm macOS source that makes every vLLM release invisible to the agent driving your Mac apps. This page is checked against the upstream release page on a schedule; the date in the badge above is the last check.
Direct answer, verified 2026-05-16
vLLM v0.21.0, released on May 15, 2026.
Sources I re-checked today: github.com/vllm-project/vllm/releases/tag/v0.21.0, pypi.org/project/vllm, and vllm.ai/releases. All three agree. The PyPI install is pip install vllm==0.21.0. The container is vllm/vllm-openai:v0.21.0.
1 day since release. 4 tagged releases in the prior 18 days. No API surface change. A maintenance and performance cut on the v0.20.x line, with the CUDA 13.0 wheel now the PyPI default.
What is actually inside v0.21.0
v0.21.0 reads as a maintenance-and-performance tag, not a feature release. There is no new model family and no API change. If you skim the changelog on the GitHub release page and group the commits, the meaningful changes fall into three buckets: packaging, supported runtimes, and continued DeepSeek V4 tuning. Two of the three only matter to the operator of the inference box; one of them, the CUDA 13.0 default wheel, can surprise anyone who runs a bare pip install vllm on a CUDA 12 host.
What v0.21.0 carries
- Packaging: the default CUDA wheel published to PyPI and the vllm/vllm-openai container image move to CUDA 13.0. A plain pip install vllm now pulls the CUDA 13 build, so a CUDA 12 host needs an explicit wheel or a rebuild.
- Runtimes: Python 3.14 is added to the supported Python version list. Older Python builds keep working; this widens the supported range at the top and does not raise the floor.
- Performance: more tuning on the DeepSeek V4 multi-stream pre-attention GEMM path that has been stabilizing across the whole v0.20.x line. GPU-side, server-only.
- Stability: maintenance fixes including a persistent-topk deadlock and an AOT compile cache import error. CUDA-host concerns, invisible to anything consuming the HTTP API.
- No API surface change: the OpenAI Chat Completions and Anthropic Messages endpoints behave exactly as they did on v0.20.x. This is the line that matters for a downstream agent.
Should you actually chase v0.21.0?
Most articles on this question say "yes, always pin the latest." That is bad advice for inference servers. The version you run is a function of three inputs: which models you serve, which host you serve them on, and what your downstream consumer is. For a Mac-agent operator, the downstream consumer is a desktop app talking to your server through an Anthropic-shaped shim. That shape narrows the upgrade question a lot.
Walk through the steps in order.
The upgrade decision in four steps
1. Are you on a CUDA 13 host?
If yes, v0.21.0 is a clean pull and the new default wheel matches your host. If you are still on CUDA 12, a bare pip install vllm now pulls a wheel you cannot run; you need an explicit CUDA 12 wheel or you stay on v0.19.x.
2. Do you need Python 3.14?
If your service or its dependencies require Python 3.14, v0.21.0 is the first tag that lists it as supported. If you are on 3.12 or 3.13, nothing changes for you here.
3. Are you serving DeepSeek V4?
If yes, the continued multi-stream GEMM tuning in v0.21.0 is the reason to pull it. If you do not run DeepSeek V4 at all, that work never touches your traffic.
4. Is your consumer a desktop agent through a shim?
If yes, the version behind the shim is nearly free to change. The shim speaks a stable HTTP contract and v0.21.0 did not break it, so the agent never notices the upgrade.
“Most Mac-agent operators answer 'no' to the first three of those questions and 'yes' to the last. That is the working answer: v0.21.0 is the latest version, and most consumers of vLLM should not chase it. Stay on whatever v0.20.x you already pinned unless you are blocked on CUDA 13 packaging.”
Verified against vLLM v0.21.0 release notes, May 15, 2026
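The four questions above can be sketched as a small function. This is only an illustration of the reasoning on this page, not part of vLLM or Fazm: the `Deployment` class, its field names, and both functions are hypothetical, and treating steps 2 and 3 as the only positive reasons to upgrade is my reading of the list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Deployment:
    """Hypothetical description of one vLLM deployment (illustrative names)."""
    cuda_major: Optional[int]   # host CUDA major version; None for a non-CUDA host
    needs_python_314: bool      # step 2
    serves_deepseek_v4: bool    # step 3
    consumer_behind_shim: bool  # step 4


def upgrade_notes(d: Deployment) -> List[str]:
    """Return one note per step, in the same order as the four questions."""
    notes = []
    if d.cuda_major is None:
        notes.append("1: no CUDA host, the default-wheel switch does not apply")
    elif d.cuda_major >= 13:
        notes.append("1: clean pull, the default wheel matches the host")
    else:
        notes.append("1: pin an explicit CUDA 12 wheel or stay on v0.19.x")
    notes.append(
        "2: v0.21.0 is the first tag that lists Python 3.14"
        if d.needs_python_314
        else "2: no Python 3.14 requirement, nothing changes here"
    )
    notes.append(
        "3: the DeepSeek V4 GEMM tuning is a reason to pull v0.21.0"
        if d.serves_deepseek_v4
        else "3: DeepSeek V4 work never touches this traffic"
    )
    notes.append(
        "4: the shim hides the server version from the agent"
        if d.consumer_behind_shim
        else "4: the consumer talks to vLLM directly, test before upgrading"
    )
    return notes


def has_reason_to_upgrade(d: Deployment) -> bool:
    """Assumption: only steps 2 and 3 count as positive reasons to move."""
    return d.needs_python_314 or d.serves_deepseek_v4


if __name__ == "__main__":
    typical = Deployment(cuda_major=13, needs_python_314=False,
                         serves_deepseek_v4=False, consumer_behind_shim=True)
    for note in upgrade_notes(typical):
        print(note)
    print("reason to upgrade:", has_reason_to_upgrade(typical))
```

For the typical Mac-agent operator in the example, `has_reason_to_upgrade` returns `False`, which matches the "most consumers should not chase it" conclusion.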
The anchor fact: two lines of Swift that do not care which vLLM you run
The thing every page about vLLM and Mac agents skips is what happens on the consumer side when you upgrade your server. The honest answer is: nothing should happen, and in Fazm nothing does, because the wiring between the Mac agent and your vLLM endpoint is two lines of Swift that read a UserDefaults string and stuff it into an environment variable. That block lives at lines 527 to 530 of Desktop/Sources/Chat/ACPBridge.swift. I just opened the file and pasted the surrounding context verbatim.
// Desktop/Sources/Chat/ACPBridge.swift, lines 527 to 530
// The two lines that make every vLLM release invisible to the Mac agent.
// Custom API endpoint (allows proxying through Copilot, corporate gateways, etc.)
if let customEndpoint = defaults.string(forKey: "customApiEndpoint"),
   !customEndpoint.isEmpty {
    env["ANTHROPIC_BASE_URL"] = customEndpoint
}

When you set Settings > AI Chat > Custom API Endpoint to, say, http://127.0.0.1:4000 (your LiteLLM shim in front of vLLM), Fazm's ACP bridge subprocess spawns with ANTHROPIC_BASE_URL set to that value. Every model call from the agent then routes to your shim, and from there to whatever vLLM version is running. Upgrading the server from v0.20.2 to v0.21.0 does not touch this code path. Upgrading from v0.19.x to v0.20.0 with the CUDA 13 jump did not either. The Chat Completions protocol is the contract; v0.21.0 did not change that contract.
That is the practical reason most Mac-agent users do not need to track vLLM releases. The decoupling is one shim process away.
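To make the decoupling concrete, here is a minimal Python paraphrase of that Swift block. It is not Fazm's code: `build_bridge_env` and the plain dictionaries standing in for UserDefaults and the subprocess environment are my own illustration. The only names taken from the source are the `customApiEndpoint` key and the `ANTHROPIC_BASE_URL` variable.

```python
from typing import Dict, Mapping


def build_bridge_env(defaults: Mapping[str, str],
                     base_env: Mapping[str, str]) -> Dict[str, str]:
    """Copy base_env and point ANTHROPIC_BASE_URL at the custom endpoint, if any.

    Mirrors the Swift logic: the override is applied only when the setting
    exists and is non-empty. Nothing here knows or cares which vLLM version
    is running behind the URL.
    """
    env = dict(base_env)
    custom_endpoint = defaults.get("customApiEndpoint", "")
    if custom_endpoint:
        env["ANTHROPIC_BASE_URL"] = custom_endpoint
    return env


if __name__ == "__main__":
    settings = {"customApiEndpoint": "http://127.0.0.1:4000"}
    print(build_bridge_env(settings, {"PATH": "/usr/bin"}))
```

The function takes no version argument at all, which is the whole point: a server upgrade changes nothing that this code can observe.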
Quick verification commands, if you want to check the version yourself
# PyPI: what does pip see as the latest vllm?
pip index versions vllm | head -1
# vllm (0.21.0)
# GitHub: what does the release machinery report?
curl -fsSL https://api.github.com/repos/vllm-project/vllm/releases/latest \
| jq -r '.tag_name'
# v0.21.0
# Docker Hub: what is the latest tagged container?
docker pull vllm/vllm-openai:latest
docker inspect vllm/vllm-openai:latest --format '{{json .RepoTags}}'
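# ["vllm/vllm-openai:latest","vllm/vllm-openai:v0.21.0"]
# (RepoTags lists only tags present locally; the second entry appears once you have also pulled :v0.21.0)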
# vllm.ai/releases: the human-readable index page
open https://vllm.ai/releases

Run any one of these on May 16, 2026 and you should see v0.21.0. If you see something later by the time you read this, the version above is stale; cross-check the vllm-project release page.
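If you script that staleness check, compare versions numerically rather than as strings. The sketch below assumes plain three-part tags such as v0.21.0; the helper names `parse_tag` and `page_is_stale` are hypothetical, and pre-release tags (for example a release candidate suffix) are deliberately rejected instead of guessed at.

```python
import re
from typing import Tuple

# Accepts "v0.21.0" or "0.21.0"; anything else is treated as unparseable.
_TAG = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag: str) -> Tuple[int, int, int]:
    """Turn a tag like 'v0.21.0' into a comparable tuple (0, 21, 0)."""
    match = _TAG.match(tag.strip())
    if match is None:
        raise ValueError(f"unsupported tag format: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def page_is_stale(page_version: str, upstream_tag: str) -> bool:
    """True when upstream reports a newer release than the one on this page."""
    return parse_tag(upstream_tag) > parse_tag(page_version)


if __name__ == "__main__":
    print(page_is_stale("v0.21.0", "v0.21.0"))
    print(page_is_stale("v0.21.0", "v0.21.1"))
```

Tuple comparison matters here because a string comparison would rank "0.9.0" above "0.21.0".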
What "latest" means on Apple Silicon, which is a separate question
Two things are commonly confused. Vanilla vLLM (the vllm-project/vllm repo) is CPU-only on macOS Apple Silicon through the requirements/cpu.txt build path, and that path tracks the main release cadence. v0.21.0 builds on M-series CPUs the same way v0.20.2 did, and the CUDA 13.0 wheel switch does not touch that path at all.
The community Apple Silicon options that give you actual GPU acceleration on a Mac are the vllm-project/vllm-metal plugin (Metal as the attention backend) and the waybarrios/vllm-mlx fork (MLX with an Anthropic and OpenAI compatible server out of the gate). Neither of those tracks vanilla vLLM patch-for-patch. They cherry-pick the changes that matter for their backend (FA4, MLA prefill, MoE routing) and skip server-infra and CUDA-packaging work. v0.21.0's headline change, the CUDA 13.0 default wheel, is not relevant on Apple Silicon at all.
So "the latest version of vLLM in 2026" on your Mac practically means one of three things: pip install vllm==0.21.0 for CPU-only, vllm-project/vllm-metal main for Metal-backed GPU, or waybarrios/vllm-mlx main for MLX-backed GPU. Pick based on what you are actually serving, not on which has the higher version number.
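The three options reduce to a lookup keyed on the backend you want. This is just the paragraph above restated as data; the function name `mac_install_target` and the backend keys are hypothetical labels of mine, and the values are the install targets named on this page.

```python
from typing import Dict

# Backend label -> what "latest" means for that backend, per this page.
MAC_TARGETS: Dict[str, str] = {
    "cpu": "pip install vllm==0.21.0",
    "metal": "vllm-project/vllm-metal main",
    "mlx": "waybarrios/vllm-mlx main",
}


def mac_install_target(backend: str) -> str:
    """Return the install target for a backend, or raise on an unknown one."""
    key = backend.strip().lower()
    if key not in MAC_TARGETS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {sorted(MAC_TARGETS)}")
    return MAC_TARGETS[key]


if __name__ == "__main__":
    for name in ("cpu", "metal", "mlx"):
        print(name, "->", mac_install_target(name))
```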
Want help wiring v0.21.0 (or any version) into a Mac agent?
Book 20 minutes. We'll walk through which version actually makes sense for your workload, the shim choice, and the one Settings field in Fazm that absorbs every vLLM release.
Frequently asked questions
What is the latest version of vLLM right now?
vLLM v0.21.0, tagged on May 15, 2026, available on PyPI and on the vllm-project/vllm GitHub releases page. It supersedes v0.20.2 (May 10, 2026), which followed v0.20.1 (May 3) and v0.20.0 (April 27). Four tagged releases in 18 days; the cadence is unusual but reflects the DeepSeek V4 stabilization push that has run through the entire v0.20.x line and into v0.21.0. Confirmed by re-checking github.com/vllm-project/vllm/releases and pypi.org/project/vllm on May 16, 2026.
What did v0.21.0 actually change?
It is a maintenance-and-performance cut, not a feature release. Three things stand out. (1) The default CUDA wheel published to PyPI and the vllm/vllm-openai container image move to CUDA 13.0, so a plain pip install vllm now pulls the CUDA 13 build. (2) Python 3.14 is added to the supported Python list. (3) The DeepSeek V4 path gets more performance tuning around the multi-stream pre-attention GEMM, plus stability fixes including a persistent-topk deadlock and an AOT compile cache import error. There is no change to the HTTP serving contract: the OpenAI Chat Completions and Anthropic Messages endpoints behave exactly as they did on v0.20.x.
If I run a Mac agent against my vLLM server, do I need to upgrade to v0.21.0?
Almost certainly not, unless you are blocked on CUDA 13 packaging or need Python 3.14. The v0.21.0 changes are CUDA-host and packaging concerns that matter to the operator of the inference box, not the consumer of its API. The reason this is true is that the Fazm app does not talk to vLLM directly. It talks to an Anthropic-shaped shim (LiteLLM, claude-code-router, or vllm-mlx for the all-in-one case) that talks to vLLM. The shim absorbs every server-side change as long as the Chat Completions protocol stays stable, which v0.21.0 did not break. The two-line block in ACPBridge.swift that points the bridge at the shim's URL has not had to change once across v0.18, v0.19, v0.20.0, v0.20.1, v0.20.2, and v0.21.0.
Why are there four vLLM releases in 18 days?
Because v0.20.0 (April 27) landed the DeepSeek V4 work as 'shipping with known constraints', and the line has been stabilizing ever since. v0.20.1 stabilized the multi-stream pre-attention GEMM path, v0.20.2 fixed a persistent-topk regression on Hopper, and v0.21.0 carries more GEMM tuning plus the CUDA 13.0 default-wheel switch and Python 3.14 support. None of those four releases changed the API surface or added a new model family. They are the kind of patch cadence you track if you operate the GPU box and skip if you only consume its endpoint.
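The "four releases in 18 days" figure is easy to check from the dates given in this FAQ. The snippet below only does the date arithmetic; the tags and dates are the ones stated on this page, and the variable and function names are mine.

```python
from datetime import date
from typing import List, Tuple

# Release dates as stated on this page (all 2026).
RELEASES: List[Tuple[str, date]] = [
    ("v0.20.0", date(2026, 4, 27)),
    ("v0.20.1", date(2026, 5, 3)),
    ("v0.20.2", date(2026, 5, 10)),
    ("v0.21.0", date(2026, 5, 15)),
]


def gaps_in_days(releases: List[Tuple[str, date]]) -> List[int]:
    """Days between each release and the one that followed it."""
    return [(later[1] - earlier[1]).days
            for earlier, later in zip(releases, releases[1:])]


def span_in_days(releases: List[Tuple[str, date]]) -> int:
    """Days from the first release in the list to the last."""
    return (releases[-1][1] - releases[0][1]).days


if __name__ == "__main__":
    print(gaps_in_days(RELEASES))   # [6, 7, 5]
    print(span_in_days(RELEASES))   # 18
```

The gaps of 6, 7, and 5 days add up to the 18-day span, roughly one tag per week.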
Is v0.21.0 available on PyPI, in Docker, and on the vllm-project release page?
Yes to all three. pip install vllm==0.21.0 resolves on PyPI as of May 15, 2026, and a plain pip install vllm now pulls the CUDA 13.0 wheel by default. The vllm/vllm-openai:v0.21.0 container image is published, and the image's default CUDA build also moved to 13.0. The GitHub release page at github.com/vllm-project/vllm/releases/tag/v0.21.0 has the source tag and the changelog. The vllm.ai/releases registry lists every version. None of the community Apple Silicon forks track this tag yet, as is normal for patch cadence.
If I am still on v0.19.x or earlier, what is the practical reason to move?
Two reasons. First, CVE-2026-0994 (a deserialization vulnerability in the prompt_embeds handling of the Completions API, affecting versions 0.10.2 and later) was patched in the v0.19.x cycle. If your server exposes Completions to anything you do not fully trust, that is a real upgrade trigger. Second, v0.20.0 raised the dependency floor to CUDA 13, PyTorch 2.11, and Transformers v5, and v0.21.0 then made the CUDA 13.0 wheel the PyPI default. If your host is still on CUDA 12, you cannot pull v0.20.x or v0.21.0 without rebuilding the wheel, and that is a meaningful migration. There is no other operational reason to jump past v0.19.x unless you specifically need DeepSeek V4 or the FA4 default MLA prefill.
Will v0.21.0 work on Apple Silicon directly?
Vanilla vLLM still runs CPU-only on macOS Apple Silicon via the requirements/cpu.txt build path, and the v0.21.0 changes are GPU-side and packaging changes that do not touch the CPU backend. That CPU path works but is slow. For practical GPU use on a Mac you want the community vllm-project/vllm-metal plugin or the waybarrios/vllm-mlx fork. Neither tracks vanilla vLLM patch-for-patch; they cherry-pick the changes that matter for their backend (FA4, MLA prefill, MoE routing) and skip server-infra and CUDA-packaging work. v0.21.0's headline change, the CUDA 13.0 default wheel, is irrelevant on Apple Silicon entirely.
What is the cleanest version to pin in production right now if I run a Mac agent against my own vLLM box?
If you are on a CUDA 13 host, pin v0.21.0 and call it done; the changes are strictly additive over v0.20.2. If you cannot move off CUDA 12, pin v0.19.1 and accept that you do not get DeepSeek V4 or FA4 prefill. If you are running on Apple Silicon and serving locally to your laptop, pin nothing on the vanilla repo; pull from vllm-project/vllm-metal main or waybarrios/vllm-mlx main and accept that 'latest' means what their last green CI run produced. None of those choices change the line in Fazm that wires your shim into the agent: ACPBridge.swift reads customApiEndpoint and sets ANTHROPIC_BASE_URL, regardless of which vLLM tag is behind the shim.
Where do I verify the version myself if I do not trust this page?
There are four places to check. pip index versions vllm prints the resolvable PyPI versions. curl -fsSL https://api.github.com/repos/vllm-project/vllm/releases/latest piped through a jq filter on .tag_name prints the tag reported by the GitHub release machinery, v0.21.0 as of May 16, 2026. docker pull vllm/vllm-openai:latest followed by docker inspect with a --format filter on RepoTags shows which tags the pulled image carries. The vllm.ai/releases page is the human-readable index. All four agree.
Related vLLM guides
vLLM release May 2026: v0.20.1 patch and the v0.20.0 dep-floor jump
The May 2026 timeline with version numbers and dates, plus the four lines of Swift inside Fazm that make a vLLM server upgrade invisible to your Mac agent.
vLLM release notes 2026: v0.18 and v0.19, and the toggle that wires vLLM into a Mac agent
What actually shipped in v0.18.0 and v0.19.0, the CVE-2026-0994 patch, and the Anthropic-shim wiring that makes any vLLM endpoint the brain of a Mac agent.
Run vLLM locally on Mac and plug it into an AI agent
From curl localhost:8000 to an agent driving Finder, Calendar, WhatsApp. The one Settings field that rewrites ANTHROPIC_BASE_URL to your local vLLM server.