Deploying Gemma 4 26B A4B on an RTX 5090
Notes on standing up a private Gemma 4 26B A4B inference endpoint on an RTX 5090 with vLLM — the dead ends, the working setup, and the reasoning behind each decision.
I wanted a fast, private, tool-calling-capable Gemma 4 26B A4B endpoint to use with my coding agent. The final setup runs at ~196 tok/s decode on a single 5090, supports full tool calling, and handles 96k context. Getting there took a solid afternoon of working through CUDA drivers, a broken vLLM nightly, two different quantization formats, and RunPod’s host heterogeneity.
This post is the writeup I wish I’d had going in. It includes both the final working configuration and the dead ends, because anyone doing this is going to hit at least some of them.
If you want to self-host Gemma 4 26B A4B via vLLM on consumer Blackwell hardware, this should map almost directly; most of it should also generalize to other 2026-era MoE models on 5090s.
TL;DR — The Working Configuration
If you just want the answer:
Container image
vllm/vllm-openai:gemma4

Container start command (RunPod Serverless, args-after-entrypoint style)
serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit --max-model-len 96000 --kv-cache-dtype fp8 --gpu-memory-utilization 0.95 --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 --chat-template examples/tool_chat_template_gemma4.jinja --async-scheduling --host 0.0.0.0 --port 8000

Environment variables
HF_TOKEN=<your token>
HF_HOME=/runpod-volume/huggingface

Storage
- Container disk: 30 GB
- Network volume: 50 GB (holds model cache across pod restarts)
Host filter
- Minimum CUDA 12.9 (critical — see “Dead End #1” below)
Performance achieved
- Decode throughput: ~196 tok/s
- Cold start: ~95 seconds (first time); faster with cached torch.compile artifacts
- TTFT: 1-3s warm, 10s+ cold first request
The rest of this post is why every line above is what it is, and what I tried first that didn’t work.
Why This Model, Why This Hardware
Gemma 4 26B A4B is an MoE model: 26B total params but only 4B active per token. That combination suits decode-heavy workloads on a single consumer GPU — you get the capability of a 26B model with the memory bandwidth cost of a 4B model during generation. Benchmarks put it competitive with 30B+ dense models on coding and reasoning tasks.
RTX 5090 has 32 GB of GDDR7 at 1,792 GB/s memory bandwidth — roughly 3x an M5 Max and approaching H100 territory. For single-user decode, memory bandwidth dominates, so a 5090 running a 4-bit-quantized 26B MoE is a good fit for the job.
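The bandwidth-bound claim is easy to sanity-check with back-of-envelope arithmetic: during decode, every generated token has to pull the active expert weights across the memory bus at least once, so bandwidth divided by active-weight bytes gives a hard ceiling. A rough sketch (the 4B-active and 4-bit figures come from the setup above; real decode lands well below the ceiling because of KV-cache reads, attention compute, and scheduling overhead):

```python
# Back-of-envelope decode ceiling for a 4B-active MoE at 4-bit on a 5090.
active_params = 4e9      # ~4B params routed per token
bits_per_weight = 4      # 4-bit quantized weights
bandwidth_gbs = 1792     # RTX 5090 GDDR7 bandwidth, GB/s

bytes_per_token = active_params * bits_per_weight / 8   # ~2 GB streamed per token
ceiling_tps = bandwidth_gbs * 1e9 / bytes_per_token

print(f"theoretical ceiling: {ceiling_tps:.0f} tok/s")  # ~896 tok/s
```

Landing at ~196 tok/s against a ~900 tok/s ceiling is a typical real-world fraction once attention and overheads are counted in.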
I’d previously been running the unquantized model on an H100 SXM through RunPod. It worked fine, but for a single-user coding agent the hourly rate was hard to justify. Moving to a 4-bit quant on a 5090 was fundamentally a cost decision: the quantized weights fit comfortably in 32 GB with room for a large KV cache, and per-second serverless billing on consumer Blackwell is a much better fit for personal infra than reserving datacenter silicon I only use a few hours a day.
The Plan (What I Thought Would Work)
Original plan, in order:
- Use the NVFP4-quantized weights: RedHatAI/gemma-4-26B-A4B-it-NVFP4. NVFP4 is Blackwell-native 4-bit floating point, supposedly the fastest possible format on a 5090.
- Use the vllm/vllm-openai:gemma4 image, cut specifically for the Gemma 4 launch.
- Stand up a RunPod Serverless Load Balancer endpoint on a 5090.
- Add tool calling flags, point opencode at it, done.
Not much of that worked on the first try.
Dead End #1: CUDA 12.9 Driver Mismatch
What happened: Container wouldn’t even start. I got:
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.9,
please update your driver to a newer version

The vllm/vllm-openai:gemma4 image is built against CUDA 12.9. RunPod’s Serverless fleet is heterogeneous — some hosts have NVIDIA driver 570+ (supports CUDA 12.9), some don’t. You don’t get to pick a specific host, you get whatever the scheduler gives you.
The fix: RunPod has a “Min CUDA Version” filter in the endpoint settings. Setting it to 12.9 means the scheduler only routes workers to hosts with a compatible driver. This shrinks your available host pool but guarantees the image will start.
Lesson: On consumer GPUs in cloud environments, driver versions are not homogeneous. Datacenter cards like H100s tend to be kept current; consumer cards (4090, 5090) are much more variable. Always check for a CUDA/driver filter when using bleeding-edge image tags. If your provider doesn’t offer one, you may need to pull an older image built against an older CUDA version — at the cost of missing features your model needs.
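One way to fail fast with a clear message instead of a cryptic container error: check the driver's supported CUDA version before launching the server. A minimal sketch, assuming nvidia-smi is on PATH and prints its usual banner (the version comparison is just tuple ordering):

```python
import re
import subprocess

def parse_cuda_version(nvidia_smi_output):
    """Extract the driver-supported CUDA version from nvidia-smi's banner."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    return (int(m.group(1)), int(m.group(2))) if m else None

def check_host(min_cuda=(12, 9)):
    """Raise early if this host's driver can't run the image. Call at startup."""
    out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    ver = parse_cuda_version(out)
    if ver is None or ver < min_cuda:
        raise RuntimeError(f"host supports CUDA {ver}, image needs >= {min_cuda}")
    return ver

# Parsing works on the banner text alone, no GPU needed:
sample = "NVIDIA-SMI 570.86   Driver Version: 570.86   CUDA Version: 12.9"
print(parse_cuda_version(sample))  # (12, 9)
```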
Dead End #2: The NVFP4 MoE Loading Bug
Host filter applied, image started. NVFP4 kernels initialized cleanly:
Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM
Using 'VLLM_CUTLASS' NvFp4 MoE backend

Then, during weight loading, it failed with:
KeyError: 'layers.0.experts.0.down_proj.input_global_scale'

The RedHat model card had warned this would happen: NVFP4 MoE support for Gemma 4 specifically required vLLM PR #39045, which hadn’t been merged into the :gemma4 tagged image. Linear NVFP4 worked; MoE NVFP4 didn’t — the expert weight name mapping wasn’t in place.
What I tried: Switch to vllm/vllm-openai:nightly (which has the PR merged on main).
What happened instead: The nightly was itself broken that day:
ModuleNotFoundError: No module named 'pandas'

Someone had landed a commit to _aiter_ops.py (AMD’s AITER kernel library) that unconditionally imported pandas without adding it to the image’s dependencies. This was a classic half-merged refactor: worked on the maintainer’s dev machine, broke the Docker image for everyone.
Lesson #1: Nightlies are broken more often than you’d think — maybe a few times per month for fast-moving projects like vLLM. Don’t build critical deployments on nightly tags. If you absolutely need a PR that isn’t in a stable release yet, pin to a specific dated nightly that you’ve verified works, or be prepared to patch the image at startup.
Lesson #2: “Supported in vLLM” has granularity. A quantization format might work for linear layers but not MoE experts, or work for dense models but not routing layers. Always check whether a specific combination of model + format + layer type is known to work. Running the model card’s example command on the exact image the quant author used to test is the lowest-risk path.
The Actual Working Path: AWQ with Marlin
Stepping back: I needed a 4-bit MoE quantization that actually loads today, on a stable image, on Blackwell.
AWQ (Activation-aware Weight Quantization) checks every box I cared about:
- Mature in vLLM: W4A16 (4-bit weights, 16-bit activations) has been supported for well over a year. The MoE path uses Marlin kernels, which are production-grade.
- Works on Blackwell: AWQ doesn’t require FP4 tensor cores. It loads INT4 weights, dequantizes to FP16 in-register inside a fused kernel, and runs matmul on FP16 tensor cores. No emulation, no exotic code paths.
- Quality: Typically retains 97-99% of bf16 quality.
cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit is 17.2 GB on disk and fits comfortably in 32 GB with room for a large KV cache.
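To sanity-check "fits comfortably," here is the rough VRAM budget. The attention dimensions below are placeholders, not Gemma 4's published config; substitute the real values from the model's config.json before trusting the number:

```python
# Rough VRAM budget: AWQ weights + fp8 KV cache at 96k context.
# NOTE: num_layers / num_kv_heads / head_dim are assumed placeholder values.
num_layers = 48        # assumed
num_kv_heads = 8       # assumed (GQA)
head_dim = 128         # assumed
kv_dtype_bytes = 1     # --kv-cache-dtype fp8

# K and V each store num_kv_heads * head_dim values per layer per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes
kv_cache_gb = 96_000 * kv_bytes_per_token / 1e9

weights_gb = 17.2      # AWQ checkpoint size on disk
total_gb = weights_gb + kv_cache_gb
print(f"KV cache @96k: {kv_cache_gb:.1f} GB, total ~{total_gb:.1f} GB of 32 GB")
```

With these assumed dims the KV cache runs around 9-10 GB, leaving the total under the 0.95 utilization cap; fp16 KV would double that and start crowding the budget.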
Performance tradeoff vs. NVFP4
NVFP4 is theoretically faster on Blackwell because it uses native FP4 tensor cores — no dequantization step at all. But for single-user decode on a memory-bandwidth-bound MoE, the difference is smaller than you’d expect:
| Format | Weight load bandwidth | Compute path | Relative decode speed |
|---|---|---|---|
| NVFP4 | 4x vs FP16 | FP4 tensor cores (native Blackwell) | 100% |
| FP8 | 2x vs FP16 | FP8 tensor cores | ~85-90% |
| AWQ INT4 | 4x vs FP16 | FP16 tensor cores (after fused dequant) | ~75-85% |
For single-user chat where you’re waiting on memory bandwidth, AWQ closes most of the gap because both formats achieve the same 4x weight compression. NVFP4 wins more on prefill and batching (where compute dominates) than on decode.
Real-world decode on my setup: ~196 tok/s on AWQ. I’d probably see 220-240 tok/s with NVFP4 if it had worked. For a single-user coding agent, that difference doesn’t really matter.
Dead End #3: Load Balancer Host Flake
After switching to AWQ, I still hit:
RuntimeError: Unexpected error from cudaGetDeviceCount().
Error 804: forward compatibility was attempted on non supported HW

Error 804 is CUDA’s “forward compat mode” failing. Even with the CUDA 12.9 filter, a small fraction of hosts had a driver state where forward compat was attempted but not supported by that specific GPU. The host was technically in the 12.9+ pool but had an edge-case config that broke CUDA init.
The fix: Terminate the worker, let a new one spawn. I got lucky on the next try. If it had persisted, raising the CUDA filter to 13.0+ would have excluded the problematic hosts at the cost of a smaller fleet.
Lesson: Cloud GPU hosts are individually unreliable. Your recovery strategy needs to include “restart and try again,” not just “debug the specific failure.” On Serverless in particular, a worker is a short-lived thing — treat it that way.
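On the client side, "restart and try again" translates to a retry wrapper around calls to the endpoint. A minimal sketch (catching bare Exception is deliberately blunt here; in practice you would narrow it to connection errors and 5xx responses):

```python
import time

def with_retries(fn, attempts=4, base_delay=2.0):
    """Call fn(), retrying transient failures with exponential backoff.

    Useful for serverless endpoints where the first request may land on a
    cold or flaky worker that recovers on the next attempt.
    """
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if i == attempts - 1:
                raise  # out of retries: surface the real error
            delay = base_delay * 2 ** i
            print(f"attempt {i + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

# Usage with the OpenAI client from later in the post:
# result = with_retries(lambda: client.chat.completions.create(...))
```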
The Final Setup, Annotated
Container image
vllm/vllm-openai:gemma4

Purpose-built image for the Gemma 4 launch. Stable, has the AWQ Marlin paths, includes the chat template file. Don’t use :latest unless you know what vLLM version it points to. Don’t use :nightly unless you’ve tested today’s build.
Start command
serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit --max-model-len 96000 --kv-cache-dtype fp8 --gpu-memory-utilization 0.95 --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 --chat-template examples/tool_chat_template_gemma4.jinja --async-scheduling --host 0.0.0.0 --port 8000

Flag-by-flag:
- serve <model> — vLLM’s serve subcommand. Note: model name is a positional arg, not --model (deprecated). If you’re getting “unrecognized arguments” errors, this is usually why.
- --max-model-len 96000 — Gemma 4 26B A4B supports 256k native, but 96k is plenty for coding agents and fits comfortably with the KV cache. I started at 8192 to prove the base setup worked, then scaled up.
- --kv-cache-dtype fp8 — Halves KV cache memory. Blackwell has native FP8 so there’s no perf penalty. Combined with AWQ weights, you get a huge effective context window.
- --gpu-memory-utilization 0.95 — Aggressive but safe with the rest of the config. Leaves headroom for CUDA graphs and activations.
- --enable-auto-tool-choice + --tool-call-parser gemma4 — Both required for tool calling to actually work. The first enables the feature; the second tells vLLM how to parse Gemma 4’s tool-call special tokens.
- --reasoning-parser gemma4 — Enables the <|channel>thought\n...<channel|> parser. Thinking mode is off by default but configurable per-request; the parser is always needed.
- --chat-template examples/tool_chat_template_gemma4.jinja — Overrides the default chat template with one optimized for tool calling. Path is relative to vLLM’s install dir; the :gemma4 image has this file. If your image doesn’t, download it from the vLLM repo and mount it.
- --async-scheduling — Overlaps request scheduling with decoding. Small throughput boost, ~5-10%.
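Because thinking mode is per-request, the toggle travels in the request body rather than the server flags. The kwarg name below is an assumption: `enable_thinking` is how comparable models expose this through vLLM's `chat_template_kwargs`, but check the Gemma 4 template for the actual name before relying on it:

```python
# Hypothetical per-request thinking toggle via vLLM's chat_template_kwargs.
# The "enable_thinking" key is an assumption borrowed from similar models;
# verify against the Gemma 4 chat template.
request_kwargs = dict(
    model="cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    messages=[{"role": "user", "content": "Plan a refactor of this module."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

# Against a live endpoint (client from the next section):
# resp = client.chat.completions.create(**request_kwargs)
# reasoning = getattr(resp.choices[0].message, "reasoning_content", None)
```

With `--reasoning-parser gemma4` active, vLLM separates the thought channel from the final answer rather than leaving raw special tokens in the content.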
Environment variables
HF_TOKEN=<your token>
HF_HOME=/runpod-volume/huggingface

HF_HOME points HuggingFace’s cache to your persistent network volume. This is the single most important env var for cloud deployments — without it, you redownload the model (17+ GB for AWQ) every cold start.
Storage on RunPod Serverless
- Container disk: 30 GB. Billed per worker slot whether idle or not (~$0.10/GB/month). Don’t over-provision.
- Network volume: 50 GB (~$0.07/GB/month, shared across all workers). Holds the model and torch.compile cache.
If you’re running max workers = 1 (correct for single-user), your idle storage is ~$6.50/month total. Raising max workers to 3 “in case” would triple your container disk cost with zero benefit.
Client-Side: Talking to the Endpoint
Standard OpenAI-compatible Python client:
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()

client = OpenAI(
    base_url="https://<your-endpoint-id>.api.runpod.ai/v1",
    api_key=os.environ['RUNPOD_API_KEY']
)

response = client.chat.completions.create(
    model="cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    messages=[
        {"role": "user", "content": "Write a Python function to parse a CSV."}
    ],
    max_tokens=2048,
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 64},
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

Three things are easy to get wrong:
- Model name must exactly match the one in the start command. If vLLM served cyankiwi/...-AWQ-4bit and you send google/gemma-4-26B-A4B-it, you get a 404. Hit GET /v1/models on your endpoint to see the registered name.
- Sampling params. Gemma 4’s recommended defaults are temperature=1.0, top_p=0.95, top_k=64. The model’s generation_config.json applies these automatically when you don’t override, but being explicit never hurts.
- Streaming. RunPod’s Load Balancer has a request timeout. Streaming keeps the connection alive with data flowing, so you never hit it. Use stream=True by default.
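The first gotcha is worth automating: query /v1/models once at startup and resolve the model name from what the server actually reports instead of hardcoding and hoping. A small helper (the live call is sketched in the comment; a single-model vLLM deployment serves exactly one id):

```python
def pick_served_model(model_ids, preferred):
    """Return preferred if the server registered it, else fall back sensibly.

    Single-model vLLM deployments serve exactly one id, so falling back to
    it is safe; with multiple ids, failing loudly beats a silent 404.
    """
    if preferred in model_ids:
        return preferred
    if len(model_ids) == 1:
        return model_ids[0]
    raise ValueError(f"{preferred!r} not served; available: {model_ids}")

# Against a live endpoint, using the client from above:
# served = [m.id for m in client.models.list().data]
# model = pick_served_model(served, "cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit")
```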
Benchmarking: The Numbers
vLLM’s log line for throughput is averaged over a window that includes idle time. To measure actual performance, time a real request:
import time

start = time.time()
first_token_time = None
tokens = 0

response = client.chat.completions.create(
    model="cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    messages=[{"role": "user", "content": "Write 500 words on transformers."}],
    max_tokens=1024,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        if first_token_time is None:
            first_token_time = time.time()
        tokens += 1  # each streamed chunk carries roughly one token

end = time.time()
ttft = first_token_time - start
decode_time = end - first_token_time
decode_tps = tokens / decode_time

print(f"TTFT: {ttft:.2f}s")
print(f"Decode: {decode_tps:.1f} tok/s")

My results:
TTFT: 10.94s (first request, cold)
Decode: 195.8 tok/s
Total tokens: 731

TTFT drops to 1-3s on warm subsequent requests. Decode stays around 180-200 tok/s. For context, an M5 Max runs this model at roughly 20-30 tok/s — for single-user decode, a 5090 is a significant step up.
What I’d Do Differently
Start with the most mature quantization format, not the newest. I wasted hours on NVFP4 when AWQ would have worked immediately. The right default for new model releases is: if a known-good quant exists in a format vLLM has supported for 6+ months (AWQ, GPTQ, FP8), start there. Only reach for bleeding-edge formats (NVFP4) once stable vLLM releases have supported them for at least a few weeks.
Use the purpose-built image tag, not nightly. The vllm/vllm-openai:gemma4 tag exists because someone at vLLM blessed it as working for Gemma 4. Nightly is for PR validation, not deployment.
Set the CUDA filter before the first deployment attempt. Not after your first error. It’s a one-line change in the RunPod config and eliminates an entire class of failure.
Don’t deploy through the Load Balancer UI without understanding per-worker billing. Container disk gets billed per worker slot, not per active worker. Max workers = 1 for single-user is almost always right.
Plan for cold-start latency in your application. 95 seconds is a long time to wait for your first response. If latency matters, set active workers = 1 and eat the continuous cost. If not, accept the cold start.
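A middle ground between eating the cold start and paying for an always-active worker: fire a throwaway warm-up request when your application starts, so the 95 seconds elapse before the user's first real prompt rather than during it. A sketch using the client from earlier:

```python
import threading

def warm_up(client, model):
    """Send a best-effort 1-token request in the background to absorb the
    cold start. Returns the thread so callers can join() if they want."""
    def ping():
        try:
            client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,
            )
        except Exception:
            pass  # a failed warm-up just means a slower first real request
    t = threading.Thread(target=ping, daemon=True)
    t.start()
    return t

# At app startup:
# warm_up(client, "cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit")
```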
Total Cost
For my single-user setup on RunPod Serverless:
- Container disk (30 GB × 1 worker slot × $0.10/GB/month): $3/month
- Network volume (50 GB × $0.07/GB/month): $3.50/month
- Compute: per-second billing only during active requests. For moderate coding agent usage (~2 hours of active GPU time per day at 5090 rates), roughly $15-30/month.
Total: ~$22-37/month for a private, fast, tool-calling-capable Gemma 4 26B endpoint. Compare to Claude API usage at similar volume, or a local setup requiring a $2,000 GPU purchase.
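The arithmetic behind that range, with the 5090 serverless hourly rate as the one assumed input ($0.25 to $0.50/hr brackets the $15-30 compute figure above; substitute RunPod's current rate):

```python
# Monthly cost model for the single-worker setup.
storage = 30 * 0.10 + 50 * 0.07   # container disk + network volume, ~$6.50 idle floor
hours_per_month = 2 * 30          # ~2 active GPU hours per day

for rate in (0.25, 0.50):         # assumed 5090 serverless $/hr range
    total = storage + rate * hours_per_month
    print(f"@${rate}/hr: ${total:.2f}/month")
```

The idle floor is what you pay even in a month with zero requests; everything above it scales linearly with active GPU time.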
Deploy it yourself
The whole config above is packaged as a RunPod template — same image, same flags, same env vars. Pick a 5090, drop in your HF_TOKEN, and it should come up identically to what I’m running.
If you’re new to RunPod and sign up through this link, both of us get a one-time credit somewhere between $5 and $500 after your first $10 top-up. In the interest of being upfront: I also earn a small share of spend from referred users during their first six months (5% on Serverless, 3% on Pods), and the template itself earns 1% of the revenue it generates. None of that changes what you pay. If you were going to try RunPod anyway, those links are a low-effort way to split the benefit.
References & Links
- vLLM documentation
- Gemma 4 on HuggingFace
- cyankiwi AWQ quant
- RedHatAI NVFP4 quant (for when vLLM stable catches up to PR #39045)
- Gemma 4 chat template
- RunPod Serverless docs
If you run into something I didn’t cover, let me know — the MLOps deployment surface is too broad for one post to cover everything. This is the setup I’ll be running until Gemma 5 shows up.