Wire Any Model Into Claude Code — Self-Hosted Routing Guide

Three variables, no patching

Both models in this guide come down to the same three switches. The base URL moves the destination; the auth token authenticates against it; the custom-model option adds your alias to the picker. Set them per shell, or drop them in ~/.claude/settings.json to make them stick.

env · destination

ANTHROPIC_BASE_URL

Where requests go. Any server that answers /v1/messages — including your own.

self-hostedvLLM · SGLang · llama.cpp

env · identity

ANTHROPIC_AUTH_TOKEN

The bearer token for that endpoint. Use any placeholder for servers you own.

local"dummy" — unvalidated

Set ANTHROPIC_AUTH_TOKEN, not ANTHROPIC_API_KEY, when pointing at your own base URL — and leave ANTHROPIC_API_KEY empty so Claude Code doesn't silently fall back to Anthropic auth. vLLM accepts a dummy value for both.

The cluster route — claude-glm

GLM 5.2 is Z.ai's flagship coding model (released mid-June 2026): a 753B-parameter MoE with a usable 1M-token context, MIT-licensed open weights, and two reasoning tiers. The FP8 checkpoint is ~750 GB — multi-GPU server territory — but once it's up, vLLM exposes the Anthropic Messages API directly, so Claude Code talks to it with zero translation layer.

753B

params · MoE · FP8

context window

~750GB

VRAM · 8× H200 class

step 1 — download + serve (vLLM 0.23+)

# ~750 GB of FP8 safetensors — budget 30-60 min on 10 GbE
huggingface-cli download zai-org/GLM-5.2-FP8 \
  --local-dir /models/glm-5.2-fp8 \
  --local-dir-use-symlinks False

# shard the weights across 8 GPUs; serve the Anthropic API
vllm serve "zai-org/GLM-5.2-FP8" \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --served-model-name claude-glm \
  --host 0.0.0.0 --port 8000

step 2 — use-glm.sh

# claude-glm → your vLLM cluster
export ANTHROPIC_BASE_URL=http://glm-host:8000
export ANTHROPIC_AUTH_TOKEN=dummy      # vLLM doesn't validate it
export ANTHROPIC_API_KEY=dummy

# the --served-model-name you set above must match these
export ANTHROPIC_MODEL=claude-glm
export ANTHROPIC_DEFAULT_OPUS_MODEL=claude-glm
export ANTHROPIC_DEFAULT_SONNET_MODEL=claude-glm
export ANTHROPIC_DEFAULT_HAIKU_MODEL=claude-glm
export CLAUDE_CODE_SUBAGENT_MODEL=claude-glm

# let the agent use the full million-token window
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=1000000

claude

⚠

On vLLM ≤ 0.17.1, Claude Code's per-request system-prompt hash defeats prefix caching and throughput collapses. Either run vLLM > 0.17.1, or add "CLAUDE_CODE_ATTRIBUTION_HEADER": "0" to the env block of settings.json. Inside a session, run /effort max to engage GLM 5.2's deepest reasoning tier.

Prefer SGLang? Its RadixAttention reuses the long shared prompt across turns and can roughly triple requests/sec at the same hardware: python -m sglang.launch_server --model-path zai-org/GLM-5.2-FP8 --tp 8 --context-length 262144. Same wiring afterward.

The on-device route — claude-fable5

Not everyone has eight H200s. gemma-4-12B-coder (the Fable 5 × Composer 2.5 distill) is a 12B coding fine-tune that fits in ~4.5 GB of VRAM at Q2 and runs on a single laptop. Recent llama-server builds expose the same Anthropic-compatible endpoint, so the wiring is identical — just a different host.

12B
params · gemma4 arch
256K
context window
~4.5GB
min VRAM @ Q2_K

Quant	Size	Best for
Q2_K	4.83 GB	tiniest — runs almost anywhere
Q3_K_M	6.09 GB	good for 8 GB VRAM
Q4_K_M	7.38 GB	the sweet spot — recommended
Q6_K	9.79 GB	near-lossless
Q8_0	12.7 GB	basically full quality

step 1 — pull + serve

# grab a recent llama.cpp — this is the gemma4_unified arch,
# older builds won't load it
brew install llama.cpp

# pull the recommended quant from the Hub and serve it.
# --alias is the name Claude Code will send in each request.
llama-server \
  -hf yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF:Q4_K_M \
  --alias "claude-fable5" \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -fa on --no-mmap \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  --host 0.0.0.0 --port 18080

step 2 — use-fable5.sh

# claude-fable5 → your local llama-server
export ANTHROPIC_BASE_URL=http://localhost:18080
export ANTHROPIC_AUTH_TOKEN="local"     # unvalidated, any string
export ANTHROPIC_API_KEY=""

# the --alias you served must match these exactly
export ANTHROPIC_MODEL=claude-fable5
export ANTHROPIC_DEFAULT_SONNET_MODEL=claude-fable5
export ANTHROPIC_DEFAULT_HAIKU_MODEL=claude-fable5
export ANTHROPIC_DEFAULT_OPUS_MODEL=claude-fable5
export CLAUDE_CODE_SUBAGENT_MODEL=claude-fable5

claude

The --alias string and ANTHROPIC_MODEL must match character-for-character — that single string is the whole handshake. This model is Python/algorithmic-focused and not safety-aligned, so add your own guardrails for anything production.

Naming them in the /model picker

The sourceable scripts above give you one launch command per model. To make the aliases visible and selectable inside a running session, register them as custom model options. Claude Code skips validation here, so the name can be any string your endpoint accepts.

# adds a tidy entry to the /model picker
export ANTHROPIC_CUSTOM_MODEL_OPTION=claude-glm
export ANTHROPIC_CUSTOM_MODEL_OPTION_NAME="claude-glm"
export ANTHROPIC_CUSTOM_MODEL_OPTION_DESCRIPTION="GLM 5.2 · self-hosted vLLM"

The _NAME and _DESCRIPTION overrides take effect when ANTHROPIC_BASE_URL points at a gateway or self-hosted server. They have no effect when you're connected directly to api.anthropic.com. Need several versions of one family mapped to distinct IDs? Use the modelOverrides setting in settings.json instead.

Keep one script per model — use-glm.sh, use-fable5.sh — and switch with a single source. That's your model selector.

switching, day to day

# repo-scale, 1M-context work on the cluster
source ./use-glm.sh

# private, offline session on the laptop
source ./use-fable5.sh

# back to the real thing
unset ANTHROPIC_BASE_URL ANTHROPIC_MODEL
claude

When it doesn't connect

The failures you'll actually hit on self-hosted endpoints, and the one-line fix for each.

+Connection refused on launch

The inference server isn't up yet — the most common issue. Confirm vllm serve (or llama-server) is answering at its port before launching Claude Code. For the cluster, the FP8 download and 8-GPU shard load take minutes; wait for the ready log line.

+404 Not Found on background calls

Claude Code is still firing the haiku / sonnet / opus aliases, which resolve to Anthropic IDs your endpoint doesn't serve. Map all of them to your --served-model-name (or --alias) with the ANTHROPIC_DEFAULT_*_MODEL variables shown above.

+Throughput collapses under load

Two usual causes on a GLM cluster. First, the per-request system-prompt hash defeating prefix caching — fix with vLLM > 0.17.1 or CLAUDE_CODE_ATTRIBUTION_HEADER=0. Second, KV-cache saturation: watch vllm:gpu_cache_usage_perc on /metrics and back off once you cross ~90% sustained.

+It answers, but ignores tools or stalls mid-task

Claude Code's agentic loop was tuned for Claude's tool-calling. Open models can struggle with long read → think → edit chains. For GLM, run /effort max and confirm tool calling is enabled in your serving stack. For the on-device model, make sure you're on a recent llama.cpp build that supports the gemma4_unified architecture, and scope tasks smaller.

+Stuck on Anthropic even after setting the URL

A leftover ANTHROPIC_API_KEY can override your endpoint token. Set ANTHROPIC_AUTH_TOKEN for self-hosted endpoints (vLLM accepts a dummy value). To fully reset, unset ANTHROPIC_BASE_URL and relaunch.