claude-code · model-router ~/.claude/settings.json self-hosted on-device aliases
Custom model routing · self-hosted

Run claude‑glm and
claude‑fable5 on
hardware you own

Claude Code speaks the Anthropic Messages API. Point it at your own infrastructure and it routes every request there — no forks, no cloud. This wires up two models you host yourself: GLM 5.2 on a multi-GPU cluster for repo-scale work, and a 12B GGUF on a single laptop — each under its own named alias in the /model picker.

GLM-5.2-FP8 · vLLM cluster gemma-4-12B-coder · on-device 3 env vars · 0 forks
how the redirect works ANTHROPIC_BASE_URL rewrites the destination — nothing else changes
Origin Claude Code Emits requests in /v1/messages format, same as always.
claude-glm →
claude-fable5 →
Endpoint · cluster glm-host:8000 vLLM serves the Anthropic Messages API natively.
Endpoint · on-device localhost:18080 llama-server, also Anthropic-compatible out of the box.
01

Three variables, no patching

Both models in this guide come down to the same three switches. The base URL moves the destination; the auth token authenticates against it; the custom-model option adds your alias to the picker. Set them per shell, or drop them in ~/.claude/settings.json to make them stick.

env · destination
ANTHROPIC_BASE_URL
Where requests go. Any server that answers /v1/messages — including your own.
self-hostedvLLM · SGLang · llama.cpp
env · identity
ANTHROPIC_AUTH_TOKEN
The bearer token for that endpoint. Use any placeholder for servers you own.
local"dummy" — unvalidated
!
Set ANTHROPIC_AUTH_TOKEN, not ANTHROPIC_API_KEY, when pointing at your own base URL — and leave ANTHROPIC_API_KEY empty so Claude Code doesn't silently fall back to Anthropic auth. vLLM accepts a dummy value for both.
02

The cluster route — claude-glm

GLM 5.2 is Z.ai's flagship coding model (released mid-June 2026): a 753B-parameter MoE with a usable 1M-token context, MIT-licensed open weights, and two reasoning tiers. The FP8 checkpoint is ~750 GB — multi-GPU server territory — but once it's up, vLLM exposes the Anthropic Messages API directly, so Claude Code talks to it with zero translation layer.

753B
params · MoE · FP8
1M
context window
~750GB
VRAM · 8× H200 class
step 1 — download + serve (vLLM 0.23+)
# ~750 GB of FP8 safetensors — budget 30-60 min on 10 GbE
huggingface-cli download zai-org/GLM-5.2-FP8 \
  --local-dir /models/glm-5.2-fp8 \
  --local-dir-use-symlinks False

# shard the weights across 8 GPUs; serve the Anthropic API
vllm serve "zai-org/GLM-5.2-FP8" \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --served-model-name claude-glm \
  --host 0.0.0.0 --port 8000
step 2 — use-glm.sh
# claude-glm → your vLLM cluster
export ANTHROPIC_BASE_URL=http://glm-host:8000
export ANTHROPIC_AUTH_TOKEN=dummy      # vLLM doesn't validate it
export ANTHROPIC_API_KEY=dummy

# the --served-model-name you set above must match these
export ANTHROPIC_MODEL=claude-glm
export ANTHROPIC_DEFAULT_OPUS_MODEL=claude-glm
export ANTHROPIC_DEFAULT_SONNET_MODEL=claude-glm
export ANTHROPIC_DEFAULT_HAIKU_MODEL=claude-glm
export CLAUDE_CODE_SUBAGENT_MODEL=claude-glm

# let the agent use the full million-token window
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=1000000

claude
On vLLM ≤ 0.17.1, Claude Code's per-request system-prompt hash defeats prefix caching and throughput collapses. Either run vLLM > 0.17.1, or add "CLAUDE_CODE_ATTRIBUTION_HEADER": "0" to the env block of settings.json. Inside a session, run /effort max to engage GLM 5.2's deepest reasoning tier.
i
Prefer SGLang? Its RadixAttention reuses the long shared prompt across turns and can roughly triple requests/sec at the same hardware: python -m sglang.launch_server --model-path zai-org/GLM-5.2-FP8 --tp 8 --context-length 262144. Same wiring afterward.
03

The on-device route — claude-fable5

Not everyone has eight H200s. gemma-4-12B-coder (the Fable 5 × Composer 2.5 distill) is a 12B coding fine-tune that fits in ~4.5 GB of VRAM at Q2 and runs on a single laptop. Recent llama-server builds expose the same Anthropic-compatible endpoint, so the wiring is identical — just a different host.

12B
params · gemma4 arch
256K
context window
~4.5GB
min VRAM @ Q2_K
QuantSizeBest for
Q2_K4.83 GBtiniest — runs almost anywhere
Q3_K_M6.09 GBgood for 8 GB VRAM
Q4_K_M7.38 GBthe sweet spot — recommended
Q6_K9.79 GBnear-lossless
Q8_012.7 GBbasically full quality
step 1 — pull + serve
# grab a recent llama.cpp — this is the gemma4_unified arch,
# older builds won't load it
brew install llama.cpp

# pull the recommended quant from the Hub and serve it.
# --alias is the name Claude Code will send in each request.
llama-server \
  -hf yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF:Q4_K_M \
  --alias "claude-fable5" \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -fa on --no-mmap \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  --host 0.0.0.0 --port 18080
step 2 — use-fable5.sh
# claude-fable5 → your local llama-server
export ANTHROPIC_BASE_URL=http://localhost:18080
export ANTHROPIC_AUTH_TOKEN="local"     # unvalidated, any string
export ANTHROPIC_API_KEY=""

# the --alias you served must match these exactly
export ANTHROPIC_MODEL=claude-fable5
export ANTHROPIC_DEFAULT_SONNET_MODEL=claude-fable5
export ANTHROPIC_DEFAULT_HAIKU_MODEL=claude-fable5
export ANTHROPIC_DEFAULT_OPUS_MODEL=claude-fable5
export CLAUDE_CODE_SUBAGENT_MODEL=claude-fable5

claude
i
The --alias string and ANTHROPIC_MODEL must match character-for-character — that single string is the whole handshake. This model is Python/algorithmic-focused and not safety-aligned, so add your own guardrails for anything production.
04

Naming them in the /model picker

The sourceable scripts above give you one launch command per model. To make the aliases visible and selectable inside a running session, register them as custom model options. Claude Code skips validation here, so the name can be any string your endpoint accepts.

register a labeled option
# adds a tidy entry to the /model picker
export ANTHROPIC_CUSTOM_MODEL_OPTION=claude-glm
export ANTHROPIC_CUSTOM_MODEL_OPTION_NAME="claude-glm"
export ANTHROPIC_CUSTOM_MODEL_OPTION_DESCRIPTION="GLM 5.2 · self-hosted vLLM"
i
The _NAME and _DESCRIPTION overrides take effect when ANTHROPIC_BASE_URL points at a gateway or self-hosted server. They have no effect when you're connected directly to api.anthropic.com. Need several versions of one family mapped to distinct IDs? Use the modelOverrides setting in settings.json instead.

Keep one script per model — use-glm.sh, use-fable5.sh — and switch with a single source. That's your model selector.

switching, day to day
# repo-scale, 1M-context work on the cluster
source ./use-glm.sh

# private, offline session on the laptop
source ./use-fable5.sh

# back to the real thing
unset ANTHROPIC_BASE_URL ANTHROPIC_MODEL
claude
05

When it doesn't connect

The failures you'll actually hit on self-hosted endpoints, and the one-line fix for each.

+Connection refused on launch
The inference server isn't up yet — the most common issue. Confirm vllm serve (or llama-server) is answering at its port before launching Claude Code. For the cluster, the FP8 download and 8-GPU shard load take minutes; wait for the ready log line.
+404 Not Found on background calls
Claude Code is still firing the haiku / sonnet / opus aliases, which resolve to Anthropic IDs your endpoint doesn't serve. Map all of them to your --served-model-name (or --alias) with the ANTHROPIC_DEFAULT_*_MODEL variables shown above.
+Throughput collapses under load
Two usual causes on a GLM cluster. First, the per-request system-prompt hash defeating prefix caching — fix with vLLM > 0.17.1 or CLAUDE_CODE_ATTRIBUTION_HEADER=0. Second, KV-cache saturation: watch vllm:gpu_cache_usage_perc on /metrics and back off once you cross ~90% sustained.
+It answers, but ignores tools or stalls mid-task
Claude Code's agentic loop was tuned for Claude's tool-calling. Open models can struggle with long read → think → edit chains. For GLM, run /effort max and confirm tool calling is enabled in your serving stack. For the on-device model, make sure you're on a recent llama.cpp build that supports the gemma4_unified architecture, and scope tasks smaller.
+Stuck on Anthropic even after setting the URL
A leftover ANTHROPIC_API_KEY can override your endpoint token. Set ANTHROPIC_AUTH_TOKEN for self-hosted endpoints (vLLM accepts a dummy value). To fully reset, unset ANTHROPIC_BASE_URL and relaunch.