Claude Code speaks the Anthropic Messages API. Point it at your own infrastructure and it routes every request there — no forks, no cloud. This wires up two models you host yourself: GLM 5.2 on a multi-GPU cluster for repo-scale work, and a 12B GGUF on a single laptop — each under its own named alias in the /model picker.
Both models in this guide come down to the same three switches. The base URL moves the
destination; the auth token authenticates against it; the custom-model option adds your
alias to the picker. Set them per shell, or drop them in
~/.claude/settings.json to make them stick.
ANTHROPIC_API_KEY, when pointing at your
own base URL — and leave ANTHROPIC_API_KEY empty so Claude Code doesn't silently
fall back to Anthropic auth. vLLM accepts a dummy value for both.GLM 5.2 is Z.ai's flagship coding model (released mid-June 2026): a 753B-parameter MoE with a usable 1M-token context, MIT-licensed open weights, and two reasoning tiers. The FP8 checkpoint is ~750 GB — multi-GPU server territory — but once it's up, vLLM exposes the Anthropic Messages API directly, so Claude Code talks to it with zero translation layer.
# ~750 GB of FP8 safetensors — budget 30-60 min on 10 GbE
huggingface-cli download zai-org/GLM-5.2-FP8 \
--local-dir /models/glm-5.2-fp8 \
--local-dir-use-symlinks False
# shard the weights across 8 GPUs; serve the Anthropic API
vllm serve "zai-org/GLM-5.2-FP8" \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--served-model-name claude-glm \
--host 0.0.0.0 --port 8000
# claude-glm → your vLLM cluster
export ANTHROPIC_BASE_URL=http://glm-host:8000
export ANTHROPIC_AUTH_TOKEN=dummy # vLLM doesn't validate it
export ANTHROPIC_API_KEY=dummy
# the --served-model-name you set above must match these
export ANTHROPIC_MODEL=claude-glm
export ANTHROPIC_DEFAULT_OPUS_MODEL=claude-glm
export ANTHROPIC_DEFAULT_SONNET_MODEL=claude-glm
export ANTHROPIC_DEFAULT_HAIKU_MODEL=claude-glm
export CLAUDE_CODE_SUBAGENT_MODEL=claude-glm
# let the agent use the full million-token window
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=1000000
claude
"CLAUDE_CODE_ATTRIBUTION_HEADER": "0" to the env block of
settings.json. Inside a session, run
/effort max to engage GLM 5.2's deepest reasoning tier.Not everyone has eight H200s. gemma-4-12B-coder (the Fable 5 × Composer 2.5 distill) is a 12B coding fine-tune that fits in ~4.5 GB of VRAM at Q2 and runs on a single laptop. Recent llama-server builds expose the same Anthropic-compatible endpoint, so the wiring is identical — just a different host.
| Quant | Size | Best for |
|---|---|---|
| Q2_K | 4.83 GB | tiniest — runs almost anywhere |
| Q3_K_M | 6.09 GB | good for 8 GB VRAM |
| Q4_K_M | 7.38 GB | the sweet spot — recommended |
| Q6_K | 9.79 GB | near-lossless |
| Q8_0 | 12.7 GB | basically full quality |
# grab a recent llama.cpp — this is the gemma4_unified arch,
# older builds won't load it
brew install llama.cpp
# pull the recommended quant from the Hub and serve it.
# --alias is the name Claude Code will send in each request.
llama-server \
-hf yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF:Q4_K_M \
--alias "claude-fable5" \
--ctx-size 16384 \
--n-gpu-layers 99 \
--cache-type-k q8_0 --cache-type-v q8_0 \
-fa on --no-mmap \
--temp 1.0 --top-p 0.95 --top-k 64 \
--host 0.0.0.0 --port 18080
# claude-fable5 → your local llama-server
export ANTHROPIC_BASE_URL=http://localhost:18080
export ANTHROPIC_AUTH_TOKEN="local" # unvalidated, any string
export ANTHROPIC_API_KEY=""
# the --alias you served must match these exactly
export ANTHROPIC_MODEL=claude-fable5
export ANTHROPIC_DEFAULT_SONNET_MODEL=claude-fable5
export ANTHROPIC_DEFAULT_HAIKU_MODEL=claude-fable5
export ANTHROPIC_DEFAULT_OPUS_MODEL=claude-fable5
export CLAUDE_CODE_SUBAGENT_MODEL=claude-fable5
claude
The sourceable scripts above give you one launch command per model. To make the aliases visible and selectable inside a running session, register them as custom model options. Claude Code skips validation here, so the name can be any string your endpoint accepts.
# adds a tidy entry to the /model picker
export ANTHROPIC_CUSTOM_MODEL_OPTION=claude-glm
export ANTHROPIC_CUSTOM_MODEL_OPTION_NAME="claude-glm"
export ANTHROPIC_CUSTOM_MODEL_OPTION_DESCRIPTION="GLM 5.2 · self-hosted vLLM"
_NAME and _DESCRIPTION overrides take effect when
ANTHROPIC_BASE_URL points at a gateway or self-hosted server. They have no effect
when you're connected directly to api.anthropic.com.
Need several versions of one family mapped to distinct IDs? Use the
modelOverrides setting in settings.json instead.Keep one script per model — use-glm.sh, use-fable5.sh — and switch with a single source. That's your model selector.
# repo-scale, 1M-context work on the cluster
source ./use-glm.sh
# private, offline session on the laptop
source ./use-fable5.sh
# back to the real thing
unset ANTHROPIC_BASE_URL ANTHROPIC_MODEL
claude
The failures you'll actually hit on self-hosted endpoints, and the one-line fix for each.
vllm serve (or llama-server) is answering at its port before launching
Claude Code. For the cluster, the FP8 download and 8-GPU shard load take minutes; wait for the
ready log line.haiku / sonnet / opus
aliases, which resolve to Anthropic IDs your endpoint doesn't serve. Map all of them to your
--served-model-name (or --alias) with the
ANTHROPIC_DEFAULT_*_MODEL variables shown above.CLAUDE_CODE_ATTRIBUTION_HEADER=0. Second,
KV-cache saturation: watch vllm:gpu_cache_usage_perc on /metrics and back off
once you cross ~90% sustained./effort max and confirm tool
calling is enabled in your serving stack. For the on-device model, make sure you're on a recent
llama.cpp build that supports the gemma4_unified architecture, and scope tasks smaller.ANTHROPIC_API_KEY can override your endpoint token. Set
ANTHROPIC_AUTH_TOKEN for self-hosted endpoints (vLLM accepts a dummy value).
To fully reset, unset ANTHROPIC_BASE_URL and relaunch.