Z.AI User's Manual
As of … A practical guide to GLM-5.1, Zhipu AI / Z.AI's open-source flagship released April 8, 2026 under the MIT license: a 745B-parameter MoE with 44B active, 200K context, and DeepSeek Sparse Attention. The first frontier model from China's first publicly traded AI company.
Z.AI (formerly Zhipu AI) is a major Beijing-based AI lab and, notably, China's first publicly traded AI company. Its GLM family is one of the world's most active open-source frontier lineups: GLM-4.5 (July 2025), GLM-5 (Feb 2026), and GLM-5.1 (April 2026, open-source).
The headline: GLM-5.1 is a 745B-parameter MoE with 44B active per token, 200K context, and MIT-licensed open weights. It also shows up on attap.ai in two flavors: Z.AI's hosted version, and a faster variant (branded "GLM 4.7") running on Cerebras's wafer-scale silicon.
Getting started in 60 seconds
- Pick your door: chat.z.ai for chat, docs.z.ai for the API, huggingface.co/zai-org for open weights.
- Sign in with email; the API has its own console.
- Pick the model: `glm-5.1` for the open flagship, `glm-5` for the earlier original, `glm-4.5-air` for the cheaper compact tier.
- Lean on the open-source path. Z.AI's most distinctive bet is permissive open licensing: if you have GPUs, self-hosting costs less than hosted at scale.
Which Z.AI surface should I use?
| Surface | Role | Highlights |
|---|---|---|
| chat.z.ai | Free consumer chat | Free tier with rate limits; English + Chinese first-class; web search, file upload. |
| Z.AI API (docs.z.ai) | Developer API | OpenAI-compatible chat completions; pay-as-you-go; function calling, JSON mode. |
| Open weights / partners (HF, Cerebras, attap.ai) | Self-hosting and third-party hosts | MIT license, fully permissive; Cerebras-hosted variant runs faster on wafer-scale silicon; OpenRouter / DeepInfra / Together offer hosted variants; self-host for compliance / data residency. |
Prompt fundamentals (GLM edition)
- Use the 200K context deliberately. It is smaller than the 1M+ tier but big enough for the vast majority of real workloads; DeepSeek Sparse Attention keeps long-text performance high.
- Treat GLM-5.1 as a coding/agentic peer to V4 Pro. Z.AI explicitly positions GLM-5 as going "from vibe coding to agentic engineering."
- Open weights are the distinctive value. If you can self-host, GLM gives you frontier-class capability with no per-token bill.
Z.AI's lineup has four active members: GLM-5.1 (open, April 2026, the flagship), GLM-5 (Feb 2026, the original), GLM-4.5 (July 2025, MoE), and GLM-4.5-Air (compact tier, 106B/12B). All ship MIT-licensed open weights.
Current GLM lineup
| Model | Total / active params | Context | Released | Notes |
|---|---|---|---|---|
| GLM-5.1 flagship · open | ~745B / ~44B (MoE, 256 experts, 8 active) | 200K | 2026-04-08 (open-source release; subscription late March) | Open-source (MIT). DeepSeek Sparse Attention. "Vibe coding to agentic engineering." |
| GLM-5 | ~745B / ~40-44B (MoE) | 200K | 2026-02-11 | Original GLM-5; 28.5T-token pre-training corpus. |
| GLM-4.5 | 355B / 32B (MoE) | 128K | 2025-07 | Previous flagship; agentic-tuned. |
| GLM-4.5-Air | 106B / 12B (MoE) | 128K | 2025-07 | Compact tier; cheaper inference. |
GLM-5.1 — deep dive
| Area | What GLM-5.1 does |
|---|---|
| Architecture | ~745B total / 44B active via MoE (256 experts, 8 activated per token, ~5.9% sparsity). |
| Pre-training data | 28.5 trillion tokens (up from 23T on GLM-4.5). |
| Context | 200,000 input tokens. |
| Attention | DeepSeek Sparse Attention integrated for the first time; keeps long-text quality near-lossless while reducing serving cost. |
| License | MIT — most permissive available. Commercial use, fine-tuning, redistribution all allowed. |
| Positioning | "From vibe coding to agentic engineering." Strong on agentic + reasoning + Chinese. |
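To make the sparsity figures concrete, here is a back-of-envelope check in Python; the parameter counts are the approximate values from the table above, not exact weights:

```python
# Rough sparsity check using the approximate figures quoted in the table.
total_params, active_params = 745e9, 44e9      # ~745B total, ~44B active per token
experts_total, experts_active = 256, 8         # MoE expert counts

print(f"parameter sparsity: {active_params / total_params:.1%}")    # ~5.9%
print(f"expert sparsity:    {experts_active / experts_total:.1%}")  # ~3.1%
```

The per-token parameter share (~5.9%) is higher than the expert activation ratio (8/256 ≈ 3.1%) because attention layers, embeddings, and routing are always active regardless of which experts fire.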
Release timeline
| Date | Release | What changed |
|---|---|---|
| 2019 | Zhipu founded | Spun out of Tsinghua University. |
| 2024 | GLM-4 series | Open-source dense models; multilingual coverage. |
| 2025-07 | GLM-4.5 + GLM-4.5-Air | First MoE-class flagship (355B/32B) + compact tier (106B/12B). |
| 2026-02-11 | GLM-5 | 745B MoE, 200K context, 28.5T pre-training data. |
| 2026-03 (late) | GLM-5.1 (subscription) | Subscriber preview of GLM-5.1. |
| 2026-04-08 | GLM-5.1 (open-source) | MIT-licensed open weights on Hugging Face. |
Pricing
GLM-5.1 hosted pricing on Z.AI is pay-as-you-go; rates vary across the official API, Cerebras-hosted variants (faster, slightly different pricing), and third-party hosts. Verify current rates in the Z.AI console. Self-hosting eliminates per-token cost entirely; you pay only for compute.
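If you are weighing hosted against self-hosted, a toy break-even sketch helps frame the decision. Every number below is a placeholder, not a quoted Z.AI or cloud rate:

```python
# Hypothetical break-even sketch: hosted per-token billing vs. renting GPUs.
# All rates are placeholders; substitute real numbers from the Z.AI console
# and your GPU provider before drawing conclusions.
def monthly_hosted_cost(tokens_per_month: float, usd_per_mtok: float) -> float:
    return tokens_per_month / 1e6 * usd_per_mtok

def monthly_selfhost_cost(gpus: int, usd_per_gpu_hour: float) -> float:
    return gpus * usd_per_gpu_hour * 24 * 30

hosted = monthly_hosted_cost(tokens_per_month=5e9, usd_per_mtok=1.50)  # placeholder
selfhost = monthly_selfhost_cost(gpus=8, usd_per_gpu_hour=2.00)        # placeholder
print(f"hosted ${hosted:,.0f}/mo vs self-host ${selfhost:,.0f}/mo")
```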
Open-weights
GLM-5.1 weights live at huggingface.co/zai-org under MIT. Practical paths:
- vLLM / SGLang — production GPU serving with DSA support.
- Cerebras — Z.AI partners with Cerebras to host GLM on wafer-scale silicon for very fast inference (sometimes branded "GLM 4.7" on intermediary platforms).
- llama.cpp / quantized — community quantizations exist; viable on smaller hardware (see the sketch after this list).
- Vertex / Bedrock / Azure — not first-party hosted; check third-party providers.
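For the llama.cpp path, a minimal sketch using llama-cpp-python; the GGUF file name below is hypothetical, so substitute whatever quantization the community actually publishes:

```python
# Minimal local run of a community GGUF quantization via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="glm-5.1-q4_k_m.gguf",  # hypothetical quantized file name
    n_ctx=32_768,                      # smaller window than the hosted 200K
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the MIT license in one line."}]
)
print(out["choices"][0]["message"]["content"])
```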
chat.z.ai — setup
- Visit chat.z.ai and sign in.
- Default model is the latest open GLM tier; subscription unlocks the higher tiers.
- Toggle web search and file upload as needed.
Optimal prompts for chat.z.ai
Agentic engineering task
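An illustrative prompt (not an official Z.AI template):

```
You are a senior engineer. Refactor the service below for production:
add input validation, structured logging, and tests. Work step by step:
plan first, then implement, then list the trade-offs you accepted.
```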
Long-context document QA
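An illustrative prompt (not an official Z.AI template); asking for cited locations keeps long-context answers checkable:

```
The full document is pasted below; it fits in the 200K window.
Answer strictly from the document and cite the section number for
every claim. Question: what are the termination conditions?

[full document pasted here]
```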
Account & keys
- Visit docs.z.ai; sign in to the API console.
- Add payment; pay-as-you-go in USD.
- Generate an API key; store it in environment variables only, never in code.
First API call (OpenAI-compatible)
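A minimal sketch using the official `openai` Python SDK, assuming an OpenAI-compatible endpoint at `https://api.z.ai/api/paas/v4/` and the `glm-5.1` model id from the lineup table; verify both in the docs.z.ai console:

```python
# Minimal OpenAI-compatible call; endpoint and model id are assumptions
# from this guide, so confirm them in the docs.z.ai console.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["ZAI_API_KEY"],          # env vars only, never hard-coded
    base_url="https://api.z.ai/api/paas/v4/",   # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="glm-5.1",
    messages=[
        {"role": "system", "content": "You are a precise coding assistant."},
        {"role": "user", "content": "Deduplicate a list in Python, preserving order."},
    ],
)
print(response.choices[0].message.content)
```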
Self-host (open weights)
Pull from huggingface.co/zai-org. vLLM and SGLang both support GLM-5 family with DSA. For wafer-scale latency, partner with Cerebras (their hosted GLM variant runs noticeably faster than commodity GPU serving).
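A minimal serving sketch with vLLM, assuming the weights are published as `zai-org/GLM-5.1` on Hugging Face; confirm the exact repo id and size the node to your hardware:

```python
# Offline batch inference with vLLM; the repo id is an assumption.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-5.1",   # assumed Hugging Face repo id
    tensor_parallel_size=8,    # an MoE this size needs a multi-GPU node
)
params = SamplingParams(temperature=0.6, max_tokens=512)

outputs = llm.generate(["Explain DeepSeek Sparse Attention in two sentences."], params)
print(outputs[0].outputs[0].text)
```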
Use-case library
Vibe coding → production engineering
Bilingual technical writing (EN + ZH)
Patterns
"Self-host first, iterate later"
If on-prem GPU access is feasible, prototype on the open MIT weights. You won't run into surprise rate limits or pricing changes mid-build. Move to hosted Z.AI / Cerebras only when load economics justify the switch.
"DSA narrows the gap"
DeepSeek Sparse Attention keeps GLM-5.1's long-context quality closer to lossless than older sparse-attention schemes. Don't chunk inputs reflexively; try the full window first.
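As a rough pre-flight check before falling back to chunking, a sketch using a crude 4-characters-per-token heuristic (an assumption, not Z.AI's tokenizer):

```python
# Decide whether a document fits the 200K window before chunking it.
def fits_in_context(text: str, limit_tokens: int = 200_000) -> bool:
    approx_tokens = len(text) // 4             # crude ~4 chars/token heuristic
    return approx_tokens < limit_tokens * 0.9  # leave headroom for the answer

doc = open("report.txt").read()                # hypothetical input file
if fits_in_context(doc):
    prompt = f"Read the document below, then answer the question.\n\n{doc}"
else:
    ...  # only now fall back to chunked retrieval
```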