Z.AI User's Manual
As of … A practical guide to GLM-5.1, Zhipu AI / Z.AI's open-source flagship released April 8, 2026 under the MIT license: a 745B-parameter MoE with 44B active, 200K context, and DeepSeek Sparse Attention. The first frontier model from China's first publicly traded AI company.
Z.AI (formerly Zhipu AI) is a major Beijing-based AI lab and, notably, China's first publicly traded AI company. Its GLM family is one of the world's most active open-source frontier lineups: GLM-4.5 (July 2025), GLM-5 (Feb 2026), and GLM-5.1 (April 2026, open-source).
The headline: GLM-5.1 is a 745B-parameter MoE with 44B active per token, 200K context, and MIT-licensed open weights. It also shows up on attap.ai in two flavors: Z.AI's hosted version, and a faster variant (branded "GLM 4.7") running on Cerebras's wafer-scale silicon.
Getting started in 60 seconds
- Pick your door: chat.z.ai for chat, docs.z.ai for the API, huggingface.co/zai-org for open weights.
- Sign in with email; the API has its own console.
- Pick the model: `glm-5.1` for the open flagship, `glm-5` for the earlier original, `glm-4.5-air` for the cheaper compact tier.
- Lean on the open-source path. Z.AI's most distinctive bet is permissive open licensing: if you have GPUs, self-hosting costs less than hosted at scale.
Which Z.AI surface should I use?
| Surface | Role | Highlights |
|---|---|---|
| chat.z.ai | Free consumer chat | Free tier with rate limits; English + Chinese first-class; web search, file upload. |
| Z.AI API (docs.z.ai) | Developer API | OpenAI-compatible chat completions; pay-as-you-go; function calling, JSON mode. |
| Open weights / partners (HF, Cerebras, attap.ai) | Self-hosting and third-party hosts | MIT license, fully permissive; Cerebras-hosted variant runs faster on wafer-scale silicon; OpenRouter / DeepInfra / Together offer hosted variants; self-host for compliance / data residency. |
Prompt fundamentals (GLM edition)
- Use the 200K context deliberately. It is smaller than the 1M+ tier but big enough for the vast majority of real workloads; DeepSeek Sparse Attention keeps long-text performance high.
- Treat GLM-5.1 as a coding/agentic peer to V4 Pro. Z.AI explicitly positions GLM-5 as going "from vibe coding to agentic engineering."
- Open weights are the distinctive value. If you can self-host, GLM gives you frontier-class capability with no per-token bill.
Z.AI's lineup has four active members: GLM-5.1 (open, April 2026, the flagship), GLM-5 (Feb 2026, the original), GLM-4.5 (July 2025, MoE), and GLM-4.5-Air (compact tier, 106B/12B). All ship MIT-licensed open weights.
Current GLM lineup
| Model | Total / active params | Context | Released | Notes |
|---|---|---|---|---|
| GLM-5.1 flagship · open | ~745B / ~44B (MoE, 256 experts, 8 active) | 200K | 2026-04-08 (open-source release; subscription late March) | Open-source (MIT). DeepSeek Sparse Attention. "Vibe coding to agentic engineering." |
| GLM-5 | ~745B / ~40-44B (MoE) | 200K | 2026-02-11 | Original GLM-5; 28.5T-token pre-training corpus. |
| GLM-4.5 | 355B / 32B (MoE) | 128K | 2025-07 | Previous flagship; agentic-tuned. |
| GLM-4.5-Air | 106B / 12B (MoE) | 128K | 2025-07 | Compact tier; cheaper inference. |
GLM-5.1 — deep dive
| Area | What GLM-5.1 does |
|---|---|
| Architecture | ~745B total / 44B active via MoE (256 experts, 8 activated per token, ~5.9% sparsity). |
| Pre-training data | 28.5 trillion tokens (up from 23T on GLM-4.5). |
| Context | 200,000 input tokens. |
| Attention | DeepSeek Sparse Attention integrated for the first time; keeps long-text quality near-lossless while reducing serving cost. |
| License | MIT — most permissive available. Commercial use, fine-tuning, redistribution all allowed. |
| Positioning | "From vibe coding to agentic engineering." Strong on agentic + reasoning + Chinese. |
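To make the sparsity figures concrete, here is a back-of-envelope check in Python; the parameter counts are the approximate values from the table above, not exact weights:

```python
# Rough sparsity check using the approximate figures quoted in the table.
total_params, active_params = 745e9, 44e9      # ~745B total, ~44B active per token
experts_total, experts_active = 256, 8         # MoE expert counts

print(f"parameter sparsity: {active_params / total_params:.1%}")    # ~5.9%
print(f"expert sparsity:    {experts_active / experts_total:.1%}")  # ~3.1%
```

The per-token parameter share (~5.9%) is higher than the expert activation ratio (8/256 ≈ 3.1%) because attention layers, embeddings, and routing are always active regardless of which experts fire.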
Release timeline
| Date | Release | What changed |
|---|---|---|
| 2019 | Zhipu founded | Spun out of Tsinghua University. |
| 2024 | GLM-4 series | Open-source dense models; multilingual coverage. |
| 2025-07 | GLM-4.5 + GLM-4.5-Air | First MoE-class flagship (355B/32B) + compact tier (106B/12B). |
| 2026-02-11 | GLM-5 | 745B MoE, 200K context, 28.5T pre-training data. |
| 2026-03 (late) | GLM-5.1 (subscription) | Subscriber preview of GLM-5.1. |
| 2026-04-08 | GLM-5.1 (open-source) | MIT-licensed open weights on Hugging Face. |
Pricing
GLM-5.1 hosted pricing on Z.AI is pay-as-you-go; rates vary across the official API, Cerebras-hosted variants (faster, slightly different pricing), and third-party hosts. Verify current rates in the Z.AI console. Self-hosting eliminates per-token cost entirely; you pay only for compute.
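If you are weighing hosted against self-hosted, a toy break-even sketch helps frame the decision. Every number below is a placeholder, not a quoted Z.AI or cloud rate:

```python
# Hypothetical break-even sketch: hosted per-token billing vs. renting GPUs.
# All rates are placeholders; substitute real numbers from the Z.AI console
# and your GPU provider before drawing conclusions.
def monthly_hosted_cost(tokens_per_month: float, usd_per_mtok: float) -> float:
    return tokens_per_month / 1e6 * usd_per_mtok

def monthly_selfhost_cost(gpus: int, usd_per_gpu_hour: float) -> float:
    return gpus * usd_per_gpu_hour * 24 * 30

hosted = monthly_hosted_cost(tokens_per_month=5e9, usd_per_mtok=1.50)  # placeholder
selfhost = monthly_selfhost_cost(gpus=8, usd_per_gpu_hour=2.00)        # placeholder
print(f"hosted ${hosted:,.0f}/mo vs self-host ${selfhost:,.0f}/mo")
```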
Open-weights
GLM-5.1 weights live at huggingface.co/zai-org under MIT. Practical paths:
- vLLM / SGLang — production GPU serving with DSA support.
- Cerebras — Z.AI partners with Cerebras to host GLM on wafer-scale silicon for very fast inference (sometimes branded "GLM 4.7" on intermediary platforms).
- llama.cpp / quantized — community quantizations exist; viable on smaller hardware (see the sketch after this list).
- Vertex / Bedrock / Azure — not first-party hosted; check third-party providers.
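For the llama.cpp path, a minimal sketch using llama-cpp-python; the GGUF file name below is hypothetical, so substitute whatever quantization the community actually publishes:

```python
# Minimal local run of a community GGUF quantization via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="glm-5.1-q4_k_m.gguf",  # hypothetical quantized file name
    n_ctx=32_768,                      # smaller window than the hosted 200K
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the MIT license in one line."}]
)
print(out["choices"][0]["message"]["content"])
```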
chat.z.ai — setup
- Visit chat.z.ai and sign in.
- Default model is the latest open GLM tier; subscription unlocks the higher tiers.
- Toggle web search and file upload as needed.
Optimal prompts for chat.z.ai
Agentic engineering task
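An illustrative prompt (not an official Z.AI template):

```
You are a senior engineer. Refactor the service below for production:
add input validation, structured logging, and tests. Work step by step:
plan first, then implement, then list the trade-offs you accepted.
```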
Long-context document QA
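An illustrative prompt (not an official Z.AI template); asking for cited locations keeps long-context answers checkable:

```
The full document is pasted below; it fits in the 200K window.
Answer strictly from the document and cite the section number for
every claim. Question: what are the termination conditions?

[full document pasted here]
```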
Account & keys
- Visit docs.z.ai; sign in to the API console.
- Add payment; pay-as-you-go in USD.
- Generate an API key; store it in environment variables only, never in code.
First API call (OpenAI-compatible)
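A minimal sketch using the official `openai` Python SDK, assuming an OpenAI-compatible endpoint at `https://api.z.ai/api/paas/v4/` and the `glm-5.1` model id from the lineup table; verify both in the docs.z.ai console:

```python
# Minimal OpenAI-compatible call; endpoint and model id are assumptions
# from this guide, so confirm them in the docs.z.ai console.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["ZAI_API_KEY"],          # env vars only, never hard-coded
    base_url="https://api.z.ai/api/paas/v4/",   # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="glm-5.1",
    messages=[
        {"role": "system", "content": "You are a precise coding assistant."},
        {"role": "user", "content": "Deduplicate a list in Python, preserving order."},
    ],
)
print(response.choices[0].message.content)
```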
Self-host (open weights)
Pull from huggingface.co/zai-org. vLLM and SGLang both support GLM-5 family with DSA. For wafer-scale latency, partner with Cerebras (their hosted GLM variant runs noticeably faster than commodity GPU serving).
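A minimal serving sketch with vLLM, assuming the weights are published as `zai-org/GLM-5.1` on Hugging Face; confirm the exact repo id and size the node to your hardware:

```python
# Offline batch inference with vLLM; the repo id is an assumption.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-5.1",   # assumed Hugging Face repo id
    tensor_parallel_size=8,    # an MoE this size needs a multi-GPU node
)
params = SamplingParams(temperature=0.6, max_tokens=512)

outputs = llm.generate(["Explain DeepSeek Sparse Attention in two sentences."], params)
print(outputs[0].outputs[0].text)
```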
Use-case library
Vibe coding → production engineering
Bilingual technical writing (EN + ZH)
Patterns
"Self-host first, iterate later"
If on-prem GPU access is feasible, prototype on the open MIT weights. You won't run into surprise rate limits or pricing changes mid-build. Move to hosted Z.AI / Cerebras only when load economics justify the switch.
"DSA narrows the gap"
DeepSeek Sparse Attention keeps GLM-5.1's long-context quality closer to lossless than older sparse-attention schemes. Don't chunk inputs reflexively; try the full window first.
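As a rough pre-flight check before falling back to chunking, a sketch using a crude 4-characters-per-token heuristic (an assumption, not Z.AI's tokenizer):

```python
# Decide whether a document fits the 200K window before chunking it.
def fits_in_context(text: str, limit_tokens: int = 200_000) -> bool:
    approx_tokens = len(text) // 4             # crude ~4 chars/token heuristic
    return approx_tokens < limit_tokens * 0.9  # leave headroom for the answer

doc = open("report.txt").read()                # hypothetical input file
if fits_in_context(doc):
    prompt = f"Read the document below, then answer the question.\n\n{doc}"
else:
    ...  # only now fall back to chunked retrieval
```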