Z.AI Users Manual

As of April 2026.

A practical guide to GLM-5.1 — Zhipu AI / Z.AI's open-source flagship released April 8, 2026 under MIT license. 745B-parameter MoE with 44B active, 200K context, DeepSeek Sparse Attention. The first frontier model from China's first publicly-traded AI company.

🎈 ELI5

Z.AI (formerly Zhipu AI) is a major Beijing-based AI lab — and notably China's first publicly-traded AI company. Their GLM family is one of the world's most active open-source frontier lineups: GLM-4.5 (July 2025), GLM-5 (Feb 2026), and GLM-5.1 (April 2026, open-source).

The headline: GLM-5.1 is a 745B-parameter MoE with 44B active per token, 200K context, MIT-licensed open weights. It also showed up on attap.ai in two flavors — Z.AI's hosted version, and a faster Cerebras-hosted version (GLM 4.7) running on Cerebras's wafer-scale silicon.

Getting started in 60 seconds

  1. Pick your door: chat.z.ai for chat, docs.z.ai for the API, huggingface.co/zai-org for open weights.
  2. Sign in with email; the API has its own console.
  3. Pick the model: glm-5.1 for the open flagship, glm-5 for the larger original, glm-4.5-air for the cheaper compact tier.
  4. Lean on the open-source path. Z.AI's most distinctive bet is permissive open licensing — if you have GPUs, self-host costs less than hosted at scale.

Which Z.AI surface should I use?

chat.z.ai

Free consumer chat

  • Free tier with rate limits
  • English + Chinese first-class
  • Web search, file upload

Z.AI API (docs.z.ai)

Developer API

  • OpenAI-compatible chat completions
  • Pay-as-you-go
  • Function calling, JSON mode
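
Since the API is OpenAI-compatible, function calling uses the standard OpenAI-style tool schema. A minimal sketch: the `get_weather` tool, its fields, and the commented call are illustrative assumptions, not taken from Z.AI's docs.

```python
# OpenAI-style tool definition that an OpenAI-compatible endpoint accepts.
# The tool itself (get_weather) is a made-up example.
def weather_tool_schema():
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }

# With the OpenAI SDK this would plug in as:
# client.chat.completions.create(model="glm-5.1", messages=[...],
#                                tools=[weather_tool_schema()])
```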

Open weights / partners

HF, Cerebras, attap.ai

  • MIT license — fully permissive
  • Cerebras-hosted variant runs faster on wafer-scale
  • OpenRouter, DeepInfra, and Together also offer hosted variants
  • Self-host for compliance / data residency

Prompt fundamentals (GLM edition)

  • Use the 200K context deliberately. Smaller than the 1M+ tier, big enough for the vast majority of real workloads. DeepSeek Sparse Attention keeps long-text performance high.
  • Treat GLM-5.1 as a coding/agentic peer to DeepSeek V4 Pro. Z.AI explicitly positions GLM-5 as "from vibe coding to agentic engineering."
  • Open-weights is the unique value. If you can self-host, GLM gives you a frontier-class capability with no per-token bill.
Why "publicly-traded AI" matters

Z.AI (Zhipu) was the first major AI lab in China to go public. That changes incentives — quarterly disclosures, more transparency on benchmarks and costs, and clearer documentation than typical Chinese AI shops. Practical effect: GLM ships with more upfront docs than most peers.

🎈 ELI5

Z.AI's lineup has 4 active members: GLM-5.1 (open, April 2026, the flagship), GLM-5 (Feb 2026, the larger ancestor), GLM-4.5 (July 2025, MoE), and GLM-4.5-Air (compact tier, 106B/12B). All MIT-licensed open weights.

Current GLM lineup

| Model | Total / active params | Context | Released | Notes |
|---|---|---|---|---|
| GLM-5.1 (flagship, open) | ~745B / ~44B (MoE, 256 experts, 8 active) | 200K | 2026-04-08 (open-source release; subscription late March) | Open-source (MIT). DeepSeek Sparse Attention. "Vibe coding to agentic engineering." |
| GLM-5 | ~745B / ~40-44B (MoE) | 200K | 2026-02-11 | Original GLM-5; 28.5T-token pre-training data. |
| GLM-4.5 | 355B / 32B (MoE) | Long-context | 2025-07 | Previous flagship; agentic-tuned. |
| GLM-4.5-Air | 106B / 12B (MoE) | Long-context | 2025-07 | Compact tier; cheaper inference. |

GLM-5.1 — deep dive

| Area | What GLM-5.1 does |
|---|---|
| Architecture | ~745B total / 44B active via MoE (256 experts, 8 activated per token, ~5.9% sparsity). |
| Pre-training data | 28.5 trillion tokens (up from 23T on GLM-4.5). |
| Context | 200,000 input tokens. |
| Attention | DeepSeek Sparse Attention integrated for the first time — keeps long-text quality lossless while reducing serving cost. |
| License | MIT — most permissive available. Commercial use, fine-tuning, redistribution all allowed. |
| Positioning | "From vibe coding to agentic engineering." Strong on agentic + reasoning + Chinese. |
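A quick sanity check on the sparsity numbers above, assuming the ~745B/44B figures: the quoted ~5.9% is active parameters over total parameters, while the expert routing ratio (8 of 256) alone is lower, since attention and other shared parameters run on every token regardless of routing.

```python
# Back-of-envelope MoE sparsity check for the figures quoted in the table.
total_params = 745e9     # ~745B total
active_params = 44e9     # ~44B active per token
experts_total = 256
experts_active = 8

param_fraction = active_params / total_params     # fraction of weights used per token
expert_fraction = experts_active / experts_total  # fraction of experts routed per token

print(f"active params:  {param_fraction:.1%}")   # → active params:  5.9%
print(f"routed experts: {expert_fraction:.1%}")  # → routed experts: 3.1%
```

The gap between the two percentages is why "active params" is the more honest sparsity number for serving-cost estimates.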
GLM-5.1 vs DeepSeek V4 Pro vs Qwen 3.6 Max

Three Chinese frontier candidates as of mid-2026. DeepSeek V4 Pro wins on raw MoE scale (1.6T total / 49B active) and "agentic-coding open SOTA." GLM-5.1 wins on smaller-but-still-frontier cost-efficiency and DSA. Qwen 3.6 Max is proprietary; the open Qwen tier sits a step below in single-model capability but wins on multimodal breadth. Run your own evals.

Release timeline

| Date | Release | What changed |
|---|---|---|
| 2019 | Zhipu founded | Spun out of Tsinghua University. |
| 2024 | GLM-4 series | Open-source dense models; multilingual coverage. |
| 2025-07 | GLM-4.5 + GLM-4.5-Air | First MoE-class flagship (355B/32B) + compact tier (106B/12B). |
| 2026-02-11 | GLM-5 | 745B MoE, 200K context, 28.5T pre-training data. |
| 2026-03 (late) | GLM-5.1 (subscription) | Subscriber preview of GLM-5.1. |
| 2026-04-08 | GLM-5.1 (open-source) | MIT-licensed open weights on Hugging Face. |

Pricing

GLM-5.1 hosted pricing on Z.AI is pay-as-you-go; rates vary across the official API, Cerebras-hosted variants (faster, slightly different pricing), and third-party hosts. Verify in the Z.AI console. Self-host eliminates per-token cost entirely — pay only for compute.

Open-weights

GLM-5.1 weights live at huggingface.co/zai-org under MIT. Practical paths:

  • vLLM / SGLang — production GPU serving with DSA support.
  • Cerebras — Z.AI partners with Cerebras to host GLM on wafer-scale silicon for very fast inference (sometimes branded "GLM 4.7" on intermediary platforms).
  • llama.cpp / quantized — community quantizations exist; viable on smaller hardware.
  • Vertex / Bedrock / Azure — not first-party hosted; check third-party providers.

chat.z.ai — setup

  1. Visit chat.z.ai and sign in.
  2. Default model is the latest open GLM tier; subscription unlocks the higher tiers.
  3. Toggle web search and file upload as needed.

Optimal prompts for chat.z.ai

Agentic engineering task
```
You are running in an agentic engineering loop. The task: [describe].

Process: PLAN → EXECUTE → VERIFY.
- PLAN: read relevant code, restate the task, list ordered steps with risks.
- EXECUTE: walk through each step. Show diffs for any code change.
- VERIFY: re-read your own changes; confirm the plan; surface anything I should test.

Be specific about file paths and exact code. Don't add features I didn't ask for.
```
Long-context document QA
```
The document below is your only source. Cite section numbers for every claim.
If the document doesn't answer, say so explicitly.

Document:
"""
[paste]
"""

Question: [question]
```

Account & keys

  1. Visit docs.z.ai; sign in to the API console.
  2. Add payment; pay-as-you-go in USD.
  3. Generate API key. Env vars only.

First API call (OpenAI-compatible)

Python — OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_ZAI_KEY",
    base_url="https://api.z.ai/v1",
)

resp = client.chat.completions.create(
    model="glm-5.1",
    messages=[
        {"role": "system", "content": "You are a senior agentic engineer."},
        {"role": "user", "content": "Walk me through the trade-offs of MoE sparsity at 5.9% vs 3%."},
    ],
)
print(resp.choices[0].message.content)
```

Self-host (open weights)

Pull from huggingface.co/zai-org. vLLM and SGLang both support GLM-5 family with DSA. For wafer-scale latency, partner with Cerebras (their hosted GLM variant runs noticeably faster than commodity GPU serving).
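
A minimal vLLM serving sketch, assuming a `zai-org/GLM-5.1` repo id and 8-GPU tensor parallelism; the exact repo name, GPU count, and any DSA-specific flags should be taken from the zai-org model card, not from here.

```shell
# Hypothetical self-host sketch — verify repo id and flags against the model card.
pip install vllm

vllm serve zai-org/GLM-5.1 \
  --tensor-parallel-size 8 \
  --max-model-len 200000
# Exposes an OpenAI-compatible endpoint at http://localhost:8000/v1
```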

Use-case library

Vibe coding → production engineering
```
I have a working prototype that started as vibe-coded glue. Help me harden it
for production. Steps:
1. Audit — read the code; list the top 8 risks (security, perf, correctness, ops).
2. Stabilize — fix what's flatly broken; minimum diff.
3. Test — propose tests for each risk you found.
4. Operationalize — what monitoring / logging / rollback hooks are missing?

Don't rewrite from scratch. Preserve working behavior.
```
Bilingual technical writing (EN + ZH)
```
Write the technical spec below in BOTH English and Mandarin Chinese.

Constraints:
- Keep code snippets and identifiers identical across versions.
- Localize idioms, not just words.
- Use a formal written register in Chinese (书面语).
- Match section structure 1:1 so they can sit side-by-side.

Spec to translate:
"""
[paste]
"""
```

Patterns

"Self-host first, iterate later"

If on-prem GPU access is feasible, prototype on the open MIT weights. You won't run into surprise rate limits or pricing changes mid-build. Move to hosted Z.AI / Cerebras only when load economics justify the switch.

"DSA narrows the gap"

DeepSeek Sparse Attention lets GLM-5.1 give you long-context performance closer to lossless than older sparse-attention schemes. Don't chunk inputs reflexively — try the full window first.
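
Before chunking, it is worth a rough pre-check of whether the input fits the 200K window at all. The characters-per-token ratio below is a coarse heuristic of my own, not Z.AI's tokenizer; English prose often lands near ~4 chars/token, and Chinese text considerably lower.

```python
# Crude pre-check: does a document plausibly fit the 200K context window?
CONTEXT_WINDOW = 200_000

def rough_token_count(text: str) -> int:
    # ~4 characters per token is a rough heuristic for English prose;
    # Chinese text runs closer to 1-2 characters per token, so this
    # UNDERCOUNTS for Chinese-heavy input.
    return max(1, len(text) // 4)

def fits_in_window(doc: str, reserve_for_output: int = 4_000) -> bool:
    # Leave headroom for the system prompt, question, and the model's answer.
    return rough_token_count(doc) + reserve_for_output <= CONTEXT_WINDOW
```

If `fits_in_window` says yes, try the full document in one shot before reaching for a chunking pipeline.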