Alibaba Users Manual

As of 2026-05-05

A practical guide to Alibaba's AI lineup — Qwen 3.6 Max for agentic + reasoning + coding, Wan 2.7 for video, HappyHorse 1.0 for image-to-video, Qwen Image for stills, Qwen Omni for full-modality work, and the open-weights Qwen family for self-hosting.

🎈 ELI5

Alibaba is the parent company behind Qwen — China's most active frontier-AI brand, and one of the most prolific open-weights labs in the world. The Qwen family covers everything: text + reasoning (Qwen 3.6 Max, Plus, and open variants), image generation (Qwen Image), video generation (Wan 2.7, HappyHorse 1.0), and full multimodal (Qwen Omni for text+audio+image+video).

The April–May 2026 wave is the most aggressive push yet: Qwen 3.6 Max with 1M+ context for complex agentic workflows, Qwen 3.6 Plus for cost-efficient speed, Wan 2.7 for video, and HappyHorse 1.0 as a top-ranked image-to-video model.

Getting started in 60 seconds

  1. Pick your door: chat.qwen.ai for the consumer chat app, Alibaba Cloud Model Studio for the developer console, huggingface.co/Qwen for open weights.
  2. Sign in with an Alibaba Cloud account (developer) or Qwen account (consumer chat). The chat app is free with limits; the API is pay-as-you-go on Model Studio.
  3. Pick a model: qwen3.6-max for hardest tasks and agentic work, qwen3.6-plus for fast/cheap, qwen3.5-omni for audio + multimodal. For video: wan2.7 or happyhorse-1. For image gen: qwen-image.
  4. Tell Qwen what good looks like. Goal, audience, format, constraints. Qwen 3.6 Max responds especially well to structured agentic prompts — explicit plan-then-execute patterns.

Which Alibaba surface should I use?

chat.qwen.ai

Free consumer chat

  • Free with rate limits
  • Default model is the latest Qwen flagship
  • Web search, deep thinking, file upload
  • Image / video gen tools surfaced

Model Studio (API)

Alibaba Cloud developer console

  • OpenAI-compatible chat completions
  • DashScope SDK (native)
  • All Qwen + Wan + HappyHorse + Image models
  • SG / CN / international regions

Self-host (open weights)

Hugging Face / Ollama / vLLM

  • Qwen 3.6 (open variants), Qwen 3.5 series, Qwen 3.5 Omni
  • Run on your own GPUs / clouds
  • One of the deepest open-source families anywhere
  • Note: Qwen 3.6 Max is proprietary, not open-weighted

Prompt fundamentals (Qwen edition)

Three things that matter most when prompting Qwen models:

  • Lean into the agentic frame. Qwen 3.6 Max is explicitly tuned for autonomous agent workflows — app development, visual browsing, multi-step planning. Prompts that frame the task as "you are an agent that does X, here are your tools, here's the goal" outperform raw chat-style prompts on complex tasks.
  • Use the long context. 1M+ tokens means you can fit entire codebases, multi-document research bundles, or long visual sequences. Don't artificially chunk what you don't need to.
  • Pick the right tier. Qwen 3.6 Max for hard reasoning / coding / agentic / visual reasoning. Qwen 3.6 Plus for high-volume, latency-sensitive work where Max-grade quality isn't required. Qwen 3.5 Omni when you genuinely need audio + visual + text in one model.
The "60% cheaper, 8× faster" claim Qwen 3.5 (Feb 2026) shipped with a stated 60% price reduction and 8× speed-up over the prior generation. Qwen 3.6 Plus continues that trajectory — Alibaba's positioning is explicitly that "smaller models match larger predecessors." Expect aggressive cost-quality tradeoffs to keep moving in this direction.

Verify capabilities and pricing against the Alibaba Cloud Model Studio docs. Pricing and feature flags shift faster than docs update — confirm before billing-sensitive decisions.

🎈 ELI5

Alibaba's text/code/reasoning lineup is the Qwen family. Right now (May 2026) there are two flagship proprietary models: Qwen 3.6 Max (smartest, biggest context, agentic-tuned) and Qwen 3.6 Plus (fast and cheap). Underneath, there's a deep open-weights family — Qwen 3.6 (35B / 27B), Qwen 3.5 (up to 397B), and Qwen 3.5 Omni for full multimodal — all on Hugging Face.

If you only remember one thing: Qwen 3.6 Max is the agentic flagship. It's designed to act as an autonomous AI agent — run app-development tasks, browse visually, chain reasoning across million-token contexts.

The current Qwen lineup

As of 2026-05-05, Qwen 3.6 Max is Alibaba's flagship for complex agentic workflows. The lineup splits into proprietary (Max, Plus) and open-weights (3.6 / 3.5 / Omni) tiers, plus dedicated multimodal models (Wan, HappyHorse, Image).

Latest releases

  • Qwen 3.6 Max (April/May 2026) — flagship proprietary model with 1M+ token context, designed for complex agent workflows, high-level coding, and visual reasoning.
  • Qwen 3.6 Plus (April 2026) — high-performance, optimized for speed and cost.
  • Qwen 3.5 series (Feb 2026) — both open-source and API, native multimodal (text + image + video), 60% cheaper / 8× faster than predecessors.

See the Qwen 3.6 Max deep dive below.
About these dates

Dates pulled from Alibaba Cloud Model Studio release notes and the Hugging Face Qwen collection. Model IDs shown are typical conventions; the exact deployment IDs in Model Studio may include region or version suffixes — confirm in the console before pinning.

Qwen text/reasoning lineup

Model | Type | Released | Best for | Context
Qwen 3.6 Max (flagship) | Proprietary | April/May 2026 | Complex agent workflows, high-level coding, visual reasoning. Top of the lineup for hardest tasks. | 1M+ tokens
Qwen 3.6 Plus (cheap + fast) | Proprietary | April 2026 | High-performance, optimized for speed and cost-efficiency. Default for high-volume production. | Long context
Qwen 3.6 (open) | Open weights | April 2026 | Self-host. 35B-A3B (MoE) and 27B variants — image-text-to-text capable. | Long context
Qwen 3.5 Omni (open, multimodal) | Open weights | Feb 2026 | Native text + audio + image + video in one model — the most multimodal open model in the family. | Multimodal
Qwen 3.5 (397B-A17B, open) | Open weights | Feb 2026 | Largest open Qwen 3.5 — 397B total / 17B active MoE. 60% cheaper / 8× faster than 3.0-era predecessors. | Long context

Qwen 3.6 Max — deep dive

This is the model the rest of the manual is centered on. Qwen 3.6 Max is Alibaba's flagship as of April/May 2026, designed end-to-end for autonomous agent workflows.

Area | What Qwen 3.6 Max does
Context window | 1M+ tokens — fits an entire codebase, a long research bundle, or a multi-document agentic task in a single context.
Agentic focus (flagship capability) | Designed to act as an autonomous AI agent — explicitly tuned for app development and visual browsing tasks. This is the headline capability and what differentiates Max from Plus.
High-level coding | Strong on real engineering tasks: refactors, multi-file edits, debugging, planning. Pairs with the agentic frame for end-to-end coding workflows.
Visual reasoning | Built-in vision understanding — read screenshots, diagrams, dashboards, document layouts. Combined with the agentic frame, supports visual-browsing-style tasks.
Proprietary | Closed weights — accessed via the Model Studio API only. (Open-weights options sit at the 3.6-open / 3.5 tiers.)
Pricing | Pay-as-you-go on Alibaba Cloud Model Studio. Pricing varies by region (China, Singapore, international) — verify in the Model Studio console before locking in capacity.
The agentic angle is the main thing

Qwen 3.6 Max is explicitly positioned as an agent model, not just a chat model. If you're using it for plain Q&A or simple summarization, you're under-using it (and over-paying vs Plus). Pull it out for tasks where the model needs to plan multi-step work, choose tools, navigate visually, or operate over very long contexts.

Qwen 3.6 Max vs the competition (positioning)

Model | Open? | Context | Headline strength
Qwen 3.6 Max | Proprietary | 1M+ | Agentic workflows, visual reasoning, app dev
Claude Opus 4.7 | Proprietary | 200K | Ambiguous reasoning, coding agent ecosystem
GPT-5.5 | Proprietary | ~272K-1M | Long-running goal completion, breadth
Gemini 3.1 Pro | Proprietary | 2M | Largest context, abstract reasoning, multimodal
Grok 4.3 | Proprietary | 1M | Cheapest hosted frontier, real-time X
DeepSeek V4 Pro | Open (MIT) | 1M | Open-weights agentic-coding SOTA
Verify pricing and benchmarks

Public benchmark data for the proprietary 3.6 Max is thinner at launch than for US frontier models — most claims come from Alibaba's own release notes. Run your own evals on your own data before locking in production routing.

Release timeline (chronological)

Date | Release | What changed
2023-08 | Qwen-7B / Qwen-14B | First public Qwen models. Open-weighted from day one.
2024-09 | Qwen 2.5 | Major capability lift; large open-weights coverage; coding-specialized variants.
2025-04 | Qwen 3.0 | Multimodal-native (text + image), longer context.
2026-02 | Qwen 3.5 series | Open-source + API. 60% cheaper / 8× faster than 3.0. Native multimodal (text/image/video). Includes Qwen 3.5 Omni for full audio + multimodal.
2026-04 | Qwen 3.6 Plus | High-performance proprietary tier optimized for speed and cost.
2026-04 | Wan 2.7 | New video-gen model added to the Model Studio lineup.
2026-04 | HappyHorse 1.0 | Top-ranked image-to-video model — high-fidelity, realistic dynamic rendering.
2026-04/05 | Qwen 3.6 Max | Current flagship. 1M+ context. Agentic focus. High-level coding + visual reasoning.

How to pick a Qwen model

Pick Qwen 3.6 Max when…

  • The task is agentic — multi-step planning, tool use, visual browsing.
  • App-development workflows where the model drives the work.
  • You need million-token context for codebase-wide reasoning.
  • Visual reasoning — screenshots, diagrams, dashboards as input.
  • Hard coding / refactor tasks across multiple files.

Pick Qwen 3.6 Plus when…

  • High-volume production where cost dominates.
  • Latency-sensitive UX — chat, classification, summarization.
  • Quality requirements are moderate; you don't need agentic depth.
  • You want Qwen-family consistency across hot and cold paths.

Pick Qwen 3.5 Omni when…

  • You need text + audio + image + video in one model.
  • Open weights matter — self-host or fine-tune.
  • Audio is part of the input or output (the only Qwen tier with first-party audio).
  • Building multimodal agents that span sensory channels.

Pick open Qwen 3.6 / 3.5 when…

  • Compliance / data residency — must run on your own infra.
  • Fine-tuning for a domain or task.
  • Cost floor below any hosted API.
  • Edge deployment (smaller variants).
Match the model to the task

Don't default to Max for everything — it's tuned for agentic depth and is overkill for triage, classification, and simple rewrites. For high-volume, low-complexity work, Plus or open Qwen 3.6 wins on cost and speed. Reserve Max for tasks that genuinely benefit from agentic planning, visual reasoning, or 1M+ context.

Pricing

Alibaba Cloud Model Studio is pay-as-you-go in USD or CNY depending on the region (China, Singapore, international). Headline characteristics of the Qwen pricing structure:

  • Qwen 3.5 Series shipped with a 60% price cut over the prior generation, and Qwen 3.6 continues that trajectory.
  • Qwen 3.6 Plus is positioned as the cost-efficient tier — the explicit "match larger predecessors at smaller cost" model.
  • Open-weights variants (Qwen 3.6 / 3.5) have zero per-token cost when self-hosted; pay only for compute.
Confirm rates in-console

Pricing on Alibaba Cloud is region-dependent and updates frequently. Always verify the current per-1M-token rate for your region (China, Singapore, international) in the Model Studio console before pinning capacity assumptions.

Open-weights variants

Alibaba is one of the most active open-weights labs in the world. The current open Qwen family at huggingface.co/Qwen includes:

Series | Variants | Modalities
Qwen 3.6 | 35B-A3B (MoE), 27B, plus FP8 quantized | Image-text-to-text
Qwen 3.5 | 397B-A17B (largest open Qwen 3.5), Omni, plus SAE research variants | Text; multimodal (Omni: text + audio + image + video)
Qwen 3 | 30B-A3B, 8B base + sparse autoencoder research variants | Text
Qwen Image 2512 | Image generation from text | Text-to-image
Why this matters

DeepSeek V4 Pro is the most capable single open-weights model, but Alibaba ships the broadest open-weights family — text, multimodal, image gen, plus research variants. If "we want to standardize on one open vendor" is the question, Alibaba is uniquely positioned to be the answer for multi-modality teams.
🎈 ELI5

Alibaba doesn't just do text. They have four distinct multimodal models, each focused on a different media type: Wan 2.7 (text-to-video), HappyHorse 1.0 (image-to-video, top-ranked for fidelity), Qwen Image (text-to-image), and Qwen 3.5 Omni (audio + everything else).

This breadth is closer to Google's stack than to DeepSeek's text-only profile. If you need one vendor for text + image + video + audio, Alibaba is one of three real options (alongside Google and OpenAI).

All-modalities overview

🎬

Wan 2.7 — text-to-video

New video generation model added to Model Studio in April 2026. Native text-to-video pipeline; production-ready API.

🐎

HappyHorse 1.0 — image-to-video

Top-ranked image-to-video model focused on high-fidelity, realistic dynamic rendering. Released April 2026.

🎨

Qwen Image — text-to-image

Open-weights image-gen model on Hugging Face (Qwen Image 2512). Text-to-image generation; strong on Chinese-language prompts.

🔊

Qwen 3.5 Omni — full multimodal

Single open-weights model accepting text, audio, image, and video. The "everything" tier — supports cross-modal reasoning in one forward pass.

Wan 2.7 (video generation)

Wan is Alibaba's video-generation line. Wan 2.7 is the latest entry in Model Studio (April 2026).

  • Modalities: Text-to-video.
  • Surface: Available via Alibaba Cloud Model Studio.
  • Use cases: Marketing creative, product demos, education, social-media content.
  • Compared to: Veo 3 (Google) is more polished cinematically; Grok Imagine Video is cheaper. Wan 2.7 is the China-native option with first-party Alibaba Cloud distribution.

HappyHorse 1.0 (image-to-video)

Released April 2026. Distinct from Wan: HappyHorse takes an existing image as input and animates it.

  • Modality: Image-to-video — start from a still, get a video out.
  • Strength: High-fidelity, realistic dynamic rendering. Top-ranked among image-to-video models per Alibaba's release notes.
  • Use cases: Animating product photography, bringing static art to life, character animation.
  • Alternative use: Pair with Qwen Image (or any image gen) to go text → image → video in two steps.

Qwen Image (text-to-image)

Open-weights image-generation family on Hugging Face. The current generation is Qwen Image 2512.

  • Modality: Text-to-image.
  • Distribution: Open weights on Hugging Face; available via Model Studio API.
  • Strengths: Strong on Chinese-language prompts; competent on photorealism and stylized output.
  • Pair with: HappyHorse 1.0 to animate the output, or feed into Wan 2.7 for video continuation.

Qwen 3.5 Omni (audio + multimodal)

The "everything in one model" tier. Qwen 3.5 Omni is the only Qwen variant that natively handles audio input/output alongside the standard text + image + video.

  • Modalities: Text, audio, image, video — all in one open-weights model.
  • Distribution: Open weights on Hugging Face; demos available online and offline.
  • Use cases: Voice agents that can also see, multimodal accessibility, cross-modal search ("find the moment in this video where someone says X").
  • Tradeoff: Generalist — for hardest-tier text reasoning, Qwen 3.6 Max wins; Omni's edge is breadth across modalities.
Putting the multimodal stack together

A common pattern: use Qwen 3.6 Max as the orchestrator brain → invoke Qwen Image for stills → animate with HappyHorse or Wan 2.7 for the final video. For voice products, route audio through Qwen 3.5 Omni. The Alibaba stack is one of the few places you can keep the entire pipeline single-vendor.
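A sketch of that handoff in code. The helper names generate_still and animate_still are hypothetical placeholders, not real SDK calls; the actual image and video generation runs through DashScope's dedicated endpoints, and the planning step uses the OpenAI-compatible chat endpoint covered in the API section later in this manual.

import json

def make_marketing_clips(client, brief: str) -> list[str]:
    # 1. Qwen 3.6 Max plans: still prompts + motion prompts, returned as JSON.
    plan = client.chat.completions.create(
        model="qwen3.6-max",
        messages=[{
            "role": "user",
            "content": f"Plan a 15-second video for: {brief}. "
                       "Return JSON with 'still_prompts' and 'motion_prompts' arrays.",
        }],
        response_format={"type": "json_object"},
    )
    spec = json.loads(plan.choices[0].message.content)

    # 2-3. Qwen Image generates stills, HappyHorse animates them (hypothetical helpers).
    clips = []
    for still_prompt, motion_prompt in zip(spec["still_prompts"], spec["motion_prompts"]):
        frame = generate_still(still_prompt)                # hypothetical Qwen Image call
        clips.append(animate_still(frame, motion_prompt))   # hypothetical HappyHorse 1.0 call
    return clips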

Practical workflows

Three end-to-end pipelines you can copy. Each names which Qwen model handles which step.

Workflow A — Marketing video (15s) from a product brief

  1. Qwen 3.6 Max — plan

    Send the product brief and target audience. Ask for: 3 still-frame prompts, motion prompts for each, voiceover script with timestamps. (Use the "Marketing video pipeline" template under Use-case library.)

  2. Qwen Image — generate stills

    Run each still-frame prompt to produce 3 hero frames at high resolution.

  3. HappyHorse 1.0 — animate each still

    Send each still + its motion prompt. Get 3 short animated clips.

  4. Wan 2.7 — optional bridge shots

    If you need transitions or B-roll between hero shots, generate them text-to-video.

  5. Off-Alibaba — assemble

    Stitch in any video editor; layer the voiceover. Total runtime 15s; total Qwen calls ~7 if you don't iterate.

Workflow B — Animate an existing product photo

  1. Qwen 3.6 Max — direction

    Upload the product photo. Use the "Image-to-video direction" template to get a tuned motion prompt for HappyHorse.

  2. HappyHorse 1.0 — render

    Send the photo + motion prompt; get the looped clip back.

  3. Qwen 3.6 Max — review

    Optionally send screenshots from the rendered clip back to Max and ask for QC notes — does the motion match brand guidelines? Anything to revise?

Workflow C — Multilingual product launch creative

  1. Qwen 3.6 Max — base creative

    Generate the headline, body copy, and CTA in your source language.

  2. Qwen 3.6 Max — localize

    Use the "Localize marketing copy" template (under Use-case library) for each target market — it goes beyond literal translation.

  3. Qwen Image — localized stills

    For markets where image text matters, regenerate stills with localized typography. (Note: text-in-image rendering is improving but verify each output.)

  4. HappyHorse / Wan — localized motion

    Animate the localized stills. The motion prompt itself usually doesn't need localization.

Cost note

Video generation is the most expensive step in any of these workflows. Lock in stills you actually like (with Qwen Image) before animating with HappyHorse / Wan. Iterating at the still stage is much cheaper than iterating at the video stage.

Verify model availability and pricing in Alibaba Cloud Model Studio. Multimodal lineup iterates fast — re-check before locking in.

🎈 ELI5

chat.qwen.ai is Alibaba's free consumer chat website. Sign in, start chatting. The default model is the latest Qwen flagship; specialized buttons in the input area let you toggle web search, deep reasoning ("Thinking"), and image / video generation tools.

It's free with rate limits — useful for testing prompts before wiring them into your product on the Model Studio API.

chat.qwen.ai — setup

  1. Visit chat.qwen.ai and sign in (Qwen / Alibaba account).
  2. Default model is the latest Qwen flagship. The chat surface routes to the current best Qwen automatically; specific model selection (Max vs Plus vs others) may be available via a model picker depending on region.
  3. Toggles to know:
    • Thinking — turn on chain-of-thought reasoning for hard tasks.
    • Web search — ground responses in current web results.
    • File upload — drop in PDFs, code, docs, images. With long context you can upload sizable files.
    • Image / Video tools — generate images via Qwen Image, video via Wan or HappyHorse, depending on the surface area of the chat experience in your region.
  4. Rate limits apply. For guaranteed throughput or production access, use Model Studio API.

Modes & tools — when to use each

Toggle | Turn it on for | Skip it for
Thinking | Math, multi-step reasoning, code review, agentic planning, ambiguous questions. | Quick chat, simple summaries, formatting tasks. Adds latency.
Web search | Current events, recent product info, time-sensitive queries, fact-checking. | Math, code, abstract reasoning that doesn't need fresh data.
Image gen | Visual ideation, marketing concepts, illustrative diagrams. | Tasks that don't need an image — wastes tokens and credits.
Video gen | Short marketing clips, animated stills, visual prototypes. | Anything where a static image suffices — video is much heavier on cost.

Optimal prompts for chat.qwen.ai

Qwen models reward structure. Below are tested templates for the chat surface.

Agentic app-dev plan
Agentic app dev You are an autonomous app-development agent. The user wants: [feature description]. Step 1 — PLAN. Read the relevant context. Restate the goal. List the changes you'll make in order, and identify risks or decisions that need user input. Step 2 — EXECUTE. Walk through each change. For UI work, describe the visual layout and behavior precisely. For backend work, name the files, functions, and data shapes. Step 3 — VERIFY. Re-read the plan against the executed changes. Identify gaps. Suggest tests. Be explicit and structured — assume your output will be acted on directly.
Visual reasoning over screenshots
Visual reasoning I'm uploading a screenshot of [a dashboard / a product page / a document]. Read it carefully and answer: 1. What is the main thing this view is communicating? 2. What are the 3 most important data points or actions on the page? 3. If a user wanted to [task], what's the path and what could go wrong? 4. Anything visually inconsistent, broken, or unclear? Cite specific elements (top-left, third row, etc.) so I can verify.
Long-context codebase review
Codebase review I've uploaded the entire codebase for [project]. Use the full context. Review for: 1. Architectural smells — cyclic deps, leaky abstractions, premature abstraction. 2. Correctness bugs — error handling gaps, off-by-one, race conditions. 3. Security issues — injection, authz, secrets in code. 4. Performance — N+1 queries, blocking I/O, allocations in loops. For each finding: file:line — severity (critical/high/medium/low) — one-sentence rationale — suggested fix. Skip nitpicks (style, naming) unless they obscure intent.

chat.qwen.ai features iterate fast. Toggle names and exact behavior reflect the public surface at the time of writing — confirm in-app.

🎈 ELI5

The Alibaba Cloud Model Studio API is OpenAI-compatible — same request shape, just point at the Alibaba endpoint and pass qwen3.6-max (or your chosen model) as the model field. There's also a native DashScope SDK if you prefer first-party tooling.

The two things to use early: function calling (Qwen 3.6 Max is tuned for agentic tool use) and the open-weights self-host path via Hugging Face (for compliance, fine-tuning, or cost-floor deployment).

Account & keys

  1. Visit Alibaba Cloud Model Studio and sign in with your Alibaba Cloud account. Choose the region that matches your latency / data-residency needs (China, Singapore, international).
  2. Add a payment method. Pay-as-you-go in CNY or USD depending on region.
  3. Generate an API key from the dashboard. Treat as a password — store in env vars, not in code.
  4. Verify access: hit the chat completions endpoint with a small test request to confirm the key and region.

First API call

The Model Studio chat completions endpoint is OpenAI-compatible. If you've used the OpenAI SDK, the only changes are base_url and model.

Python — OpenAI SDK (Model Studio)

from openai import OpenAI

# Singapore region example; use your region's endpoint
client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

resp = client.chat.completions.create(
    model="qwen3.6-max",
    messages=[
        {"role": "system", "content": "You are an autonomous app-dev agent."},
        {"role": "user", "content": "Plan a React + FastAPI todo app. List file structure and key tradeoffs."},
    ],
)
print(resp.choices[0].message.content)
curl

curl https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-max",
    "messages": [{"role":"user","content":"Hello, Qwen 3.6 Max."}]
  }'

OpenAI-compatible vs DashScope native

Aspect | OpenAI-compatible | DashScope native
Base URL pattern | .../compatible-mode/v1 | dashscope[-intl].aliyuncs.com/api/v1
SDK | openai | dashscope
Best when | Migrating from OpenAI / minimal code change | You want first-party features and tight Alibaba Cloud integration
Multimodal models (Wan, HappyHorse) | Image / video gen via dedicated DashScope endpoints, not chat completions | Native — these are first-class in DashScope
Region matters

Alibaba runs separate Model Studio regions (China, Singapore-international, etc.). Pick the one that matches your latency, language, and data-residency needs. The endpoints, available models, and prices vary by region.
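For the native path, a minimal sketch assuming the dashscope Python SDK's Generation.call interface. The model ID and the international base URL are the examples used in this manual; confirm the exact values for your region in the Model Studio console before relying on them.

import os
import dashscope
from dashscope import Generation

dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]
# International (Singapore) endpoint; omit this line for the China-region default.
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"

resp = Generation.call(
    model="qwen3.6-max",          # example ID from this manual; confirm in-console
    messages=[{"role": "user", "content": "Summarize MoE routing in two sentences."}],
    result_format="message",      # return OpenAI-style message objects
)
if resp.status_code == 200:
    print(resp.output.choices[0].message.content)
else:
    print("error:", resp.code, resp.message)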

Function calls & JSON

Qwen 3.6 Max supports OpenAI-style tool calls and structured output. This is the foundation for agentic workflows.

Tool call

{
  "model": "qwen3.6-max",
  "messages": [{"role": "user", "content": "What's the latest deploy status for project Acme?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_deploy_status",
      "description": "Get the current deploy status for a project",
      "parameters": {
        "type": "object",
        "properties": {"project": {"type": "string"}},
        "required": ["project"]
      }
    }
  }]
}
JSON mode

{
  "model": "qwen3.6-max",
  "messages": [
    {"role": "system", "content": "You output only valid JSON matching the schema."},
    {"role": "user", "content": "Extract {company, role, start_date, end_date} from this resume snippet: ..."}
  ],
  "response_format": {"type": "json_object"}
}
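The same JSON-mode request end to end in Python, reusing the client from the First API call section. The schema fields are illustrative; validate the parse before trusting the output.

import json

resp = client.chat.completions.create(
    model="qwen3.6-max",
    messages=[
        {"role": "system", "content": "You output only valid JSON matching the schema."},
        {"role": "user", "content": "Extract {company, role, start_date, end_date} from this resume snippet: ..."},
    ],
    response_format={"type": "json_object"},
)

raw = resp.choices[0].message.content
record = json.loads(raw)                                   # raises on malformed output
missing = {"company", "role", "start_date", "end_date"} - record.keys()
if missing:
    raise ValueError(f"extraction incomplete, missing: {missing}")
print(record)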

Agentic tasks (Qwen 3.6 Max's flagship use case)

Qwen 3.6 Max is explicitly tuned for autonomous agent workflows. The minimum viable loop:

Agent loop (pseudocode)

system = """You are an autonomous agent.
Tools: read_file, write_file, run_tests, search_repo.
Process for every task:
1. PLAN — read relevant context, restate the goal, list ordered changes, name risks.
2. EXECUTE — call tools one at a time. Observe results before next call.
3. VERIFY — re-read changes, run tests, summarize what's done.
Emit reasoning before each tool call."""

history = [{"role": "system", "content": system},
           {"role": "user", "content": task}]   # task = the user's request
done = False

while not done:
    resp = client.chat.completions.create(
        model="qwen3.6-max",
        messages=history,
        tools=TOOL_SCHEMAS,
    )
    msg = resp.choices[0].message
    if msg.tool_calls:
        history.append(msg)   # keep the assistant turn that requested the tools
        for call in msg.tool_calls:
            result = run_tool(call.function.name, call.function.arguments)
            history.append({"role": "tool", "tool_call_id": call.id, "content": result})
    else:
        done = True

print(resp.choices[0].message.content)
The agentic frame matters

Qwen 3.6 Max's quality jump over Plus is most visible inside an agent loop with explicit plan-execute-verify structure. A "just chat" prompt under-uses Max's tuning. If you're not running it in an agentic frame, drop down to Plus and save cost.
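The loop above leaves run_tool undefined. One minimal way to implement it is sketched below; the tool registry and the read_file helper are placeholders for your own tool set, and note that in OpenAI-style responses the arguments arrive as a JSON string.

import json

def read_file(path: str) -> str:
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

TOOLS = {
    "read_file": read_file,
    # "write_file": ..., "run_tests": ..., "search_repo": ...
}

def run_tool(name: str, arguments: str) -> str:
    """Dispatch one tool call; arguments is the JSON string from the model."""
    fn = TOOLS.get(name)
    if fn is None:
        return f"error: unknown tool {name!r}"
    try:
        kwargs = json.loads(arguments or "{}")
        return str(fn(**kwargs))
    except Exception as e:
        # Return the failure to the model instead of crashing the loop.
        return f"error: {e}"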

Self-host (open weights)

Qwen 3.6 (open variants) and Qwen 3.5 Omni / 397B are downloadable from Hugging Face. Common deployment paths:

  • Ollama for local laptop dev.
  • vLLM for production GPU serving.
  • Text Generation Inference (TGI) for Hugging Face-native serving.
  • llama.cpp for quantized inference on commodity hardware.
  • Alibaba Cloud PAI / EAS for first-party self-managed serving on Alibaba infra.
Qwen 3.6 Max is proprietary, not open

Open weights cover Qwen 3.6 (35B-A3B / 27B), Qwen 3.5 (incl. 397B-A17B and Omni), and earlier generations. Qwen 3.6 Max is closed and only available via Model Studio. Plan accordingly if open-weights deployment is a hard requirement.
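A minimal self-host sketch using vLLM's OpenAI-compatible server. The Hugging Face repo ID is a placeholder; substitute the actual open Qwen checkpoint you're deploying.

# Serve (shell):  vllm serve Qwen/<open-qwen-checkpoint> --port 8000
from openai import OpenAI

local = OpenAI(
    api_key="unused-for-local",              # vLLM does not check the key by default
    base_url="http://localhost:8000/v1",     # vLLM's OpenAI-compatible endpoint
)
resp = local.chat.completions.create(
    model="Qwen/<open-qwen-checkpoint>",     # must match the model name vLLM is serving
    messages=[{"role": "user", "content": "Hello from self-hosted Qwen."}],
)
print(resp.choices[0].message.content)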

Streaming responses

For chat UIs and long generations, stream responses so users see progress instead of waiting for the full payload. The OpenAI-compatible endpoint supports stream=true exactly like the OpenAI SDK.

Python — streaming

stream = client.chat.completions.create(
    model="qwen3.6-max",
    messages=[{"role": "user", "content": "Explain MoE routing in 1 paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)

Rate limits, retries & errors

Treat the API as eventually reliable, not perfectly reliable. The minimum viable error-handling shape:

  • 429 (rate limit): back off with jitter. Start at 1s, double up to ~30s. Don't retry forever — give up after 5 attempts and surface to the caller.
  • 5xx (transient): retry up to 3 times with backoff; then fail loud.
  • 4xx other than 429: don't retry — these are bugs in your request shape (bad model name, malformed JSON, missing field). Log and fix.
  • Timeouts: set a sensible client timeout (e.g. 60s for Max non-streaming, longer for streaming). Don't inherit the default — it's usually too long.
Python — retry with backoff

import time, random
from openai import OpenAI, RateLimitError, APIError

def call_with_retry(messages, model="qwen3.6-max", max_attempts=5):
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            sleep = delay + random.uniform(0, delay * 0.5)
            time.sleep(min(sleep, 30))
            delay *= 2
        except APIError as e:
            if 500 <= getattr(e, "status_code", 0) < 600 and attempt < 2:
                time.sleep(delay)
                delay *= 2
                continue
            raise
    raise RuntimeError(f"giving up after {max_attempts} attempts")

Cost-control patterns

  • Two-tier routing. Send simple work (classification, extraction, formatting) to qwen3.6-plus; reserve qwen3.6-max for actual reasoning / agentic / visual tasks. Often cuts cost 60-80% with no quality drop.
  • Cap output tokens. Set max_tokens deliberately. With long-context inputs it's tempting to let output run unbounded — costs add up fast.
  • Short system prompts. A 5-line system prompt that defines role and constraints crisply outperforms a 30-line one in most evals — and costs less per call.
  • Stop iterating in chat for large workflows. Convert "tell me more about X" loops into one well-scoped prompt — each iteration replays the entire context.
  • Choose the right region. Latency and pricing differ across China / Singapore / international Model Studio regions. Match to where your users (and data) live.

Multi-region considerations

Alibaba Cloud Model Studio runs in multiple regions. The same model ID (e.g. qwen3.6-max) can have different latency, availability, and per-1M-token pricing depending on the region your endpoint lives in. Practical rules:

  • Pin a region per environment (dev / staging / prod). Don't let it drift via DNS.
  • Test at least one alternate region so you have a documented failover path.
  • Data-residency-sensitive workloads: the Singapore-international region is typically the right choice for non-PRC data flows; verify the current terms before committing.
  • Don't assume PII-handling parity across regions. The contracts and certifications differ.
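One way to make the region pin explicit in code, as a sketch. The endpoint URLs are examples (the international one appears earlier in this manual); confirm your region's base URL in the console.

import os
from openai import OpenAI

REGION_ENDPOINTS = {
    "intl": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # Singapore / international
    "cn": "https://dashscope.aliyuncs.com/compatible-mode/v1",         # China (example; verify)
}

# Set QWEN_REGION per environment (dev / staging / prod) so it can't drift.
region = os.environ.get("QWEN_REGION", "intl")
client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url=REGION_ENDPOINTS[region],
)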
🎈 ELI5

Good prompts give Qwen four things: a role (who to be), a goal (what good looks like), an audience (who's reading), and a format (how to lay it out). For Qwen 3.6 Max specifically: add plan-execute-verify structure for hard agentic work.

Use-case library

Tested templates organized by what you're trying to accomplish. Each one assumes Qwen 3.6 Max unless noted; drop down to qwen3.6-plus when the task is simpler and cost matters.

Agentic app development (Qwen 3.6 Max's headline)

Build a feature end-to-end (plan → execute → verify)
Agentic app dev You are a senior engineer running in an autonomous agent loop. Tools: read_file, write_file, run_tests, search_repo. Task: [describe the feature] Process: 1. PLAN — Read relevant files. Restate the task. List changes in order. Identify risks. 2. EXECUTE — Make changes one file at a time. After each meaningful change, run tests if relevant. 3. VERIFY — Re-read the diff. Confirm match against plan. Run full test suite. If anything fails, diagnose and fix before declaring done. Emit reasoning before each tool call. Final summary: files changed, tests run, caveats.
Add a new endpoint with tests (smaller-scoped)
Add endpoint You have read access to the codebase. Add a new endpoint: Spec: [HTTP method, path, inputs, outputs, error cases] Steps: 1. Locate the routing layer and existing handlers; mirror their conventions. 2. Add the handler + any DB access + validation. 3. Wire it into the router. 4. Write 3 tests: happy path, validation error, auth error. 5. Run the test file and confirm green. Show me the full diff before declaring done. Don't add features I didn't ask for.

Visual browsing & UI reasoning (the other Max headline)

Walk through a UI workflow from screenshots
Visual browsing I'm giving you a sequence of screenshots from [a webpage / dashboard / app]. Walk through the workflow visually and answer: 1. What is this surface designed to do? 2. What are the 3 most important interactive elements, and where are they? 3. If I wanted to [accomplish task], what's the click path? Note any friction or confusing UX. 4. Anything visually broken, inconsistent, or accessibility-poor? Cite specific elements by location and label. Be precise — assume your output will be used to file UX issues.
UX audit of a single screen
UX audit Audit the attached screen as if you're a senior product designer. Output: 1. Primary goal of this screen — what is it trying to get the user to do? 2. Visual hierarchy — does the most important action have visual primacy? Cite the elements. 3. Information density — too much / too little / right? Cite specifics. 4. Top 3 issues, severity (blocker / major / minor) and a one-sentence fix each. 5. One thing the design does well — be specific. Be concrete; reference specific elements by location and label.

Long-context analysis (1M+ tokens)

Whole-codebase or whole-document review
Long-context I'm giving you a complete [codebase / research bundle / set of documents]. Use the full context — don't artificially chunk. Goal: [your goal] Output sections: 1. High-level summary (3 bullets). 2. Cross-cutting patterns or themes you noticed. 3. Specific findings with citations (file:line or doc:section). 4. Open questions / what's missing. 5. Suggested next steps in priority order.
Cross-document research synthesis with citations
Research synthesis I'm uploading a bundle of [N] documents. Synthesize what they collectively say about [topic]. Output: 1. Areas of agreement (with citations: doc-name:section). 2. Areas of disagreement, including which sources take which positions. 3. Gaps — questions the bundle doesn't answer. 4. The 5 most important takeaways, prioritized by decision-relevance. 5. A 1-paragraph TL;DR a non-expert could read. Cite every factual claim. If the bundle is silent on a question, say so explicitly — don't fill in from your training data.

Coding & engineering

Code review against your own conventions
Code review You are a senior reviewer for [language/framework]. Review the diff below against these conventions: [your house style / linter rules / architectural rules] Output for each finding: - file:line — severity (blocker/major/minor) — one-sentence rationale — suggested fix - Skip nitpicks unless they obscure intent - Note positives only if they're surprising or worth replicating
Debug a failing test
Debug This test is failing with the error below. Diagnose the root cause and suggest a fix. Test code: [paste] Production code under test: [paste] Error / stack: [paste] Process: 1. Restate the failure in your own words. 2. List 3 hypotheses for the root cause, ranked by likelihood. 3. Walk through which hypothesis the evidence best supports. 4. Suggest a minimal fix. 5. Note if the test itself might be wrong (don't always trust the test).
API / interface design review
API design I'm designing an API for [purpose]. Below is the draft. [paste interface / OpenAPI / type signatures] Review for: 1. Consistency — naming, casing, plural/singular, response shapes. 2. Errors — are error cases enumerable and informative? 3. Versioning & deprecation — is the surface forward-compatible? 4. Composability — can clients combine endpoints without N+1 traffic? 5. Surprises — anything a competent caller would not expect? Suggest specific changes; don't redesign from scratch unless I ask.

Multimodal pipelines (Qwen Image / Wan / HappyHorse)

Marketing video pipeline (Max → Image → HappyHorse)
Pipeline orchestrator Plan a 15-second marketing video for [product] targeting [audience]. Output 3 stages: 1. STILL FRAMES — describe 3 image-generation prompts to send to Qwen Image. Each prompt should be self-contained (style, subject, composition, mood). 2. ANIMATIONS — for each generated still, write the motion prompt to send to HappyHorse 1.0 (what moves, how fast, camera). 3. SCRIPT — voiceover lines + on-screen text for each shot, with timestamps. End with a 1-line summary I can hand off to a producer.
Animate a single product photo (HappyHorse)
Image-to-video direction I'm uploading a still product photo. Produce the motion prompt I should send to HappyHorse 1.0 to bring it to life as a 5-8 second loop. Specify: 1. What moves (the product? the background? both?) 2. How it moves (rotation / drift / parallax / particles / lighting shift) 3. Camera behavior (locked / slight push / pan) 4. Duration and pacing 5. End-frame guidance so the loop can repeat seamlessly Aim for understated, premium motion — not cartoony.

Data extraction & structured output

Extract a typed schema from messy text
Schema extraction Extract the following schema from the input below. Return ONLY valid JSON — no prose, no code fences. Schema: { "company": "string", "role": "string", "start": "YYYY-MM", "end": "YYYY-MM | 'present'", "skills": ["string"] } If a field is missing, use null (not empty string). If multiple records, return an array. Input: """ [paste] """
Normalize free-form data into rows
Normalize I'll paste a block of free-form text containing [N] records. Normalize them into TSV with columns: [col1, col2, col3]. Rules: - One record per line. - Use a tab between columns. - Trim whitespace; preserve original casing. - Empty cells = literal "NA". - If a record can't be parsed cleanly, output it on its own line prefixed with "# UNPARSED:". Don't add commentary. Just the TSV.

Translation & multilingual (Qwen's strength)

High-fidelity Chinese ↔ English translation
Translation Translate the following from [source language] to [target language]. Preserve: - Tone and register (formal / casual / technical) - Cultural references — gloss them inline only if needed - Names and proper nouns (don't translate) - Numbers and dates in the target locale's conventional format If a phrase has no clean equivalent, give the closest natural rendering and flag it briefly with [tn: ...]. Input: """ [paste] """
Localize marketing copy (not just translate)
Localization Localize the marketing copy below from [source] for [target market]. This is localization, not literal translation: the goal is a version a native speaker would say a local team wrote. For each line of copy: 1. Translate the literal meaning. 2. Suggest a localized version that captures the intent more naturally. 3. Note any cultural / tonal mismatches I should know about. Source copy: """ [paste] """

Research & decision-making

Steelman opposing positions before deciding
Steelman + decide I'm trying to decide whether to [decision]. Before recommending, do this: 1. Steelman the case FOR — strongest version of the argument, even if you don't believe it. 2. Steelman the case AGAINST — same. 3. Identify the decisive factor — what would you need to know to call it? 4. Recommend, with the reasoning that drove your call. 5. Note what would change your recommendation. Be willing to say "I'd want more info before recommending" if that's the honest answer.
Risk pre-mortem
Pre-mortem I'm about to do [action / launch / decision]. Run a pre-mortem: assume it's 6 months later and the project failed. List the top 8 reasons it could have failed, ranked by likelihood × impact. For each: - Failure mode (1 sentence) - Probability (low / medium / high) - Impact (low / medium / high) - Earliest signal that this is happening - Cheapest way to mitigate or detect End with: top 3 risks I should de-risk this week.

Writing & editing

Tighten prose without losing voice
Edit Edit the draft below to be tighter without changing the voice. Goals: - Cut filler ("very", "really", "in order to", redundant phrases). - Replace passive voice with active where it sharpens the sentence. - Combine consecutive sentences that share a subject. - Keep specific words and metaphors I chose — those are signature. Output: the edited version + a short list of the 5 changes that mattered most. Draft: """ [paste] """
Outline a long-form piece before writing
Outline I want to write a [length, format] piece on [topic] for [audience]. Before drafting, produce: 1. A working thesis (one sentence). 2. The 3-5 sub-arguments that build to it, in the order they should appear. 3. For each sub-argument: 2-3 supporting examples or pieces of evidence. 4. The single hardest objection a sharp reader will raise — and how the piece addresses it. 5. The opening hook — 2 candidate options with different vibes. Aim for the version of the piece that earns being read.

Prompt builder

Patterns library

Reusable shapes that improve quality independent of the specific task. Most apply to any model; a few are tuned to Qwen 3.6 Max specifically.

The "plan-execute-verify" pattern (Qwen 3.6 Max)

For any non-trivial agentic task, ask Max to output a plan first, wait for approval, then execute. Catches misunderstandings early and uses Max's tuning correctly.

Plan → execute Before doing anything, write a 1-page plan for [task]. Include: - Goal restated in your own words - Ordered list of changes / actions - Tools you'll need - Risks / decisions that need my input - Definition of "done" Wait for me to approve the plan before executing.

The "two-tier" pattern (Plus → Max)

For multi-step pipelines, route simple steps to Qwen 3.6 Plus (cheap, fast) and escalate only the hardest to Max. Often cuts cost 60-80% with no quality loss. Good rule of thumb: triage / classify / route / format → Plus. Reason / plan / write the substantive answer → Max.
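A minimal routing sketch. The keyword heuristic and the 50K-token threshold are placeholders; in production you would route on task type, input size, or a cheap upstream classifier.

HARD_SIGNALS = ("plan", "refactor", "debug", "architecture", "agent", "multi-step")

def route_and_call(client, task: str, messages: list[dict], input_tokens: int):
    """Send simple work to Plus; escalate hard or very long work to Max."""
    hard = input_tokens > 50_000 or any(s in task.lower() for s in HARD_SIGNALS)
    model = "qwen3.6-max" if hard else "qwen3.6-plus"
    return client.chat.completions.create(model=model, messages=messages)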

The "modality handoff" pattern

Qwen 3.6 Max plans → Qwen Image generates stills → HappyHorse / Wan animates. Each model handles its strongest modality. Single-vendor pipeline keeps integration simple. Bonus: Qwen 3.5 Omni in the loop if audio is involved.

The "schema first" pattern

When you need structured output, define the schema in the prompt before showing the input. Qwen follows declared schemas reliably; ad-hoc "give me JSON" prompts produce more drift.

Schema first Output ONLY valid JSON matching this schema. No prose, no code fences. Schema: { "field_a": "string", "field_b": "number | null", "tags": ["string"] } Now the input: """ [input] """
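A sketch of the pattern with one repair round, assuming the OpenAI-compatible client from the API section; the model choice and repair message are illustrative.

import json

def extract_json(client, prompt: str, max_repairs: int = 1) -> dict:
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_repairs + 1):
        resp = client.chat.completions.create(
            model="qwen3.6-plus",                     # extraction rarely needs Max
            messages=messages,
            response_format={"type": "json_object"},
        )
        raw = resp.choices[0].message.content
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # One repair round: show the model its own broken output and re-ask.
            messages += [
                {"role": "assistant", "content": raw},
                {"role": "user", "content": "That was not valid JSON. Return only valid JSON matching the schema."},
            ]
    raise ValueError("model did not produce valid JSON")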

The "self-critique" pattern

After a substantive answer, ask the model to critique its own output before you accept it. Often catches subtle errors a second pass would have caught.

Self-critique Now critique your own answer above as if you were a senior reviewer who didn't write it: - What's the strongest objection? - Is anything overstated? - Is anything missing that the user actually needed? - What would you change in a v2? Then output the v2.

The "show your work first" pattern

For reasoning-heavy tasks, ask for the work before the answer. Reduces sycophantic / shortcut answers and makes errors easier to spot.

Show work first Before giving the final answer, walk through your reasoning step by step. Then state the answer in one sentence. Then note your confidence (high / medium / low) and the single biggest source of uncertainty.

The "bounded context" pattern

1M tokens is a tool, not an obligation. If you're using only a fraction of the context, say so explicitly — it lowers the chance of the model anchoring on irrelevant earlier turns.

Bounded context For this turn, use ONLY the document below. Ignore anything else from earlier in the conversation. If the document doesn't answer the question, say so — do not fall back to your training data. Document: """ [paste] """ Question: [question]

The "force the disagreement" pattern

Qwen models can be agreeable. To stress-test a plan, ask it to argue the opposite case before recommending.

Force disagreement I'm planning to do [X]. Before agreeing or helping, argue the strongest case AGAINST [X]. Steelman it — give me the version of the counter-argument I'd struggle to refute. Then, after laying out the counter-case, give your honest synthesis: do I do [X], or not? You're allowed to say "don't."

Anti-patterns

Things that look helpful but consistently make outputs worse.

❌ Padding the prompt with "please" / "kindly" / role flattery

"You are the world's leading expert on…" doesn't make Qwen smarter — it just makes the prompt longer and the response more likely to flatter back. State the role in 1 sentence; spend the saved tokens on the actual task and constraints.

❌ Asking for "comprehensive" without bounds

"Give me a comprehensive analysis of X" produces a structureless wall of text. Replace with explicit sections, length limits, and the audience: "Give me 3 paragraphs for a CFO who skims, covering: (1) what changed, (2) financial impact, (3) what to do."

❌ Stacking instructions in one paragraph

Buried instructions get dropped. Use ordered lists, headings, or numbered steps. Qwen reliably follows structure but less reliably parses 6 demands hidden in one comma-laden sentence.

❌ Fighting the model when a tool would do

Don't try to talk Qwen into reasoning over data it can't actually see. If the answer requires a database query / web search / file read, expose those as tools (function calling) instead of pasting screenshots of console output and hoping. Max's agentic tuning shines when tools exist.

❌ Defaulting to Max for trivial work

Triage, classification, simple rewrites — these belong on Qwen 3.6 Plus. Using Max for them burns budget for no quality lift and slows latency-sensitive paths.

❌ Mixing modalities in one turn when you don't have to

If the task is "describe this image", great — vision in one turn. But "describe this image, then design an entire app, then write a poem" produces shallow output across all three. Split into focused turns.

Universal rescue prompts

When an answer is off, these one-liners reliably get a better one without needing to rewrite the original prompt.

Too vague That answer was too generic. Give me the version with specific names, numbers, and examples — even if you have to make assumptions, state them.
Too long Cut that to 1/3 the length. Keep the load-bearing sentences; drop the framing, transitions, and softening.
Off-track Stop. You drifted. Re-read my original ask, restate it in one sentence, then give a clean answer that addresses only that.
Sycophantic Drop the "great question" / "absolutely" intros. Start with the answer. If I'm wrong about something, tell me — I'd rather hear it than be agreed with.
Pick a side That answer hedged. Give me your actual recommendation — you can be wrong; I'll push back if I disagree. Pick a side and defend it.