
Qwen 3 235B A22B

Largest Qwen3 MoE release with 235B total parameters and 22B activated parameters, aimed at frontier-scale open reasoning and agent use.

Overview and architecture

What it is

Company: Qwen
Family: Qwen3
Release date: Apr 27, 2025
Architecture: Mixture-of-experts transformer
License: Apache 2.0
Modality: Text
Context window: 131,072 tokens
Total params: 235B
Active params: 22B
Layers: 94
Hidden size: 4,096
Attention heads: 64
KV heads: 4
KV-bearing layers: 94
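
For the sizing sketches below, here is the same spec sheet as a small Python dictionary. All values come from the table above except head_dim, which the table does not list; published Qwen3 configs set it to 128 explicitly rather than deriving it from hidden size, so treat that entry as an assumption.

```python
# Spec sheet from the table above. head_dim is the one assumption:
# published Qwen3 configs set it to 128 explicitly instead of deriving
# it as hidden_size / attention_heads (which would give 64).
QWEN3_235B_A22B = {
    "total_params": 235e9,    # resident expert pool
    "active_params": 22e9,    # activated per token
    "layers": 94,
    "hidden_size": 4096,
    "attention_heads": 64,
    "kv_heads": 4,            # grouped-query attention
    "kv_layers": 94,          # every layer carries a KV cache
    "head_dim": 128,          # assumption, see note above
    "context_window": 131_072,
}
```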

Research highlight

What improved

Flagship Qwen3 MoE

This is the largest-capacity release in the current Qwen3 line, positioned as the family's top open-weight model.

Sparse activation at scale

The model holds 235B total parameters but activates only 22B per token; that sparsity is the core reason to deploy it instead of a dense frontier-scale model.
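
A rough sketch of what that split means for a fit calculation, assuming the usual bytes-per-parameter figures for common weight formats and the standard ~2 FLOPs-per-active-weight estimate for a forward pass; real deployments add activation, buffer, and runtime overhead on top.

```python
def weight_memory_gib(total_params: float, bytes_per_param: float) -> float:
    """Resident weight footprint in GiB: every expert must be loaded."""
    return total_params * bytes_per_param / 2**30

def decode_flops_per_token(active_params: float) -> float:
    """Rough forward-pass FLOPs per generated token (~2 per active weight)."""
    return 2 * active_params

TOTAL, ACTIVE = 235e9, 22e9
for fmt, bpp in [("FP16/BF16", 2.0), ("FP8/INT8", 1.0), ("4-bit", 0.5)]:
    print(f"{fmt:>9} weights ~ {weight_memory_gib(TOTAL, bpp):5.0f} GiB")
print(f"Decode compute ~ {decode_flops_per_token(ACTIVE) / 1e9:.0f} GFLOPs/token")
```

Even at 4-bit the resident weights are still over 100 GiB, while per-token compute stays in the range of a ~22B dense model; that asymmetry is the whole MoE trade.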

Reasoning and agent focus

Qwen frames the flagship around reasoning, instruction following, and agent-style workflows rather than raw benchmark scale.

Training and release context

How it was released

MoE family branch

Qwen3 includes dedicated MoE models alongside the dense line, keeping the same user-facing thinking/non-thinking modes while materially changing the serving geometry.

Sparse activation

The MoE releases expose total and activated parameter counts separately, which is the key deployment distinction versus the dense Qwen3 models.

Long-context packaging

The base MoE releases are published with 32K native context and 131K support with YaRN, while the 2507 update is packaged at 256K native context.
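
The 4-KV-head geometry from the table is what keeps those context lengths affordable. A minimal per-sequence KV-cache sizing sketch, assuming head_dim = 128 (an assumption, as noted above) and an FP16 cache:

```python
def kv_cache_gib(seq_len: int, layers: int = 94, kv_heads: int = 4,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Per-sequence KV cache in GiB: one K and one V tensor per KV-bearing layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len / 2**30

for ctx in (32_768, 131_072):
    print(f"{ctx:>7,} tokens -> {kv_cache_gib(ctx):4.1f} GiB per sequence")
```

With 4 KV heads instead of 64, the cache is 16x smaller than full multi-head attention would produce: a full 131K-token sequence costs roughly 23 GiB rather than ~375 GiB.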

Where it is strong

Reasoning with lower active compute

The MoE line is for users who want larger total capacity without paying dense-model active compute per token.

Agent and tool use

Qwen positions the MoE branch around agent workflows, tool calling, and mixed reasoning and general dialogue use.

Large multilingual serving

Useful when you need very large-capacity multilingual serving without committing to a dense 70B+ model.

Memory behavior

What dominates VRAM

Even with only 22B activated per token, the full 235B resident expert pool dominates VRAM immediately, so this is primarily a multi-card or very-large-card deployment target.
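
To make "multi-card deployment target" concrete, a back-of-the-envelope fit check that counts only weights plus one full-context FP16 KV cache (figures from the sketches above) against 80 GiB cards with ~90% usable VRAM; all of those figures are assumptions, and real servers need extra headroom for activations and the runtime:

```python
import math

def min_gpus(weight_gib: float, kv_gib: float, card_gib: float = 80.0,
             usable_frac: float = 0.9) -> int:
    """Lower bound on card count: total footprint over usable VRAM per card."""
    return math.ceil((weight_gib + kv_gib) / (card_gib * usable_frac))

KV_131K = 23.5  # one full-context sequence at FP16, from the KV sketch above
for fmt, weights_gib in [("FP16/BF16", 438.0), ("FP8/INT8", 219.0), ("4-bit", 109.5)]:
    print(f"{fmt:>9} -> at least {min_gpus(weights_gib, KV_131K)} x 80 GiB cards")
```

Even aggressive 4-bit quantization leaves this a two-card job at minimum, which is exactly the page's point: the 235B resident pool, not the 22B active slice, sets the hardware floor.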

Sources

Where this page is grounded