Qwen3 0.6B
The smallest dense Qwen3 release, with switchable thinking and non-thinking modes and a very light deployment footprint.
Overview and architecture
What it is
Company: Alibaba (Qwen team)
Family: Qwen3
Release date: April 2025
Architecture: Dense decoder-only transformer with grouped-query attention
License: Apache 2.0
Modality: Text in, text out
Context window: 32K tokens native
Total params: ~0.6B
Active params: ~0.6B (dense; all parameters active)
Layers: 28
Hidden size: 1024
Attention heads: 16
KV heads: 8
KV-bearing layers: 28 (every layer caches keys and values)
Research highlight
What improved
Thinking-mode switch
Qwen3’s defining change is seamless switching between deeper reasoning and faster non-thinking dialogue within the same checkpoint; a minimal mode-switch sketch follows this list.
Reasoning uplift
Qwen positions the line as stronger than QwQ in thinking mode and stronger than Qwen2.5 instruct models in non-thinking mode on reasoning-heavy tasks.
Agent and multilingual focus
The release also emphasizes stronger agent use and support for 100+ languages and dialects, even at smaller sizes.
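A minimal sketch of the hard mode switch through Hugging Face transformers, assuming the published Qwen/Qwen3-0.6B checkpoint; the prompt and generation settings are illustrative rather than Qwen's recommended defaults.

```python
# Hard switch between thinking and non-thinking modes via the chat
# template's enable_thinking flag, as documented on the Qwen3 model cards.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{"role": "user", "content": "What is 17 * 24?"}]

# enable_thinking=True leaves room for a <think>...</think> block before
# the final answer; set it to False for fast non-thinking dialogue.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```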
Training and release context
How it was released
Family release
Qwen3 is released as a dense and MoE model family centered on switching between thinking and non-thinking modes within the same model.
Training stage
Qwen describes the release as the product of a full pretraining and post-training pipeline rather than a small instruction-only adaptation of an existing base.
Context packaging
The 0.6B model is published with 32K native context, while the larger dense variants can be extended to 131K with YaRN.
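A hedged sketch of what the YaRN extension looks like in practice, overriding rope_scaling on the loaded config in transformers; the factor-4 scaling over a 32K base follows the pattern Qwen documents for the larger variants, and the 8B model id is used here because the 0.6B card does not advertise the 131K extension.

```python
# YaRN context extension by config override, a sketch for the larger
# dense variants (illustrative values; verify against the model card).
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "Qwen/Qwen3-8B"
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,  # 32,768 native positions * 4 = 131,072
    "original_max_position_embeddings": 32768,
}
config.max_position_embeddings = 131072

model = AutoModelForCausalLM.from_pretrained(model_name, config=config, device_map="auto")
```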
Where it is strong
Thinking and non-thinking use
The 0.6B release is built to switch between a deeper reasoning mode and a faster general dialogue mode without changing models; the per-turn soft switches are sketched after this list.
Agent workflows
Qwen positions the family for tool use and agent-style tasks in both thinking and non-thinking modes; a tool-call sketch also follows this list.
Multilingual assistant work
The family is published with support for 100+ languages and dialects, making it a broad multilingual assistant line rather than a narrow specialist release.
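For the per-turn switching mentioned above, Qwen documents /think and /no_think soft switches that ride along in user messages in multi-turn chat; the conversation below is illustrative, and a hard enable_thinking=False setting overrides the soft switches.

```python
# Per-turn mode control with the documented soft switches.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

messages = [
    {"role": "user", "content": "Give me a one-line definition of RoPE. /no_think"},
    {"role": "assistant", "content": "RoPE encodes token positions as rotations of query/key vectors."},
    {"role": "user", "content": "Now walk through why that preserves relative offsets. /think"},
]

# With enable_thinking=True, the soft switches decide each turn's mode.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
```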
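For the agent angle, a hedged sketch of exposing a tool schema through the chat template's tools argument in transformers; get_weather is a hypothetical stand-in rather than anything shipped with the release, and production agent loops typically run through a framework such as Qwen-Agent.

```python
# Advertising a (hypothetical) tool to the model via the chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "Is it raining in Hangzhou right now?"}]
prompt = tokenizer.apply_chat_template(
    messages, tools=tools, tokenize=False, add_generation_prompt=True
)
# The model is expected to emit a structured tool call; the host parses it,
# runs the tool, and appends the result as a "tool" role message.
```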
Memory behavior
What dominates VRAM
At this size the resident weight floor stays small (about 1.1 GiB in FP16), so the runtime reserve and the long-context KV cache can become a larger fraction of total VRAM than on bigger dense models.
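A back-of-envelope illustration of that trade-off, assuming the published 0.6B config (28 layers, 8 KV heads, head_dim 128) and an FP16 cache; treat the outputs as rough estimates, not measured VRAM.

```python
# KV-cache vs. weight memory for Qwen3-0.6B (rough estimate).
layers, kv_heads, head_dim = 28, 8, 128
bytes_per_elem = 2          # FP16/BF16
seq_len = 32_768            # full native context

# K and V each store kv_heads * head_dim values per layer per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
cache_gib = kv_bytes_per_token * seq_len / 2**30

weights_gib = 0.6e9 * bytes_per_elem / 2**30  # ~0.6B params in FP16

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"At 32K context: cache {cache_gib:.1f} GiB vs weights {weights_gib:.1f} GiB")
```

At full 32K context the cache alone (~3.5 GiB) outweighs the FP16 weights (~1.1 GiB), which is the inversion the note above describes.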