
Model notes

Qwen 2.5 32B

Larger dense Qwen variant that often pushes single-GPU inference toward aggressive quantization.

32.5B dense • 131,072 context • 8 KV heads

Architecture

Model spec

Architecture: Dense decoder-only transformer
Total params: 32.5B
Active params: 32.5B (dense; all parameters are resident and active every token)
Layers: 64
Hidden size: 5,120
Attention heads: 40
KV heads: 8
KV-bearing layers: 64
Context length: 131,072
Modality: Text
License: Apache 2.0
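The spec numbers above translate directly into resident-weight memory. A minimal sketch, assuming every parameter is stored at the quoted width; real checkpoint repos add config files and keep some tensors (embeddings, scales) at higher precision, so treat these as lower bounds:

```python
# Rough resident-weight footprint for a 32.5B dense model.
# Assumption: uniform quantization width across all parameters.

PARAMS = 32.5e9  # total parameters (dense: all of them are resident)

def weight_gb(bits_per_param: float) -> float:
    """Weight memory in GB (10^9 bytes) at a given storage width."""
    return PARAMS * bits_per_param / 8 / 1e9

bf16 = weight_gb(16)  # ~65.0 GB -- close to the ~65.5 GB BF16 repo
int4 = weight_gb(4)   # ~16.3 GB -- the GPTQ/AWQ repos land near 19.4 GB
                      # once group scales and fp16 layers are included
print(f"BF16: {bf16:.1f} GB, Int4: {int4:.1f} GB")
```

The gap between the 16.3 GB lower bound and the ~19.4 GB published 4-bit repos is quantization metadata plus tensors left unquantized.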

Why it matters

Why memory behaves this way

Research highlight

High-capacity dense Qwen checkpoint optimized for long-context inference rather than sparse routing.

Memory note

Dense resident weights dominate here, which is why 4-bit loading is usually the difference between fitting and not fitting on one card.
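Weights are only half of the story at long context. A worked sketch of the KV-cache side, assuming an FP16 cache (2 bytes per value) and the spec values above (64 KV-bearing layers, 8 KV heads under GQA, head_dim = 5,120 / 40 = 128):

```python
# Per-token KV cache for Qwen 2.5 32B (GQA: 8 KV heads, not 40).
layers, kv_heads = 64, 8
head_dim = 5120 // 40   # 128
bytes_per_val = 2       # FP16/BF16 cache

# Factor of 2 covers both the K and the V tensors.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
print(per_token)        # 262144 bytes = 256 KiB per token

full_ctx_gib = per_token * 131_072 / 2**30
print(f"{full_ctx_gib:.0f} GiB at the full 131,072-token context")  # 32 GiB
```

So a maxed-out context costs about 32 GiB of cache on top of the weights; GQA already cut that 5x versus caching all 40 heads.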

Checkpoints

Official profiles

Official BF16 checkpoint


The official Qwen2.5-32B-Instruct checkpoint repository is about 65.5 GB on Hugging Face.

Works with vLLM and Transformers.

Official GPTQ 4-bit checkpoint


The official Qwen2.5-32B-Instruct-GPTQ-Int4 checkpoint repository is about 19.4 GB on Hugging Face.

Works with vLLM and Transformers.

Official AWQ 4-bit checkpoint


The official Qwen2.5-32B-Instruct-AWQ checkpoint repository is about 19.3 GB on Hugging Face.

Works with vLLM and Transformers.
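Putting the checkpoint sizes together with cache cost gives a quick fit check. A hedged sketch for a hypothetical 24 GB card: it uses the ~19.4 GB GPTQ-Int4 repo size, assumes an FP16 KV cache at 256 KiB per token for this model's GQA layout, and ignores activations, the CUDA context, and framework overhead (typically another 1-3 GB), so the result is optimistic:

```python
# Optimistic upper bound on context length next to 4-bit weights
# on a 24 GB card. Assumption: everything except weights + KV is free.
card_gb = 24.0
weights_gb = 19.4              # published GPTQ-Int4 repo size
kv_bytes_per_token = 262_144   # 256 KiB/token: 2 * 64 layers * 8 KV heads
                               # * 128 head_dim * 2 bytes (FP16)

free_bytes = (card_gb - weights_gb) * 1e9
max_tokens = int(free_bytes // kv_bytes_per_token)
print(max_tokens)  # ~17,500 tokens before any runtime overhead
```

In practice a serving stack's reserved memory pushes the usable context well below this bound, which is why long-context work on this model tends to need more than one 24 GB card even at 4-bit.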
