Model notes
Qwen 3.5 4B
A mid-sized Qwen3.5 checkpoint whose resident multimodal weights enlarge its memory footprint, yet it remains practical for careful single-GPU text-only serving.
5B dense • 262,144 context • 4 KV heads
Architecture

Model spec
Architecture: dense
Total params: ~5B resident (including multimodal weights)
Active params: 4B (language model)
Layers
Hidden size
Attention heads
KV heads: 4
KV-bearing layers: 8
Context length: 262,144
Modality: multimodal; text-only serving supported
License
Why it matters

Research highlight
The 4B language model sits inside a roughly 5B resident multimodal artifact, and only 8 gated-attention layers carry KV cache during generation.

Why memory behaves this way

Memory note
The hybrid attention layout keeps KV-cache growth lower than in dense 32-layer models, but the extra resident multimodal weights raise the single-card memory floor.
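The cache-growth claim above can be checked with simple arithmetic. This is a back-of-envelope sketch, not a measurement: the head dimension (128) and BF16 storage (2 bytes per element) are assumptions, while the 4 KV heads, 8 KV-bearing layers, and 262,144 context come from the spec above.

```python
# Rough per-sequence KV-cache size: K and V tensors for each
# KV-bearing layer, each of shape (kv_heads, seq_len, head_dim).
# head_dim=128 and BF16 (2 bytes) are assumed, not from the spec.

def kv_cache_bytes(kv_layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes of KV cache for one sequence."""
    return 2 * kv_layers * kv_heads * head_dim * seq_len * dtype_bytes

full = kv_cache_bytes(kv_layers=8, kv_heads=4, head_dim=128, seq_len=262_144)
print(f"{full / 2**30:.1f} GiB per sequence at full context")  # -> 4.0 GiB

# A dense stack where all 32 layers carry KV would cost 4x as much.
dense = kv_cache_bytes(kv_layers=32, kv_heads=4, head_dim=128, seq_len=262_144)
print(f"{dense / full:.0f}x for a dense 32-layer stack")  # -> 4x
```

Under these assumptions the full-context cache is about 4 GiB per sequence, versus roughly 16 GiB if all 32 layers carried KV, which is why the hybrid layout matters more than the raw parameter count at long context.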
Checkpoints
Official profiles
Official BF16 checkpoint
Qwen publishes Qwen3.5-4B in Hugging Face Transformers format, with explicit guidance for both Transformers and vLLM, including a text-only serving mode in vLLM.
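A text-only vLLM launch along the lines described above might look like the following. This is a hypothetical sketch: the `Qwen/Qwen3.5-4B` model id, the context cap, and the `--limit-mm-per-prompt` flag (which caps multimodal items per request in recent vLLM versions) should all be confirmed against the model card and your installed vLLM release.

```shell
# Hypothetical: serve the checkpoint while disallowing image inputs,
# so only the text path is exercised. Flag syntax varies by vLLM
# version; check `vllm serve --help` before relying on this.
vllm serve Qwen/Qwen3.5-4B \
  --max-model-len 32768 \
  --limit-mm-per-prompt '{"image": 0}'
```

Capping `--max-model-len` below the full 262,144-token window is the usual way to keep the KV-cache reservation within a single card's memory budget.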