FitMyGPU

DeepSeek

DeepSeek R1 Distill Qwen 7B

Mid-sized DeepSeek R1 distill on a Qwen dense backbone, aimed at practical local reasoning without the huge footprint of the full R1 model.

Overview and architecture

What it is

Company

DeepSeek

Family

DeepSeek R1 Distill

Release date

Jan 20, 2025

Architecture

Dense decoder-only transformer

License

MIT (distill weights); Apache 2.0 (Qwen2.5 base)

Modality

Text

Context window

131,072 tokens

Total params

7B

Active params

Dense model

Layers

28

Hidden size

3,584

Attention heads

28

KV heads

4

KV-bearing layers

28
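The attention geometry above pins down the KV-cache cost directly. A minimal sketch of the arithmetic, assuming FP16 cache and a head dimension inferred as hidden size / attention heads (the table does not list head dim explicitly):

```python
# Per-token KV-cache size from the spec table above (FP16 assumed).
HIDDEN_SIZE = 3584
ATTN_HEADS = 28
KV_HEADS = 4          # GQA: 4 KV heads shared across 28 query heads
KV_LAYERS = 28        # all 28 layers carry a KV cache
BYTES_FP16 = 2

head_dim = HIDDEN_SIZE // ATTN_HEADS  # 128 (inferred, not listed in the table)

# K and V caches, per token, across all KV-bearing layers.
kv_bytes_per_token = 2 * KV_LAYERS * KV_HEADS * head_dim * BYTES_FP16
print(kv_bytes_per_token)  # 57344 bytes = 56 KiB per token

# Full 131,072-token context window:
full_context_gib = kv_bytes_per_token * 131_072 / 2**30
print(full_context_gib)    # 7.0 GiB at FP16
```

So even with GQA trimming the cache by 7x versus full multi-head attention, the maximum context costs about as much VRAM as the quantized weights themselves.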

Research highlight

What improved

R1 reasoning distillation

DeepSeek frames the 7B distill models around one central result: reasoning behavior learned by a very large RL-trained model can be transferred effectively into much smaller dense backbones via distillation.

Dense models can stay competitive

The release argues that smaller dense checkpoints distilled from R1 can outperform many comparably sized reasoning baselines without needing frontier-scale MoE deployment.

Post-training over architecture change

These models matter because of the R1 distillation pipeline and reasoning data, not because they introduce a new attention or cache architecture.

Training and release context

How it was released

Release lineage

The DeepSeek-R1 distill line is derived from samples generated by DeepSeek-R1 rather than trained as a separate base-model family from scratch.

Backbone choice

This checkpoint keeps the serving geometry of Qwen2.5-Math-7B, so its VRAM behavior follows that underlying dense architecture rather than the giant DeepSeek-R1 MoE backbone.

Usage guidance

DeepSeek recommends a temperature around 0.6 (0.5–0.7), no system prompt (all instructions go in the user turn), and explicitly steering the model into think-first behavior, e.g. by having its output begin with a `<think>` block.
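A sketch of what that guidance looks like as a request, assuming an OpenAI-compatible endpoint; the model name is a placeholder, and the step-by-step phrasing follows the model card's suggestion for math-style prompts:

```python
# Build a request following the distill models' usage guidance:
# no system message, temperature ~0.6, instructions inside the user turn.
def build_request(question: str) -> dict:
    return {
        "model": "deepseek-r1-distill-qwen-7b",  # placeholder model name
        "temperature": 0.6,                      # recommended range 0.5-0.7
        "messages": [
            # No system prompt: everything goes in the single user message,
            # nudging the model to reason before answering.
            {"role": "user",
             "content": f"{question}\nPlease reason step by step, "
                        "and put your final answer within \\boxed{}."},
        ],
    }

req = build_request("What is 17 * 24?")
print(req["temperature"])
```

The payload can then be sent to any OpenAI-compatible serving stack (vLLM, llama.cpp server, etc.) without special handling.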

Where it is strong

Reasoning per GPU

The 7B distill line is for users who want much of the R1 reasoning style without deploying the full 671B-class DeepSeek model.

Math and code

DeepSeek emphasizes strong gains on math, code, and structured reasoning benchmarks across the distilled checkpoints.

Normal serving stack

Because these stay on familiar Qwen or Llama dense backbones, they fit standard local inference tooling far more easily than the full R1 architecture.

Memory behavior

What dominates VRAM

This is still standard dense GQA memory behavior, so runtime choice and context length remain the main VRAM levers after the 7B weight floor is paid.
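Those two levers can be combined into a rough estimate. A sketch assuming weights plus KV cache dominate, ignoring activations and runtime overhead (the 0.55 bytes/param figure for ~4-bit quantization is an illustrative assumption):

```python
# Rough VRAM estimate for this dense 7B GQA model: weight floor + KV cache.
def vram_estimate_gib(params_b=7.0, weight_bytes=2.0, context=8192,
                      kv_layers=28, kv_heads=4, head_dim=128, kv_bytes=2):
    weights = params_b * 1e9 * weight_bytes           # weight floor
    kv = 2 * kv_layers * kv_heads * head_dim * kv_bytes * context  # K + V
    return (weights + kv) / 2**30

# FP16 weights, 8K context: ~13.5 GiB.
print(round(vram_estimate_gib(), 1))

# ~4-bit quantized weights (assumed ~0.55 bytes/param), same context.
print(round(vram_estimate_gib(weight_bytes=0.55), 1))
```

Past the weight floor, the estimate scales linearly with context length, which is why runtime choice (cache dtype, quantization) and context budget are the knobs that matter.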

Sources

Where this page is grounded