FitMyGPU

DeepSeek

DeepSeek R1 Distill Llama 8B

A DeepSeek-R1 distillation onto a Llama dense backbone, giving users a familiar 8B serving shape with stronger reasoning-oriented post-training.

Overview and architecture

What it is

Company

DeepSeek

Family

DeepSeek R1 Distill

Release date

Jan 20, 2025

Architecture

Dense decoder-only transformer

License

MIT + Llama license

Modality

Text

Context window

131,072 tokens

Total params

8B

Active params

8B (dense; all parameters active)

Layers

32

Hidden size

4,096

Attention heads

32

KV heads

8

KV-bearing layers

32
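The geometry in the table above pins down the per-token KV-cache cost. A minimal sketch, assuming an FP16/BF16 cache and the usual convention that head dimension is hidden size divided by attention heads:

```python
# Per-token KV-cache size from the spec table above (illustrative sketch).
layers = 32            # KV-bearing layers
kv_heads = 8           # grouped-query attention: 8 KV heads vs 32 query heads
hidden = 4096
q_heads = 32
head_dim = hidden // q_heads   # 128
bytes_per_elem = 2             # FP16/BF16 cache entries

# Both K and V are cached, per layer, per KV head, per head dimension.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token)      # 131072 bytes = 128 KiB per token
```

With only 8 KV heads instead of 32, grouped-query attention cuts the cache to a quarter of what a full multi-head cache would cost at the same hidden size.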

Research highlight

What improved

R1 reasoning distillation

DeepSeek frames the 8B distill models around one central result: reasoning behavior learned in a very large RL-trained model can be transferred into smaller dense backbones effectively.

Dense models can stay competitive

The release argues that smaller dense checkpoints distilled from R1 can outperform many comparably sized reasoning baselines without needing frontier-scale MoE deployment.

Post-training over architecture change

These models matter because of the R1 distillation pipeline and reasoning data, not because they introduce a new attention or cache architecture.

Training and release context

How it was released

Release lineage

The DeepSeek-R1 distill line is fine-tuned on reasoning samples generated by DeepSeek-R1, rather than trained from scratch as a separate base-model family.

Backbone choice

This checkpoint keeps the serving geometry of Llama 3.1 8B, so its VRAM behavior follows that underlying dense architecture rather than the giant DeepSeek-R1 MoE backbone.

Usage guidance

DeepSeek recommends a moderate sampling temperature, placing all instructions in the user prompt rather than a system prompt, and explicitly steering the model into think-first output for best results.
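As a sketch, that guidance maps onto a chat request payload like the one below. The model identifier is a placeholder, and the 0.5-0.7 temperature range follows DeepSeek's published R1 usage notes:

```python
# Sketch of a chat payload following DeepSeek's R1-distill usage guidance:
# moderate temperature, no system message, and a think-first nudge.
payload = {
    "model": "deepseek-r1-distill-llama-8b",  # placeholder identifier
    "temperature": 0.6,                       # within the recommended 0.5-0.7 range
    "messages": [
        # All instructions live in the user turn; no system prompt is sent.
        {
            "role": "user",
            "content": "Reason step by step, then answer: what is 17 * 24?",
        }
    ],
}

# Sanity check: no system message slipped in.
assert not any(m["role"] == "system" for m in payload["messages"])
```

The same shape works against any OpenAI-compatible serving endpoint; only the model name and transport differ.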

Where it is strong


Reasoning per GPU

The 8B distill line is for users who want much of the R1 reasoning style without deploying the full 671B-class DeepSeek model.

Math and code

DeepSeek emphasizes strong gains on math, code, and structured reasoning benchmarks across the distilled checkpoints.

Normal serving stack

Because these stay on familiar Qwen or Llama dense backbones, they fit standard local inference tooling far more easily than the full R1 architecture.

Memory behavior

What dominates VRAM

Weights plus the KV cache dominate: this is standard Llama-style dense memory behavior with grouped KV heads, so the estimator can treat it as a straightforward long-context dense checkpoint.
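Putting the two dominant terms together gives a rough ceiling. A sketch assuming FP16/BF16 weights (using ~8.03B parameters, the commonly cited Llama 3.1 8B count) and an FP16 cache filled to the full context window:

```python
# Rough VRAM ceiling: weights + KV cache dominate for a dense 8B checkpoint.
params = 8.03e9                # approximate Llama 3.1 8B parameter count
weight_bytes = params * 2      # FP16/BF16 weights, ~15 GiB

layers, kv_heads, head_dim = 32, 8, 128
kv_per_token = 2 * layers * kv_heads * head_dim * 2   # 128 KiB per token

ctx = 131_072                  # full context window
kv_total = ctx * kv_per_token  # 16 GiB at max context

gib = 1024 ** 3
print(f"weights ~{weight_bytes / gib:.1f} GiB, KV cache ~{kv_total / gib:.1f} GiB")
```

At full context the cache slightly exceeds the weights themselves, which is why long-context serving of this checkpoint is cache-bound rather than weight-bound; shorter contexts or a quantized cache shrink that term proportionally.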

Sources

Where this page is grounded