Model notes

Llama 3.1 70B

High-capacity dense Llama model that is common in serious long-context inference and fine-tuning work.

70.6B dense • 131,072 context • 8 KV heads

Architecture

Model spec

Architecture: Dense decoder-only transformer
Total params: 70.6B
Active params: 70.6B (dense; all parameters are active for every token)
Layers: 80
Hidden size: 8,192
Attention heads: 64
KV heads: 8
KV-bearing layers: 80
Context length: 131,072 tokens
Modality: Text
License: Llama 3.1 Community License
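The KV-head count in the spec above is what keeps long-context serving feasible: with grouped-query attention, the cache stores only the 8 KV heads, not all 64 query heads. A minimal sketch of the cache arithmetic, assuming a BF16 (2-byte) cache and ignoring framework overhead such as paged-allocation slack:

```python
# KV-cache size estimate from the spec table (a sketch, not a runtime
# measurement; real servers add allocator and framework overhead).
layers = 80                        # KV-bearing layers
kv_heads = 8                       # grouped-query attention
hidden = 8192
attn_heads = 64
head_dim = hidden // attn_heads    # 128
bytes_per_elem = 2                 # BF16/FP16 cache

# K and V tensors per token, summed over all KV-bearing layers
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token)          # 327680 bytes = 320 KiB per token

context = 131_072
print(kv_bytes_per_token * context / 2**30)  # 40.0 GiB at full context
```

With the full 64 heads cached (standard multi-head attention) the same context would need 8x as much, which is why the 8-KV-head design matters at 131K tokens.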

Why it matters

Why memory behaves this way

Research highlight: A large dense transformer with grouped-query attention (8 KV heads serving 64 query heads) and a 128K-token context design.

Memory note: Most of the VRAM goes to resident dense weights, so quantization is the key lever for single-GPU inference.
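To see why quantization is the dominant lever, a back-of-the-envelope weight footprint at common precisions (weights only; activations, KV cache, and per-format scale/zero-point metadata come on top):

```python
# Resident-weight footprint for a 70.6B dense model at different
# precisions. A rough estimate: real quantized checkpoints carry some
# extra metadata (scales, zero points), so actual files run slightly larger.
params = 70.6e9

for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = params * bits / 8 / 1e9   # bits -> bytes -> GB (decimal)
    print(f"{name}: {gb:.1f} GB")
# BF16: 141.2 GB, INT8: 70.6 GB, INT4: 35.3 GB
```

At BF16 the weights alone exceed any single GPU, while a 4-bit quantization brings them near the capacity of one 40-48 GB card, before counting the KV cache.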

Checkpoints

Official profiles

Official BF16 checkpoint (current)

Meta's official Llama 3.1 70B Instruct release is a BF16 checkpoint with grouped-query attention. It is an open checkpoint supported by vLLM and Transformers.
