Model notes
Llama 3.1 70B
A high-capacity dense Llama model commonly used for long-context inference and fine-tuning work.
70.6B dense • 131,072 context • 8 KV heads
Model spec
Architecture: Dense decoder-only transformer
Total params: 70.6B
Active params: 70.6B (dense; all parameters active per token)
Layers: 80
Hidden size: 8,192
Attention heads: 64
KV heads: 8
KV-bearing layers: 80
Context length: 131,072
Modality: Text
License: Llama 3.1 Community License
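The spec numbers above pin down the per-token KV-cache footprint. A back-of-envelope sketch, assuming a plain BF16 cache (2 bytes per element) with no paging or allocator overhead:

```python
# KV-cache size for Llama 3.1 70B, derived from the spec table above.
layers = 80
kv_heads = 8
hidden = 8192
attn_heads = 64
head_dim = hidden // attn_heads      # 128
bytes_per_elem = 2                   # BF16 cache assumed

# Each layer stores K and V: kv_heads * head_dim values apiece, per token.
kv_bytes_per_token = layers * 2 * kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token / 1024)                 # 320.0 KiB per token

full_context = 131_072
print(kv_bytes_per_token * full_context / 2**30) # 40.0 GiB at full context
```

Grouped-query attention is what keeps this tractable: with 64 KV heads instead of 8, the full-context cache would be 8x larger (320 GiB).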
Why it matters
Research highlight: Large dense transformer with grouped-query attention (64 query heads sharing 8 KV heads) and a 128K-token context design.
Why memory behaves this way: Most of the VRAM goes into the resident dense weights, so weight quantization is the key lever for single-GPU inference.
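To see why quantization is the lever, a rough resident-weight footprint at a few common precisions (illustration only: it ignores activations, the KV cache, and quantization metadata such as scales):

```python
# Approximate weight memory for a 70.6B-parameter dense model.
params = 70.6e9

for name, bits in [("BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    gib = params * bits / 8 / 2**30
    print(f"{name}: {gib:.1f} GiB")
# BF16 weights alone (~131 GiB) exceed any single GPU today,
# while a 4-bit variant (~33 GiB) fits on one 40-48 GB card.
```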
Checkpoints
Official BF16 checkpoint: Meta's official Llama 3.1 70B Instruct release is an open-weight BF16 checkpoint with grouped-query attention. It runs under vLLM and Transformers.
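A minimal serving sketch for the official checkpoint, assuming the Hugging Face repo id `meta-llama/Llama-3.1-70B-Instruct` and a multi-GPU node (a configuration example, not a tuned deployment):

```shell
# Hypothetical invocation: serve the BF16 checkpoint with vLLM,
# sharding the weights across 4 GPUs via tensor parallelism
# and allowing the full 131,072-token context.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 131072
```

At BF16 the weights alone need multiple GPUs; a quantized variant can drop the tensor-parallel degree or fit on a single large card.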