Model notes
Mixtral 8x7B
Sparse mixture-of-experts model: each token is routed to a pair of experts per layer, so per-token compute tracks the 12.9B active parameters, while VRAM must still hold all 46.7B resident weights.
46.7B total • 12.9B active • 32,768 context • 8 KV heads
Architecture
Model spec
Architecture: Mixture-of-experts transformer
Total params: 46.7B
Active params: 12.9B
Layers: 32
Hidden size: 4,096
Attention heads: 32
KV heads: 8
KV-bearing layers: 32
Context length: 32,768
Modality: Text
License: Apache 2.0
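The spec numbers above allow a back-of-envelope KV-cache estimate. This sketch assumes a per-head dimension of hidden size / attention heads (4,096 / 32 = 128) and a BF16 (2-byte) cache; real serving stacks add further overhead.

```python
# Back-of-envelope KV-cache sizing from the spec above.
# Assumes head_dim = hidden_size / attention_heads and a BF16 (2-byte) cache.

LAYERS = 32             # KV-bearing layers
KV_HEADS = 8            # grouped-query attention: 8 KV heads, not 32
HEAD_DIM = 4096 // 32   # hidden size / attention heads = 128
BYTES = 2               # BF16
CONTEXT = 32_768

# K and V each store KV_HEADS * HEAD_DIM values per layer per token.
per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES
full_context = per_token * CONTEXT

print(f"KV cache per token: {per_token / 1024:.0f} KiB")            # 128 KiB
print(f"KV cache at full context: {full_context / 2**30:.0f} GiB")  # 4 GiB
```

With only 8 KV heads instead of 32, the cache is a quarter of what full multi-head attention would need at the same context length.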
Why it matters
Why memory behaves this way
Research highlight
Sparse expert routing keeps per-token compute closer to active experts than to total parameters.
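The routing idea can be illustrated with a toy sketch (not Mixtral's actual implementation): a learned gate scores all experts, only the top-2 run, and their outputs are mixed with softmax weights. Dimensions and weights here are made up so it runs instantly.

```python
# Toy top-2 expert routing: compute scales with the selected experts,
# not with the total expert count. All sizes are illustrative, not Mixtral's.
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, DIM = 8, 2, 16   # toy dims, not the real 4,096 hidden size

gate_w = rng.standard_normal((N_EXPERTS, DIM))
expert_ws = [rng.standard_normal((DIM, DIM)) for _ in range(N_EXPERTS)]

def moe_layer(x):
    """Route one token vector to its top-2 experts and mix their outputs."""
    logits = gate_w @ x                    # one gating score per expert
    top = np.argsort(logits)[-TOP_K:]      # indices of the 2 best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                           # softmax over the selected pair only
    # Only the chosen experts execute; the other 6 contribute no FLOPs.
    return sum(wi * (expert_ws[i] @ x) for wi, i in zip(w, top))

x = rng.standard_normal(DIM)
y = moe_layer(x)
print(y.shape)   # (16,)
```

All 8 expert weight matrices exist in memory, but each forward pass multiplies through only 2 of them, which is the compute/VRAM asymmetry described above.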
Memory note
Even though only a subset of experts is active per token, a single-GPU deployment must still hold every expert's weights resident in VRAM.
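The memory note can be quantified with a rough sketch, assuming BF16 weights (2 bytes per parameter) and ignoring activations, KV cache, and framework overhead:

```python
# Rough weights-only footprint: all 46.7B parameters stay resident in VRAM,
# even though only ~12.9B participate in any single token's forward pass.
# Assumes BF16 (2 bytes/param); activations and KV cache come on top.

TOTAL_PARAMS = 46.7e9
ACTIVE_PARAMS = 12.9e9
BYTES_PER_PARAM = 2

resident_gib = TOTAL_PARAMS * BYTES_PER_PARAM / 2**30
active_gib = ACTIVE_PARAMS * BYTES_PER_PARAM / 2**30

print(f"Resident weights: ~{resident_gib:.0f} GiB")   # ~87 GiB
print(f"Active per token: ~{active_gib:.0f} GiB")     # ~24 GiB
```

So the VRAM bill is roughly 87 GiB of weights even though each token only touches about 24 GiB worth of them.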
Checkpoints
Official profiles
Official BF16 checkpoint
Mistral's official Mixtral 8x7B release is a BF16 checkpoint, loadable with vLLM and Transformers.