Model notes

Mixtral 8x7B

Sparse MoE model whose runtime compute tracks the pair of routed experts per token, while VRAM still pays for all resident weights.

46.7B total • 12.9B active • 32,768 context • 8 KV heads

Architecture

Model spec

Architecture: Mixture-of-experts transformer
Total params: 46.7B
Active params: 12.9B
Layers: 32
Hidden size: 4,096
Attention heads: 32
KV heads: 8
KV-bearing layers: 32
Context length: 32,768
Modality: Text
License: Apache 2.0

Why it matters

Why memory behaves this way

Research highlight

Sparse expert routing keeps per-token compute proportional to the active parameter count (12.9B) rather than the total parameter count (46.7B).
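The compute gap can be sketched with the common rule of thumb of roughly 2 FLOPs per parameter per token for a decoder forward pass (an approximation; routing overhead and attention FLOPs are ignored):

```python
# Rough per-token forward-pass FLOPs for Mixtral 8x7B, using the
# common ~2 FLOPs/parameter/token approximation for decoder inference.
ACTIVE_PARAMS = 12.9e9   # parameters actually touched per token
TOTAL_PARAMS = 46.7e9    # all resident parameters

flops_sparse = 2 * ACTIVE_PARAMS       # MoE forward pass (2 routed experts)
flops_dense_equiv = 2 * TOTAL_PARAMS   # hypothetical: every expert runs

print(f"sparse: {flops_sparse / 1e9:.1f} GFLOPs/token")
print(f"dense:  {flops_dense_equiv / 1e9:.1f} GFLOPs/token")
print(f"ratio:  {flops_sparse / flops_dense_equiv:.2f}x")
```

Per token, the sparse pass costs about 28% of what a dense pass over all 46.7B parameters would.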

Memory note

Even though only a subset of experts is active per token, single-GPU VRAM must still hold every resident expert's weights.
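A back-of-the-envelope sketch of why, using the figures from the spec card above (weight memory scales with total parameters, KV cache with context length; real runtimes add activation and framework overhead on top):

```python
# Back-of-the-envelope VRAM estimate for Mixtral 8x7B in BF16,
# using the numbers from the spec card above.
GIB = 1024**3

total_params = 46.7e9
bytes_per_param = 2              # BF16

layers = 32                      # all 32 layers are KV-bearing
kv_heads = 8                     # grouped-query attention
head_dim = 4096 // 32            # hidden size / attention heads = 128
context = 32_768
# K and V tensors, per token, across all layers:
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_param

weights_gib = total_params * bytes_per_param / GIB
kv_cache_gib = kv_bytes_per_token * context / GIB

print(f"weights (all experts resident): {weights_gib:.1f} GiB")
print(f"KV cache at full 32k context:   {kv_cache_gib:.1f} GiB")
```

Roughly 87 GiB of weights sit in VRAM regardless of how few experts fire per token, dwarfing the ~4 GiB KV cache even at the full 32,768-token context.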

Checkpoints

Official profiles

BF16 checkpoint (current)

Mistral's official Mixtral 8x7B release is a BF16 checkpoint.

Runtimes: vLLM, Transformers
