FitMyGPU

Will It Fit?

Estimate text inference VRAM across Transformers and vLLM with a compact, explainable breakdown.

Model pages and release notes are included here too.

21B total • 3.6B active • 128,000 context • 8 KV heads

vLLM estimates use the selected GPU memory utilization. Transformers estimates stay fixed at a 4K-context, single-request baseline.
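
A rough sketch of how the two modes can differ, offered as an illustration rather than the calculator's actual method. The 8 KV heads, the 4K baseline, and the 128,000-token context come from the text above. The layer count, head dimension, weight size, and GPU size are hypothetical placeholders, and the KV formula assumes full attention on every layer with a 2-byte cache dtype.

```python
GIB = 2**30


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   bytes_per_elem=2, batch_size=1):
    """KV cache size for standard attention.

    The leading 2 counts one key tensor plus one value tensor per layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_tokens * bytes_per_elem * batch_size)


def vllm_kv_budget_bytes(gpu_total_bytes, gpu_memory_utilization,
                         weight_bytes, overhead_bytes=0):
    """Memory left for KV cache when the engine may use only a fraction
    of total VRAM, after weights and other overhead are subtracted."""
    budget = gpu_total_bytes * gpu_memory_utilization
    return max(0, int(budget - weight_bytes - overhead_bytes))


# Hypothetical architecture values (NOT taken from the page):
NUM_LAYERS = 24
HEAD_DIM = 64
NUM_KV_HEADS = 8  # from the model summary line

# Transformers-style baseline: one request at a 4K context.
baseline = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 4096)

# Same model at the full 128,000-token context.
full_context = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 128_000)

# vLLM-style budget: hypothetical 24 GiB GPU, 0.9 utilization,
# hypothetical 14 GiB of weights.
kv_budget = vllm_kv_budget_bytes(24 * GIB, 0.9, 14 * GIB)

print(f"4K baseline KV cache:   {baseline / GIB:.2f} GiB")
print(f"128K context KV cache:  {full_context / GIB:.2f} GiB")
print(f"vLLM KV cache budget:   {kv_budget / GIB:.2f} GiB")
```

Under these assumptions the baseline grows with context length, while the vLLM figure is mostly set by the utilization fraction, since the engine reserves that share of VRAM regardless of how many tokens a single request uses.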

Use the calculator first, then go deeper with model pages and release notes when you want architecture context, memory behavior, or newer inference updates.

Model research

About models

Companies

Browse by model source

Latest releases

Model and inference notes