Will It Fit?
Estimate text inference VRAM across Transformers and vLLM with a compact, explainable breakdown.
Model pages and release notes are included here too.
Use the calculator first, then go deeper with model pages and release notes when you want architecture context, memory behavior, or newer inference updates.
Model research
OpenAI
Smaller GPT-OSS release for general-purpose and reasoning use cases that need to stay within a much lighter single-card memory budget.
21B total • 3.6B active • 128,000 context • 8 KV heads
OpenAI
Production GPT-OSS release for general-purpose and higher-reasoning workloads that can fit on a single 80 GB-class GPU.
117B total • 5.1B active • 128,000 context • 8 KV heads
Meta Llama
Compact dense Llama model with grouped-query attention and a 128K context window.
8B dense • 131,072 context • 8 KV heads
Meta Llama
High-capacity dense Llama model that is commonly used for long-context inference and fine-tuning work.
70.6B dense • 131,072 context • 8 KV heads
Qwen
Instruction-tuned 0.5B Qwen2.5 model for lightweight assistant, structured-output, and long-prompt use in very small dense deployments.
490M dense • 32,768 context • 2 KV heads
Qwen
Instruction-tuned 1.5B Qwen2.5 model for lightweight coding, math, structured-output, and assistant tasks in a small dense deployment footprint.
1.5B dense • 32,768 context • 2 KV heads
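The spec lines on these cards (parameter count, context length, KV heads) are the main inputs to a back-of-the-envelope VRAM estimate. The sketch below is not the calculator's actual method; it is a minimal illustration that assumes 16-bit weights and a 16-bit KV cache, and ignores activations, runtime overhead, and allocator behavior. The helper names `weight_bytes` and `kv_cache_bytes` are hypothetical. The layer count (32) and head dimension (128) used for the 8B Llama card are not listed on the card; they are taken here as assumptions about that model's published configuration.

```python
GIB = 1024 ** 3


def weight_bytes(total_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for model weights: one value per parameter."""
    return total_params * bytes_per_param


def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,
) -> int:
    """Memory for a full KV cache: a key and a value vector per layer,
    per KV head, per token, per sequence in the batch."""
    return (
        2  # keys and values
        * num_layers
        * num_kv_heads
        * head_dim
        * context_tokens
        * batch_size
        * bytes_per_elem
    )


# Card values: 8B dense, 131,072 context, 8 KV heads.
# Assumed (not on the card): 32 layers, head dimension 128.
weights = weight_bytes(8e9)
kv = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, context_tokens=131_072)

print(f"weights:  {weights / GIB:.1f} GiB")
print(f"KV cache: {kv / GIB:.1f} GiB at full context")
print(f"total:    {(weights + kv) / GIB:.1f} GiB before activations and overhead")
```

Under these assumptions, a single full-length sequence needs a KV cache of about the same size as the 16-bit weights, which is why the KV head count appears on every card. For the MoE cards, weight memory scales with total rather than active parameters, since every expert has to be resident.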
Companies
Open-weight and hosted model releases with strong inference interest and frequent runtime discussion.
Zhipu / GLM model family coverage placeholder for future registry and research updates.
MiniMax model family coverage placeholder for future estimator support and release notes.
Moonshot / Kimi coverage placeholder for future long-context and hybrid-architecture notes.
Dense and MoE Mistral-family models commonly used in open inference stacks.
Nemotron and related inference-oriented open models, often tied closely to deployment runtimes.
Llama-family dense transformer models that set a practical baseline for open inference planning.
Google Gemma coverage placeholder for future long-context and compact-model releases.
Latest releases