Will It Fit?

Estimate text inference VRAM across Transformers and vLLM with a compact, explainable breakdown.

Model pages and release notes are included here too.

vLLM estimates use the selected GPU memory utilization. Transformers estimates stay fixed at a 4K-token, single-request baseline.
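As a rough illustration of why the two runtimes are estimated differently, the sketch below assumes vLLM's budget is the card's memory multiplied by its `gpu_memory_utilization` setting, and that whatever remains after weights and a runtime overhead is available for KV cache. The helper name, the 2 GiB overhead, and the 24 GiB card are hypothetical; the layer count and head dimension for Llama 3.1 8B come from its public config, not from this page. This is not the calculator's actual formula.

```python
GIB = 1024 ** 3


def vllm_kv_token_capacity(gpu_bytes, utilization, weight_bytes,
                           overhead_bytes, kv_bytes_per_token):
    """Hypothetical helper: tokens of KV cache that fit in vLLM's share of the GPU."""
    budget = int(gpu_bytes * utilization)          # memory vLLM is allowed to claim
    kv_pool = budget - weight_bytes - overhead_bytes
    if kv_pool <= 0:
        return 0                                   # weights alone exceed the budget
    return kv_pool // kv_bytes_per_token


# Assumed example: Llama 3.1 8B with 16-bit weights and KV cache.
# Keys and values per token: 2 * 32 layers * 8 KV heads * 128 head dim * 2 bytes.
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2
WEIGHT_BYTES = 8_000_000_000 * 2                   # 8B parameters at 2 bytes each
OVERHEAD_BYTES = 2 * GIB                           # assumed runtime overhead

# vLLM: capacity depends on the utilization setting (0.90 on an assumed 24 GiB card).
capacity = vllm_kv_token_capacity(24 * GIB, 0.90, WEIGHT_BYTES,
                                  OVERHEAD_BYTES, KV_BYTES_PER_TOKEN)

# Transformers baseline: one request at a fixed 4K (4,096-token) context.
baseline_kv_bytes = 4096 * KV_BYTES_PER_TOKEN

print(f"vLLM KV capacity: {capacity:,} tokens")
print(f"Transformers 4K baseline KV cache: {baseline_kv_bytes / GIB:.2f} GiB")
```

Raising or lowering the utilization value changes the vLLM token capacity directly, while the Transformers figure depends only on the fixed 4K context.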

Use the calculator first, then go deeper with model pages and release notes when you want architecture context, memory behavior, or newer inference updates.

Model research

About models

OpenAI

GPT-OSS 20B

Smaller GPT-OSS release for general-purpose and reasoning use cases that need to stay within a much lighter single-card memory budget.

21B total • 3.6B active • 128,000 context • 8 KV heads

OpenAI

GPT-OSS 120B

Production GPT-OSS release for general-purpose and higher-reasoning workloads that can fit on a single 80 GB-class GPU.

117B total • 5.1B active • 128,000 context • 8 KV heads

Meta Llama

Llama 3.1 8B

Compact dense Llama model with grouped-query attention and a 128K context window.

8B dense • 131,072 context • 8 KV heads

Meta Llama

Llama 3.1 70B

High-capacity dense Llama model that is common in serious long-context inference and fine-tuning work.

70.6B dense • 131,072 context • 8 KV heads

Qwen

Qwen 2.5 0.5B

Instruction-tuned 0.5B Qwen2.5 model for lightweight assistant, structured-output, and long-prompt use in very small dense deployments.

490M dense • 32,768 context • 2 KV heads

Qwen

Qwen 2.5 1.5B

Instruction-tuned 1.5B Qwen2.5 model for lightweight coding, math, structured-output, and assistant tasks in a small dense deployment footprint.

1.5B dense • 32,768 context • 2 KV heads

Companies

Model and inference notes

2026-05-10

How inference VRAM is calculated

A practical breakdown of the terms that usually matter for inference memory: weights, KV cache, state, and runtime reserve.
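The four terms can be added up as in the minimal sketch below. It assumes a standard attention KV cache (keys plus values for every layer), treats `state_bytes` as a placeholder for any non-attention state (zero for a pure Transformer), and uses an arbitrary 1 GiB reserve. The class, function names, and defaults are hypothetical and are not taken from the calculator; the Llama 3.1 70B layer count and head dimension come from its public config.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    params: float   # total parameter count
    layers: int     # number of transformer layers
    kv_heads: int   # key/value heads per layer
    head_dim: int   # dimension of each attention head


def kv_cache_bytes(m, tokens, batch=1, bytes_per_elem=2):
    # Factor of 2 covers keys and values, stored per layer, per KV head, per token.
    return 2 * m.layers * m.kv_heads * m.head_dim * tokens * batch * bytes_per_elem


def estimate_vram(m, tokens, batch=1, bytes_per_param=2,
                  state_bytes=0, reserve_bytes=1024 ** 3):
    """Return a per-term breakdown in bytes (illustrative, not the site's formula)."""
    parts = {
        "weights": int(m.params * bytes_per_param),
        "kv_cache": kv_cache_bytes(m, tokens, batch),
        "state": state_bytes,        # assumed zero for attention-only models
        "reserve": reserve_bytes,    # assumed fixed runtime reserve
    }
    parts["total"] = sum(parts.values())
    return parts


# Assumed example: Llama 3.1 70B, 16-bit weights, one 4,096-token request.
llama_70b = ModelShape(params=70.6e9, layers=80, kv_heads=8, head_dim=128)
breakdown = estimate_vram(llama_70b, tokens=4096)

for name, size in breakdown.items():
    print(f"{name:>9}: {size / 1024 ** 3:8.2f} GiB")
```

With only 8 KV heads, the cache for a single 4K request is small next to the weights; it becomes the dominant term as context length or batch size grows.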

Browse by model source

OpenAI

GLM

MiniMax

Kimi

Mistral

NVIDIA

Meta Llama

Gemma
