How We Estimate LLM Hardware Compatibility
The Formula
Every "can I run this model on this hardware?" answer on this site is computed using the same formula — no guesswork, no marketing. Here's exactly what we calculate:
- Reserve memory for the OS/driver. We subtract a fixed overhead from total memory depending on hardware class:
- Discrete GPUs: 0.8 GB (display buffer + driver overhead)
- Apple Silicon (unified memory): 4 GB (macOS + display)
- System RAM: 6 GB (OS + background processes)
- Estimate KV cache memory. The KV cache stores attention keys and values during inference. We estimate it at 15% of the model's storage size at default context length. This is a heuristic — real KV cache usage varies with context length, batch size, and architecture.
- Compute total required memory: model size + KV cache estimate.
- Check fit: total required ≤ usable memory (total minus reserve).
- Estimate speed: memory bandwidth (GB/s) ÷ active model size (GB) = tokens per second. This is a memory-bandwidth-bound estimate: generating each token requires reading the active weights from memory once, so for most local LLM inference the bottleneck is how fast the weights can be read, not the GPU's compute.
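The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual code: the reserve constants and the 15% KV cache fraction come from the list above, while the function name, the hardware-class keys, and the example numbers are assumptions made for the sketch.

```python
from dataclasses import dataclass

# Reserve memory (GB) per hardware class, as listed in the steps above.
# The dictionary keys are hypothetical labels chosen for this sketch.
RESERVE_GB = {
    "discrete_gpu": 0.8,
    "apple_silicon": 4.0,
    "system_ram": 6.0,
}
KV_CACHE_FRACTION = 0.15  # heuristic share of model size at default context


@dataclass
class Estimate:
    usable_gb: float
    required_gb: float
    fits: bool
    tokens_per_second: float


def estimate(total_memory_gb, bandwidth_gb_s, model_size_gb,
             hardware_class, active_size_gb=None):
    """Apply the fit check and the bandwidth-bound speed estimate."""
    # Dense models read every weight per token, so active size = model size.
    if active_size_gb is None:
        active_size_gb = model_size_gb
    usable = total_memory_gb - RESERVE_GB[hardware_class]
    required = model_size_gb + KV_CACHE_FRACTION * model_size_gb
    return Estimate(
        usable_gb=usable,
        required_gb=required,
        fits=required <= usable,
        tokens_per_second=bandwidth_gb_s / active_size_gb,
    )


# Hypothetical example: a 24 GB discrete GPU with 1000 GB/s bandwidth
# and a 20 GB quantized model file.
result = estimate(24.0, 1000.0, 20.0, "discrete_gpu")
print(result)
```

In this hypothetical case the model needs about 23 GB against roughly 23.2 GB usable, so it fits with little headroom, and the speed estimate is 1000 ÷ 20 = 50 tokens per second.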
MoE Models: Why "Active Parameters" Matter
Mixture-of-Experts (MoE) models like Mixtral 8x7B, DeepSeek-R1, and Qwen3 MoE have more total parameters than they activate at any given time. For example, Mixtral 8x7B has about 47B total parameters, with 8 experts in each layer, but only 2 experts per layer (~13B parameters in total) are used for each token. This means:
- Storage is driven by total parameters — you need enough memory to hold the entire model.
- Speed is driven by active parameters — each token only reads the activated experts from memory.
Our speed estimates use active_params_billion for MoE models, which is why Mixtral 8x7B shows dramatically higher tok/s than a dense 47B model would: it's genuinely faster in practice.
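To make the difference concrete, here is a small sketch comparing a dense model with an MoE model of the same file size. The 47B and 13B parameter counts come from the Mixtral example above. Everything else is an assumption for illustration: the bandwidth and file size are placeholder values, total_params_billion is a hypothetical field name, and the active size is approximated by scaling the file size by the share of active parameters.

```python
def tokens_per_second(bandwidth_gb_s, size_read_per_token_gb):
    """Bandwidth-bound estimate: bytes available per second / bytes read per token."""
    return bandwidth_gb_s / size_read_per_token_gb


def active_size_gb(model_size_gb, active_params_billion, total_params_billion):
    """Approximate the bytes read per token for an MoE model.

    Assumption: all parameters are stored at the same precision, so the
    bytes read scale with the share of parameters that are active.
    """
    return model_size_gb * active_params_billion / total_params_billion


BANDWIDTH_GB_S = 1000.0  # hypothetical hardware bandwidth
MODEL_SIZE_GB = 26.0     # illustrative quantized file size, not a measurement

# A dense 47B model reads the whole file for every token.
dense_tps = tokens_per_second(BANDWIDTH_GB_S, MODEL_SIZE_GB)

# An MoE model with 13B of 47B parameters active reads only that share.
moe_tps = tokens_per_second(
    BANDWIDTH_GB_S, active_size_gb(MODEL_SIZE_GB, 13.0, 47.0)
)

print(f"dense: {dense_tps:.1f} tok/s, MoE: {moe_tps:.1f} tok/s")
print(f"speedup: {moe_tps / dense_tps:.2f}x")
```

Under these assumptions the speedup is simply the ratio of total to active parameters, 47 ÷ 13, or about 3.6×, regardless of the bandwidth or file size chosen. Both models still need the full file size to pass the fit check.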
Known Simplifications
- Fixed reserve memory: real OS/driver overhead varies by display configuration, background applications, and OS version. Our constants are starting assumptions; community benchmarks are the best source of ground truth at the margin.
- KV cache as a flat percentage: in reality, KV cache grows linearly with context length and varies by model architecture. Our 15% estimate is for default (~8K) context. Longer contexts need more memory.
- Memory bandwidth as the sole speed driver: this is accurate for most local LLM inference (which is memory-bandwidth-bound), but compute-bound scenarios (very small models, high batch sizes, prompt processing) may differ.
- No multi-GPU modeling: we only model single-GPU configurations for now. Multi-GPU setups (2× RTX 4090, etc.) are on the roadmap.
- Inference backends differ: llama.cpp, Ollama, LM Studio, vLLM, and other inference engines have different memory footprints and performance characteristics even for the same model+hardware pair. Our estimates are backend-agnostic; community benchmarks are tagged with the software used.
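The KV cache simplification is the easiest one to illustrate. The sketch below contrasts the flat 15% heuristic with the commonly used per-architecture calculation (one key and one value tensor per layer). It is an illustration under stated assumptions, not the site's method: the default context is taken to be exactly 8192 tokens, scaling the heuristic linearly to other context lengths is an extrapolation, and the layer, head, and dimension values are made up for the example.

```python
DEFAULT_CONTEXT = 8192    # assumption: "~8K" taken as exactly 8192 tokens
KV_CACHE_FRACTION = 0.15  # flat heuristic at default context


def kv_cache_heuristic_gb(model_size_gb, context_len=DEFAULT_CONTEXT):
    """Flat-percentage estimate, scaled linearly with context length.

    The linear scaling beyond the default context is an extrapolation
    made for this sketch.
    """
    return KV_CACHE_FRACTION * model_size_gb * context_len / DEFAULT_CONTEXT


def kv_cache_from_architecture_gb(n_layers, n_kv_heads, head_dim,
                                  context_len, bytes_per_value=2,
                                  batch_size=1):
    """Per-architecture estimate for a standard transformer.

    The factor 2 counts one key and one value vector per token, per layer,
    per KV head. bytes_per_value=2 corresponds to a 16-bit cache.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_len * bytes_per_value * batch_size)
    return total_bytes / 1e9


# Heuristic for a 20 GB model at default and at 4x the default context.
print(kv_cache_heuristic_gb(20.0))
print(kv_cache_heuristic_gb(20.0, context_len=32768))

# Architecture-based figure for illustrative values:
# 32 layers, 8 KV heads, head dimension 128, 8192-token context.
print(kv_cache_from_architecture_gb(32, 8, 128, 8192))
```

The two approaches can disagree noticeably for the same model: the heuristic depends only on file size, while the architecture-based figure depends on layer count, the number of KV heads, cache precision, and batch size.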
Data Sources
- Hardware specs: TechPowerUp GPU Database, NVIDIA official specs, AMD product pages, Apple official tech specs. Each hardware row links to its source.
- Model parameters: HuggingFace model cards, official model release blogs and technical reports. Each model row links to its HuggingFace page.
- Quantization sizes: Cross-referenced from actual GGUF file sizes published by the llama.cpp community (Bartowski, TheBloke, and others) and the llama.cpp GitHub releases.
- Community benchmarks: Submitted by users through our submission form. Each approved benchmark includes the software used and is displayed alongside estimated numbers.
Benchmark Submissions
Community-submitted benchmarks are the single most valuable data on this site — they're the one thing no AI summarizer or spec-sheet mirror can replicate. When you submit a benchmark, it goes into a review queue. Once approved, it appears on the relevant comparison page alongside the estimated number. Multiple benchmarks for the same combo are aggregated (median).
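A rough sketch of the median aggregation, assuming each submission is a dictionary with hypothetical field names (model, hardware, tokens_per_second, approved) and that results are grouped by model and hardware only. The real review and grouping logic may differ.

```python
from collections import defaultdict
from statistics import median


def aggregate_benchmarks(submissions):
    """Return the median tok/s per (model, hardware) combo.

    Only approved submissions are counted; field names are assumptions
    made for this sketch.
    """
    grouped = defaultdict(list)
    for sub in submissions:
        if sub["approved"]:
            combo = (sub["model"], sub["hardware"])
            grouped[combo].append(sub["tokens_per_second"])
    return {combo: median(values) for combo, values in grouped.items()}


# Made-up submissions for illustration only.
submissions = [
    {"model": "model-a", "hardware": "gpu-x", "tokens_per_second": 40.0, "approved": True},
    {"model": "model-a", "hardware": "gpu-x", "tokens_per_second": 50.0, "approved": True},
    {"model": "model-a", "hardware": "gpu-x", "tokens_per_second": 90.0, "approved": True},
    {"model": "model-a", "hardware": "gpu-x", "tokens_per_second": 500.0, "approved": False},
]
print(aggregate_benchmarks(submissions))
```

The median is less sensitive than the mean to a single outlier submission: in this made-up data the 90 tok/s result does not pull the aggregate above the middle value of 50.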