What quantization should I use for an NVIDIA A100 SXM 80GB?

With 80GB, Q4_K_M is the community default. Q6_K or Q8_0 are options if quality matters more and the model fits.

How much VRAM does Yi 34B need?

Yi 34B has 34.39B parameters. At Q8_0 (8-bit), it requires approximately 36.5GB of storage plus ~5.5GB for KV cache, totaling roughly 42GB.

How We Estimate LLM Hardware Compatibility

The Formula

Every "can I run this model on this hardware?" answer on this site is computed using the same formula — no guesswork, no marketing. Here's exactly what we calculate:

Reserve memory for the OS/driver. We subtract a fixed overhead from total memory depending on hardware class:
- Discrete GPUs: 0.8 GB (display buffer + driver overhead)
- Apple Silicon (unified memory): 4 GB (macOS + display)
- System RAM: 6 GB (OS + background processes)
Estimate KV cache memory. The KV cache stores attention keys and values during inference. We estimate it at 15% of the model's storage size at default context length. This is a heuristic — real KV cache usage varies with context length, batch size, and architecture.
Compute total required memory: model size + KV cache estimate.
Check fit: total required ≤ usable memory (total minus reserve).
Estimate speed: memory bandwidth (GB/s) ÷ active model size (GB) = tokens per second. This is a memory-bandwidth-bound estimate — for most local LLM inference, the GPU's compute is not the bottleneck; how fast you can read the model weights from VRAM is.
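
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual code: the function and field names are hypothetical, and the 2039 GB/s bandwidth figure for the example GPU is an assumption used only to produce a number.

```python
from dataclasses import dataclass

# Reserve constants from the list above, in GB.
RESERVE_GB = {"discrete_gpu": 0.8, "apple_silicon": 4.0, "system_ram": 6.0}

# KV cache heuristic: 15% of model storage size at default context length.
KV_CACHE_FRACTION = 0.15


@dataclass
class Estimate:
    required_gb: float
    usable_gb: float
    fits: bool
    tokens_per_second: float


def estimate(total_memory_gb, hardware_class, model_size_gb,
             bandwidth_gb_s, active_size_gb=None):
    """Apply the reserve, KV cache, fit, and speed steps in order."""
    usable = total_memory_gb - RESERVE_GB[hardware_class]
    kv_cache = KV_CACHE_FRACTION * model_size_gb
    required = model_size_gb + kv_cache
    # Dense models read the whole model per token; MoE callers can
    # pass a smaller active size instead.
    active = model_size_gb if active_size_gb is None else active_size_gb
    return Estimate(
        required_gb=required,
        usable_gb=usable,
        fits=required <= usable,
        tokens_per_second=bandwidth_gb_s / active,
    )


# Yi 34B at Q8_0 (36.5 GB) on an 80 GB discrete GPU.
# The bandwidth value is an assumed figure for illustration.
result = estimate(80.0, "discrete_gpu", 36.5, 2039.0)
print(result)
```

With these inputs the required memory comes to about 42 GB against 79.2 GB usable, so the model fits.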

MoE Models: Why "Active Parameters" Matter

Mixture-of-Experts (MoE) models like Mixtral 8x7B, DeepSeek-R1, and Qwen3 MoE have more total parameters than they activate at any given time. For example, Mixtral 8x7B has 47B total parameters spread across 8 experts, but only 2 experts (~13B parameters) are used per token. This means:

Storage is driven by total parameters — you need enough memory to hold the entire model.
Speed is driven by active parameters — each token only reads the activated experts from memory.

Our speed estimates use active_params_billion for MoE models, which is why Mixtral 8x7B shows dramatically higher tok/s than a dense 47B model would — it's genuinely faster in practice.
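
As a rough sketch of that difference, the snippet below compares a dense speed estimate with an MoE one. It assumes the active size can be approximated by scaling the file size by the ratio of active to total parameters; that scaling rule, the function names, the 26 GB file size, and the 1000 GB/s bandwidth are all illustrative assumptions rather than values from this site.

```python
def dense_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Dense model: every token reads the full set of weights."""
    return bandwidth_gb_s / model_size_gb


def moe_tokens_per_second(bandwidth_gb_s, model_size_gb,
                          total_params_billion, active_params_billion):
    """MoE model: assume bytes read per token scale with active parameters."""
    active_size_gb = model_size_gb * active_params_billion / total_params_billion
    return bandwidth_gb_s / active_size_gb


# Hypothetical inputs: a 26 GB quantized file on 1000 GB/s hardware,
# with Mixtral-like parameter counts (47B total, ~13B active).
dense = dense_tokens_per_second(1000.0, 26.0)
moe = moe_tokens_per_second(1000.0, 26.0, 47.0, 13.0)
print(f"dense: {dense:.1f} tok/s, MoE: {moe:.1f} tok/s")
```

Under this assumption the MoE estimate is higher by the factor total / active parameters, while the memory needed to hold the model is unchanged.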

Known Simplifications

Fixed reserve memory — real OS/driver overhead varies by display configuration, background applications, and OS version. Our constants are starting assumptions; community benchmarks are the best source of ground-truth at the margin.
KV cache as a flat percentage — in reality, KV cache grows linearly with context length and varies by model architecture. Our 15% estimate is for default (~8K) context. Longer contexts need more memory.
Memory bandwidth as the sole speed driver — this is accurate for most local LLM inference (which is memory-bandwidth-bound), but compute-bound scenarios (very small models, high batch sizes, prompt processing) may differ.
No multi-GPU modeling — we only model single-GPU configurations for now. Multi-GPU setups (2× RTX 4090, etc.) are on the roadmap.
Inference backends differ: llama.cpp, Ollama, LM Studio, vLLM, and other inference engines have different memory footprints and performance characteristics even for the same model+hardware pair. Our estimates are backend-agnostic; community benchmarks are tagged with the software used.
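
To illustrate why the flat KV cache percentage is a simplification, here is the commonly used per-token sizing formula for a standard transformer KV cache: two tensors (keys and values) per layer, each of size KV heads × head dimension, per token. The architecture values in the example are hypothetical placeholders, not figures for any specific model, and the site's estimates do not use this formula.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_element=2, batch_size=1):
    """KV cache size: 2 (K and V) x layers x KV heads x head dim x tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * context_len * batch_size


# Hypothetical architecture: 60 layers, 8 KV heads, head dim 128, fp16 cache.
at_8k = kv_cache_bytes(60, 8, 128, 8192)
at_32k = kv_cache_bytes(60, 8, 128, 32768)

print(f"8K context:  {at_8k / 1e9:.2f} GB")
print(f"32K context: {at_32k / 1e9:.2f} GB")
```

Quadrupling the context quadruples the cache, which is the linear growth the flat estimate does not capture.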

Data Sources

Hardware specs: TechPowerUp GPU Database, NVIDIA official specs, AMD product pages, Apple official tech specs. Each hardware row links to its source.
Model parameters: HuggingFace model cards, official model release blogs and technical reports. Each model row links to its HuggingFace page.
Quantization sizes: Cross-referenced from actual GGUF file sizes published by the llama.cpp community (Bartowski, TheBloke, and others) and the llama.cpp GitHub releases.
Community benchmarks: Submitted by users through our submission form. Each approved benchmark includes the software used and is displayed alongside estimated numbers.

Benchmark Submissions

Community-submitted benchmarks are the single most valuable data on this site — they're the one thing no AI summarizer or spec-sheet mirror can replicate. When you submit a benchmark, it goes into a review queue. Once approved, it appears on the relevant comparison page alongside the estimated number. Multiple benchmarks for the same combo are aggregated (median).
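
The median aggregation can be sketched as follows. The tuple layout, function name, and sample numbers are hypothetical and only show the grouping step; they are not real submissions.

```python
from statistics import median


def aggregate(submissions):
    """Group tok/s values by (model, hardware) and take the median of each group."""
    by_combo = {}
    for model, hardware, tok_s in submissions:
        by_combo.setdefault((model, hardware), []).append(tok_s)
    return {combo: median(values) for combo, values in by_combo.items()}


# Made-up sample data for illustration only.
sample = [
    ("model-a", "gpu-x", 40.0),
    ("model-a", "gpu-x", 44.0),
    ("model-a", "gpu-x", 90.0),  # an outlier barely moves the median
    ("model-b", "gpu-x", 12.0),
]
print(aggregate(sample))
```

A median is less sensitive to a single unusual submission than a mean would be.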
