On-Premises Hardware Sizing Guide for LLM Inference
1. Executive Summary
This guide provides a comprehensive, actionable framework for sizing on-premises hardware for Large Language Model (LLM) inference. It covers four major hardware platforms -- NVIDIA DGX Spark, NVIDIA H100, NVIDIA H200, and Intel Gaudi 3 -- and provides the formulas, benchmarks, and decision matrices needed to select the right hardware for your deployment.
All benchmark data has been updated to reflect the current generation of open-weight models as of March 2026, including Llama 4 (Scout/Maverick), Qwen 3.5, DeepSeek V3/R1, Kimi K2.5, GLM-5, Mistral Large 3, Mistral Small 4, and Phi-4. Most of these models use Mixture-of-Experts (MoE) architectures, which fundamentally changes sizing: total parameter counts are large (100B-1T+), but active parameters per token are much smaller (6B-40B), making them far more deployable than their headline sizes suggest.
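A minimal sketch of that distinction, using DeepSeek V3's figures from the tables below (the GB-per-billion-params arithmetic is exact at 8-bit; real deployments add KV cache and runtime overhead, covered in Sections 4 and 6):

```python
# Why MoE changes sizing: weight MEMORY scales with TOTAL parameters,
# while per-token bandwidth and compute scale with ACTIVE parameters.
# DeepSeek V3 figures are from the tables in this guide.

def weight_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Raw weight footprint in GB (1B params ~ 1 GB at 8-bit)."""
    return total_params_b * bytes_per_param

def weight_bytes_per_token_gb(active_params_b: float, bytes_per_param: float) -> float:
    """Approximate weight bytes streamed per decoded token (active experts only)."""
    return active_params_b * bytes_per_param

# DeepSeek V3 at FP8 (1 byte/param): 671B total, 37B active.
print(weight_memory_gb(671, 1.0))          # ~671 GB just to hold the model
print(weight_bytes_per_token_gb(37, 1.0))  # ~37 GB read per token: decode behaves like a 37B model
```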
Key Takeaways
| Decision Factor | Recommendation |
|---|---|
| Budget-constrained entry point | NVIDIA DGX Spark ($4,699) for models up to 200B total params (MoE) or ~34B dense |
| Best price-performance for inference | Intel Gaudi 3 (~$15,625/accelerator) at ~50% cost of H100 |
| Maximum single-GPU model capacity | NVIDIA H200 (141 GB HBM3e) -- fits Llama 4 Scout (109B MoE) on one GPU (FP8) |
| Highest throughput at scale | NVIDIA H200 8-GPU (~12,400 tok/s on Llama 4 Scout, ~2,864 tok/s on DeepSeek V3 FP8) |
| Large MoE deployment (670B-1T+) | H200 8-GPU (single node, FP8) for DeepSeek V3/Mistral Large 3/GLM-5/Kimi K2.5 |
2. Hardware Platform Specifications
2.1 Comparison Table
| Specification | DGX Spark | H100 SXM | H100 PCIe | H200 SXM | H200 NVL | Gaudi 3 OAM | Gaudi 3 PCIe |
|---|---|---|---|---|---|---|---|
| Architecture | GB10 Grace Blackwell | Hopper | Hopper | Hopper+ | Hopper+ | Gaudi 3 | Gaudi 3 |
| Process Node | 5nm / 4nm | 4nm | 4nm | 4nm | 4nm | 5nm | 5nm |
| Memory | 128 GB unified (LPDDR5x) | 80 GB HBM3 | 80 GB HBM2e | 141 GB HBM3e | 141 GB HBM3e | 128 GB HBM2e | 128 GB HBM2e |
| Memory Bandwidth | 273 GB/s | 3,350 GB/s | 2,000 GB/s | 4,800 GB/s | 4,800 GB/s | 3,700 GB/s | 3,700 GB/s |
| FP8 Compute | 1 PFLOP (FP4 w/ sparsity) | 3,958 TFLOPS | 2,000 TFLOPS | 3,958 TFLOPS | 3,958 TFLOPS | 1,835 TFLOPS | 1,835 TFLOPS |
| BF16 Compute | ~500 TFLOPS | 1,979 TFLOPS | 1,000 TFLOPS | 1,979 TFLOPS | 1,979 TFLOPS | 1,835 TFLOPS | 1,835 TFLOPS |
| Interconnect | ConnectX-7 | NVLink 4.0 (900 GB/s) | PCIe Gen5 | NVLink 4.0 (900 GB/s) | NVLink Bridge | 24x 200Gb RoCE | 24x 200Gb RoCE |
| TDP | 240W-500W+ | 700W | 350W | 700W | 600W | 900W | 600W |
| Form Factor | Desktop | SXM module | PCIe card | SXM module | PCIe card | OAM module | PCIe card |
| Price (per unit) | $4,699 | $35K-$40K | $25K-$30K | $30K-$40K | $30K-$35K | ~$15,625 | ~$15,625 |
| 8-GPU System | N/A (max 2) | ~$300K | ~$220K | ~$315K+ | ~$280K | ~$158K | ~$158K |
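Single-stream decode is usually memory-bandwidth bound, so the bandwidth row above gives a crude upper bound on per-user decode speed: bandwidth divided by the weight bytes read per token. A hedged sketch (it ignores KV-cache traffic and kernel efficiency, so real numbers land lower, while batching lifts aggregate throughput far above the per-stream bound):

```python
# Bandwidth-roofline estimate for single-stream decode (upper bound only:
# ignores KV-cache reads, communication, and kernel efficiency).

def decode_roofline_tok_s(mem_bw_gb_s: float, active_params_b: float,
                          bytes_per_param: float) -> float:
    return mem_bw_gb_s / (active_params_b * bytes_per_param)

# H200 (4,800 GB/s) serving Llama 4 Scout (17B active) at FP8:
print(decode_roofline_tok_s(4800, 17, 1.0))  # ~282 tok/s per stream, best case
```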
2.2 NVIDIA DGX Spark
- Chip: GB10 Grace Blackwell Superchip
- CPU: 20 cores (10x X925 + 10x A725)
- Memory: 128 GB unified LPDDR5x
- AI performance: up to 1 PFLOP (FP4), ~1,000 TOPS
- Storage: up to 4 TB NVMe SSD
- Price: $4,699
- Max model (single unit): ~200B params (quantized MoE)
- Max model (2 units): ~400B+ params (FP4 MoE)
- CES 2026 update: up to 2.5x performance improvement
2.3-2.4 NVIDIA H100 & H200
The H100 PCIe is best suited to single-GPU inference, cost-sensitive deployments, and existing PCIe infrastructure. H200 improvements over H100:
- Memory: 141 GB HBM3e (76% more than H100)
- Bandwidth: 4,800 GB/s (43% more than H100)
- Inference gain: 37-45% higher throughput vs H100
- Long-context: up to 1.83-2.14x throughput on long-context workloads
- Energy: same 700W TDP, ~50% better efficiency per token
2.5 Intel Gaudi 3
- Architecture: 64 TPCs + GEMM engines
- Memory: 128 GB HBM2e, 3,700 GB/s
- Compute: 1,835 TFLOPS FP8/BF16
- Networking: 24x 200Gb RoCE on-chip (saves ~$50K/node in external switching)
- Advantage: ~50% lower cost than H100
3. Performance Benchmarks by Model Size
3.0 Current Model Landscape (March 2026)
The open-weight model landscape has shifted heavily toward Mixture-of-Experts (MoE) architectures.
| Model | Total Params | Active Params | Architecture | Context | Use Cases |
|---|---|---|---|---|---|
| Phi-4 | 14B | 14B (dense) | Dense Transformer | 16K | Code, reasoning, edge |
| Qwen 3.5-27B | 27B | 27B (dense) | Dense Transformer | 262K | General purpose, long-context |
| Qwen 3.5-397B | 397B | 17B | MoE (512 experts) | 262K-1M | Flagship, multimodal |
| Llama 4 Scout | 109B | 17B | MoE (16 experts) | 10M | Long-context, multimodal |
| Llama 4 Maverick | 400B | 17B | MoE (128 experts) | 1M | Reasoning, code, agentic |
| Mistral Small 4 | 119B | 6B | MoE (128 experts) | 128K | Efficient inference, edge |
| Mistral Large 3 | 675B | 41B | MoE | 256K | Frontier, agentic |
| DeepSeek V3 | 671B | 37B | MoE (MLA) | 128K | General purpose, reasoning |
| DeepSeek R1 | 671B | 37B | MoE (reasoning) | 128K | Deep reasoning, STEM |
| Kimi K2.5 | 1,040B | 32B | MoE (384 experts, MLA) | 256K | Agentic, visual intelligence |
| GLM-5 | 744B | 40B | MoE (Sparse Attn) | 128K+ | Agentic coding, reasoning |
3.1 DGX Spark Benchmarks
| Model | Precision | Batch | Prefill (tok/s) | Decode (tok/s) | Framework |
|---|---|---|---|---|---|
| Phi-4 14B | FP8 | 1 | ~3,000 (est.) | ~40 (est.) | SGLang |
| Qwen 3.5-27B | FP8 | 1 | ~2,500 (est.) | ~25 (est.) | vLLM |
| Llama 4 Scout 109B | FP4 | 1 | ~6,000 (est.) | ~35 (est.) | TensorRT-LLM |
| Mistral Small 4 119B | FP4 | 1 | ~7,000 (est.) | ~45 (est.) | TensorRT-LLM |
| Qwen 3 14B | NVFP4 | -- | 5,929 | -- | TensorRT-LLM |
| DeepSeek-R1 14B (distilled) | FP8 | 8 | 2,074 | 83.5 | SGLang |
| Qwen 3 235B-A22B (2x Spark) | FP4 | -- | 23,477 | -- | TensorRT-LLM |
3.2-3.3 H100 & H200 Benchmarks
| Model | GPUs | Precision | Throughput (tok/s) | Notes |
|---|---|---|---|---|
| Llama 4 Scout (17B active) | 1x H100 | INT4 | 120-150 | Single-GPU inference |
| Qwen 3 235B-A22B (22B active) | 4x H100 | FP8 | ~1,400 aggregate | GPUStack benchmark |
| DeepSeek V3 (37B active) | 8x H100 | AWQ INT4 | ~3,000 total | GitHub benchmarks |
| Llama 4 Scout (17B active) | 8x H200 | FP8 | 12,432 | ~1.5x vs H100 |
| Qwen 3.5-397B (17B active) | 4x H200 | FP8 | ~4,600 | ~3.3x vs 4xH100 |
| DeepSeek V3 (37B active) | 8x H200 | FP8 | 2,864 | Single node FP8 |
| Kimi K2.5 1T (32B active) | 8x H200 | INT4 | ~2,000-3,000 (est.) | Fits single node |
| GLM-5 744B (40B active) | 8x H200 | FP8 | ~1,215 output | Fits single node |
| Mistral Large 3 (41B active) | 8x H200 | FP8 | ~2,500-3,500 (est.) | Fits single node |
3.4 Intel Gaudi 3 Benchmarks
| Model | HPUs | Precision | Throughput (tok/s) |
|---|---|---|---|
| Llama 3.1 8B | 1 | FP8 | 20,705-24,535 |
| Llama 3.1 70B | 8 | FP8 | 18,428-21,448 |
| Llama 3.3 70B | 8 | FP8 | 18,714-21,473 |
| Llama 4 Scout 109B | 8 | FP8 | ~10,000-14,000 (est.) |
4. Memory Requirements & Quantization Impact
4.1 Model Weight Memory
| Model | Total Params | Active | FP16 | FP8 | INT4 |
|---|---|---|---|---|---|
| Phi-4 | 14B | 14B | 28 GB | 14 GB | 7 GB |
| Qwen 3.5-27B | 27B | 27B | 54 GB | 27 GB | 13.5 GB |
| Llama 4 Scout | 109B | 17B | 218 GB | 109 GB | ~55 GB |
| Qwen 3.5-397B | 397B | 17B | 794 GB | 397 GB | ~199 GB |
| DeepSeek V3 | 671B | 37B | 1,342 GB | 671 GB | ~336 GB |
| Mistral Large 3 | 675B | 41B | 1,350 GB | 675 GB | ~338 GB |
| GLM-5 | 744B | 40B | 1,488 GB | 744 GB | ~372 GB |
| Kimi K2.5 | 1,040B | 32B | 2,080 GB | 1,040 GB | ~595 GB |
4.2 Total VRAM Requirements
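A working rule of thumb, sketched below: total VRAM ≈ weights + KV cache + a runtime overhead margin. The 15% margin is our assumption (activations, CUDA graphs, allocator fragmentation), not a measured figure; tune it for your serving stack.

```python
# Total VRAM rule of thumb: weights + KV cache, plus a runtime margin.
# The 15% overhead is an illustrative assumption -- tune per stack.

def total_vram_gb(weights_gb: float, kv_cache_gb: float,
                  overhead_frac: float = 0.15) -> float:
    return (weights_gb + kv_cache_gb) * (1 + overhead_frac)

# Llama 4 Scout at FP8 (109 GB, sec. 4.1) plus ~12 GB KV cache
# (8 chat users at 8K context, sec. 6.3):
print(total_vram_gb(109, 12))  # ~139 GB -> one H200 (141 GB), not one H100 (80 GB)
```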
4.3 Quantization Impact
| Method | Bits | Memory Savings | Throughput Gain | Quality | Best For |
|---|---|---|---|---|---|
| FP16/BF16 | 16 | Baseline | Baseline | 100% | Maximum quality |
| FP8 | 8 | 50% | ~1.5-2.2x | ~99.9% | H100/H200 production |
| INT8 (W8A8) | 8 | 50% | ~1.5-2x | ~99.96% | General production |
| GPTQ-INT4 | 4 | 75% | ~2.7x | ~98.1% | Memory-constrained |
| AWQ-INT4 | 4 | 75% | ~2.7x | ~98.5% | Best INT4 quality |
| FP4/NVFP4 | 4 | 75% | ~3x | ~97% | DGX Spark / Blackwell |
5. Concurrent User Sizing Methodology
5.2 Workload Profiles
| Workload | Avg Input | Avg Output | Latency | Tokens/Req |
|---|---|---|---|---|
| Chat | 500-2K | 200-500 | 5-15s | ~500 |
| Code completion | 200-1K | 50-200 | 1-3s | ~150 |
| Summarization | 2K-8K | 200-1K | 10-30s | ~1,000 |
| RAG | 1K-4K | 200-800 | 5-15s | ~800 |
| Agentic | 500-2K | 500-2K | 15-60s | ~2,000 |
| Batch | 1K-32K | 500-4K | Minutes | ~4,000 |
5.3 User Capacity by Hardware
| Hardware | Model | tok/s | Chat Users | Code Users |
|---|---|---|---|---|
| 1x DGX Spark | Phi-4 14B | ~40 | 0-1 | 1-2 |
| 1x H100 SXM | Llama 4 Scout (INT4) | ~120-150 | 2-3 | 5-8 |
| 8x H100 SXM | DeepSeek V3 (AWQ) | ~3,000 | 40-60 | 100-150 |
| 8x H200 SXM | Llama 4 Scout (FP8) | ~12,432 | 80-120 | 200-300 |
| 8x H200 SXM | DeepSeek V3 (FP8) | ~2,864 | 45-55 | 100-130 |
| 8x Gaudi 3 | Llama 3.3 70B | ~18K-21K | 35-50 | 80-120 |
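A rough way to read this table: divide aggregate throughput by a per-user token budget. The budgets below are back-solved from the rows above and include prefill and burst headroom, so treat them as assumptions rather than published figures.

```python
# Concurrent users ~ aggregate tok/s / per-user budget. The budget is the
# aggregate capacity reserved per active user (prefill + burst headroom),
# NOT the streaming rate the user sees. Values are back-solved assumptions.

PER_USER_BUDGET_TOK_S = {"chat": 120, "code": 50}

def concurrent_users(aggregate_tok_s: float, workload: str) -> int:
    return int(aggregate_tok_s / PER_USER_BUDGET_TOK_S[workload])

print(concurrent_users(12_432, "chat"))  # ~103 -> within the 80-120 row above
print(concurrent_users(12_432, "code"))  # ~248 -> within the 200-300 row above
```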
6. KV Cache Memory Calculations
6.3 KV Cache by Context & Concurrency (Llama 4 Scout, FP16)
| Context | 1 User | 8 Users | 32 Users | 64 Users | 128 Users |
|---|---|---|---|---|---|
| 2K | 0.4 GB | 3 GB | 12 GB | 24 GB | 48 GB |
| 8K | 1.5 GB | 12 GB | 48 GB | 96 GB | 192 GB |
| 32K | 6 GB | 48 GB | 192 GB | 384 GB | 768 GB |
| 128K | 24 GB | 192 GB | 768 GB | 1,536 GB | 3,072 GB |
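These figures come from the standard per-token KV formula, sketched below. The layer/head configuration in the example is a hypothetical stand-in rather than Scout's published architecture; read the real values from the model's config.json before trusting the output.

```python
# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes.
# The config below is a HYPOTHETICAL GQA example, not Llama 4 Scout's
# published architecture -- take layers/kv_heads/head_dim from config.json.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, users: int, bytes_per_val: int = 2) -> float:
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_val
    return per_token_bytes * context_len * users / 1e9

# Hypothetical: 48 layers, 8 KV heads, head_dim 128, FP16 (2 bytes):
print(kv_cache_gb(48, 8, 128, 2048, 1))      # ~0.8 GB for one user at 2K
print(kv_cache_gb(48, 8, 128, 131_072, 64))  # ~3.3 TB at 128K x 64 users -> why sec. 6.4 matters
```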
6.4 KV Cache Optimization
| Technique | Savings | Quality Impact | Recommendation |
|---|---|---|---|
| FP8 KV Cache | 50% | Negligible | Strongly recommended on H100/H200 |
| PagedAttention (vLLM) | 20-40% | None | Always use |
| MLA (DeepSeek/Kimi) | 70-90% | None (architectural) | Native to model |
| Sparse Attention (GLM-5) | ~6x | Minimal | Native to model |
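As a first-order model, these savings stack multiplicatively, as sketched below (an approximation we're assuming; real savings interact and depend on workload shape).

```python
# First-order stacking of KV-cache savings (approximation: real
# savings interact and depend on workload shape).

def optimized_kv_gb(baseline_fp16_gb: float, fp8_kv: bool = True,
                    paged_savings: float = 0.3) -> float:
    gb = baseline_fp16_gb
    if fp8_kv:
        gb *= 0.5                    # FP8 KV cache: half of FP16
    return gb * (1 - paged_savings)  # PagedAttention: ~20-40% less waste

# 64 users at 128K context (1,536 GB at FP16, sec. 6.3):
print(optimized_kv_gb(1536))  # ~538 GB -> fits beside Scout's 109 GB FP8 weights on 8x H200
```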
7. Latency Requirements & SLOs
| Metric | Definition | Chat Target | Code Target |
|---|---|---|---|
| TTFT | Time to First Token | < 500ms | < 100ms |
| ITL | Inter-Token Latency | < 50ms (20+ tok/s) | < 30ms (33+ tok/s) |
| TPOT | Time Per Output Token | < 33ms (30+ tok/s) | < 20ms (50+ tok/s) |
| E2E | End-to-End Latency | < 10-15s | < 3s |
| tok/s | User Experience | Suitability |
|---|---|---|
| < 5 | Noticeably slow, frustrating | Batch only |
| 5-10 | Readable but sluggish | Long-form |
| 10-20 | Good streaming | Chat, RAG |
| 20-40 | Excellent, responsive | Code, chat |
| 40+ | Near-instantaneous | Real-time |
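These targets tie together: end-to-end latency is roughly TTFT plus output length times TPOT, as the sketch below shows for the chat targets above (the 350-token response length is our assumption).

```python
# E2E latency ~= TTFT + output_tokens x TPOT.

def e2e_latency_s(ttft_s: float, output_tokens: int, tpot_s: float) -> float:
    return ttft_s + output_tokens * tpot_s

# Chat targets: 500ms TTFT, 33ms TPOT, assumed 350-token response:
print(e2e_latency_s(0.5, 350, 0.033))  # ~12.1s -> inside the 10-15s E2E target
```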
8. Model-to-Hardware Mapping
| Model | Min Hardware (FP16) | Recommended (FP8) | Budget (INT4) |
|---|---|---|---|
| Phi-4 14B | 1x H100 PCIe | 1x H100 / 1x Gaudi 3 | DGX Spark |
| Llama 4 Scout 109B | 4x H100 SXM | 2x H100 / 1x H200 | 1x H100 (INT4) |
| Qwen 3.5-397B | 16x H100 (2 nodes) | 8x H100 / 4x H200 | 4x H100 (INT4) |
| DeepSeek V3 671B | Multi-node H100 | 8x H200 (single node) | 8x H100 (AWQ INT4) |
| GLM-5 744B | Multi-node H100 | 8x H200 (FP8) | Not practical on H100 |
| Kimi K2.5 1T | Multi-node | 8x H200 (INT4) | 8x H200 (INT4, tight) |
8.3 DGX Spark Use Cases
| Use Case | Models | Performance |
|---|---|---|
| Dev & prototyping | Llama 4 Scout, Qwen 3.5-27B, Phi-4 | 25-150 tok/s decode |
| Fine-tuning (LoRA) | Up to Qwen 3.5-27B, Phi-4 | 760-7,000 tok/s training |
| Local inference (1 user) | Phi-4, Mistral Small 4 (FP4) | 25-80 tok/s decode |
| Air-gapped environments | Any MoE up to ~200B (Q4) | Slow but functional |
9. Multi-GPU Scaling Configurations
| Strategy | When to Use | Communication | Overhead |
|---|---|---|---|
| Tensor Parallelism (TP) | Within a node (NVLink) | 900 GB/s | Low (5-15%) |
| Pipeline Parallelism (PP) | Across nodes | InfiniBand/RoCE | Medium (10-30%) |
| Data Parallelism | Independent requests | Minimal | None per-request |
| Expert Parallelism (EP) | MoE models | NVLink/InfiniBand | Model-dependent |
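A first-order way to combine these numbers: multiply single-GPU throughput by GPU count, then discount by the overhead fraction. The single-GPU figure in the example is a placeholder, not a benchmark.

```python
# Effective throughput under parallelism overhead (first-order model).

def effective_tok_s(single_gpu_tok_s: float, n_gpus: int,
                    overhead_frac: float) -> float:
    return single_gpu_tok_s * n_gpus * (1 - overhead_frac)

# Placeholder single-GPU rate, 8-way TP at ~10% overhead (table above):
print(effective_tok_s(400, 8, 0.10))  # ~2,880 tok/s, not a clean 3,200
```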
9.2 Performance Scaling
| Config | Model (FP8) | Throughput | Memory | Investment |
|---|---|---|---|---|
| 1x H100 | Llama 4 Scout (INT4) | ~120-150 tok/s | 80 GB | $35-40K |
| 4x H100 | Qwen 3 235B (FP8) | ~1,400 tok/s | 320 GB | $140-160K |
| 8x H100 | DeepSeek V3 (AWQ) | ~3,000 tok/s | 640 GB | $300K |
| 4x H200 | Qwen 3.5-397B (FP8) | ~4,600 tok/s | 564 GB | $140-175K |
| 8x H200 | DeepSeek V3 (FP8) | ~2,864 tok/s | 1,128 GB | $315K |
| 8x H200 | Llama 4 Scout (FP8) | ~12,432 tok/s | 1,128 GB | $315K |
| 8x Gaudi 3 | Llama 3.3 70B (FP8) | ~18K-21K tok/s | 1,024 GB | ~$158K |
10. Power, Cooling & Data Center Requirements
| Configuration | GPU Power (total) | System Total | Annual Cost (@ $0.10/kWh) |
|---|---|---|---|
| 1x DGX Spark | ~500W | ~500W | ~$440 |
| 8x H100 SXM (DGX H100) | 5,600W | ~10,200W | ~$8,935 |
| 8x H200 SXM (HGX H200) | 5,600W | ~10,200W | ~$8,935 |
| 8x Gaudi 3 OAM | 7,200W | ~10,500W | ~$9,198 |
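The annual-cost column is plain arithmetic, reproduced below (24/7 duty assumed; cooling overhead/PUE excluded, so budget another ~1.2-1.5x for the full bill).

```python
# Annual energy cost = kW x 8,760 h/yr x $/kWh (24/7 duty; PUE excluded).

def annual_power_cost_usd(system_watts: float, usd_per_kwh: float = 0.10) -> float:
    return system_watts / 1000 * 8760 * usd_per_kwh

print(annual_power_cost_usd(10_200))  # ~$8,935 -- the 8x H100/H200 rows
print(annual_power_cost_usd(500))     # ~$438 -- DGX Spark
```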
| Power Range | Cooling Method | Notes |
|---|---|---|
| < 1 kW | Standard office HVAC | Desktop, no special cooling |
| 1-5 kW | Standard rack air cooling | 42U rack, adequate airflow |
| 5-10 kW | Enhanced air / rear-door HX | Hot/cold aisle recommended |
| 10-20 kW | Direct liquid cooling recommended | 70-75% heat via liquid |
| 20+ kW | Direct liquid cooling mandatory | Supply 40°C / return 50°C |
11. Total Cost of Ownership (TCO) Analysis
11.3 Three-Year TCO Comparison
| Config | Model | Hardware | 3-Year OpEx | 3-Year TCO | Cost per tok/s |
|---|---|---|---|---|---|
| 1x DGX Spark | Phi-4 14B | $4,699 | $63K | $67.7K | $1,693 (40 tok/s) |
| 8x H100 SXM | DeepSeek V3 (AWQ) | $300K | $420K | $720K | $240 (3,000 tok/s) |
| 8x H200 SXM | Llama 4 Scout (FP8) | $350K | $420K | $770K | $62 (12,432 tok/s) |
| 8x Gaudi 3 | Llama 3.3 70B (FP8) | $158K | $370K | $528K | $25-29 (18K-21K tok/s) |
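The last two columns follow directly from the first three, as the sketch below reproduces; sustained throughput is the table's benchmark figure, so cost per tok/s inherits all of that benchmark's caveats.

```python
# 3-year TCO = hardware + 3-year OpEx; cost per tok/s = TCO / throughput.

def tco_3yr_usd(hardware_usd: float, opex_3yr_usd: float) -> float:
    return hardware_usd + opex_3yr_usd

def cost_per_tok_s(tco_usd: float, sustained_tok_s: float) -> float:
    return tco_usd / sustained_tok_s

h200_tco = tco_3yr_usd(350_000, 420_000)  # $770K
print(cost_per_tok_s(h200_tco, 12_432))   # ~$62 per sustained tok/s
print(cost_per_tok_s(tco_3yr_usd(158_000, 370_000), 21_000))  # ~$25 for 8x Gaudi 3
```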
11.4 Self-Hosting Break-Even
12. Sizing Calculator & Formulas
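A minimal end-to-end sizing pass, stringing together the rules of thumb from Sections 4 and 6 (weight memory, KV cache, overhead margin, GPU count). Treat it as a starting point under the stated assumptions, not a substitute for benchmarking.

```python
import math

# Minimal sizing calculator from this guide's rules of thumb.
# The 15% overhead margin is an assumption; benchmark before buying.

def min_gpus(total_params_b: float, weight_bits: int, kv_gb: float,
             gpu_mem_gb: float, overhead_frac: float = 0.15) -> int:
    weights_gb = total_params_b * weight_bits / 8
    needed_gb = (weights_gb + kv_gb) * (1 + overhead_frac)
    return math.ceil(needed_gb / gpu_mem_gb)

# DeepSeek V3 (671B) at FP8 with a ~200 GB KV budget on H200 (141 GB):
print(min_gpus(671, 8, 200, 141))  # 8 -> a single HGX H200 node, matching sec. 8
```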
12.3-12.4 Quick Sizing Tables
| Users | Llama 4 Scout Min | Scout Recommended | DeepSeek V3 Min | V3 Recommended |
|---|---|---|---|---|
| 1-5 | 1x H200 | 2x H100 | 8x H200 | 8x H200 |
| 15-50 | 4x H100 | 8x H200 | 8x H200 | 2x 8-GPU H200 |
| 50-100 | 8x H200 | 8x H200 | 2x 8-GPU H200 | 3x 8-GPU H200 |
| 200-500 | 2x 8-GPU H200 | 4x 8-GPU H200 | 4x 8-GPU H200 | 8x 8-GPU H200 |
13. Workload-Specific Recommendations
Chat / Conversational AI
| Factor | Guidance |
|---|---|
| Latency | TTFT < 500ms, ITL < 50ms |
| Model | Llama 4 Scout, Qwen 3 235B, DeepSeek V3 |
| Target | 20-40 tok/s per user |
| Best HW | H200 SXM |
Code Completion
| Factor | Guidance |
|---|---|
| Latency | TTFT < 100ms, ITL < 30ms |
| Model | Phi-4 14B, Qwen 3.5-27B |
| Key | Latency-sensitive, high concurrency |
| Best HW | H100 SXM |
RAG / Summarization
| Factor | Guidance |
|---|---|
| Latency | TTFT < 1s, ITL < 50ms |
| Model | Qwen 3.5-27B, Llama 4 Scout |
| Key | Long input handling (4K-32K) |
| Best HW | H200 SXM (141 GB for KV cache) |
Agentic Workflows
| Factor | Guidance |
|---|---|
| Latency | E2E < 60s per step |
| Model | DeepSeek V3, Kimi K2.5, GLM-5 |
| Key | Quality > speed |
| Best HW | 8x H200 SXM |
Batch Processing
| Factor | Guidance |
|---|---|
| Priority | Minimize total processing time |
| Model | Any (Phi-4 to Kimi K2.5) |
| Optimization | Large batches, FP8, EAGLE |
| Best HW | 8x H200 or 8x Gaudi 3 |
14. Decision Framework
14.1 Budget-Based Selection
14.2 Platform Scorecard
| Criteria (1-5) | DGX Spark | H100 SXM | H200 SXM | Gaudi 3 |
|---|---|---|---|---|
| Inference speed | 2 | 4 | 5 | 3.5 |
| Memory capacity | 3 | 3 | 5 | 4 |
| Price-performance | 2 | 3 | 4 | 5 |
| Software ecosystem | 4 | 5 | 5 | 2.5 |
| Ease of deployment | 5 | 3 | 3 | 2 |
| Multi-GPU scaling | 1 | 5 | 5 | 3.5 |
| Max model size | 3 | 4 | 5 | 4 |
14.3 When to Choose Each Platform
Choose DGX Spark when:
- Budget under $10K
- Single-developer prototyping
- Air-gapped / edge environments
- Fine-tuning up to 27B (QLoRA)
- No data center required
Choose H100 when:
- Broadest software ecosystem
- Phi-4 to Llama 4 Scout production
- Battle-tested infrastructure
- Multi-GPU tensor parallelism
Choose H200 when:
- 141 GB single-GPU capacity needed
- 670B-1T+ MoE models
- Highest inference throughput
- Long context (128K+ tokens)
Choose Gaudi 3 when:
- Price-performance is the priority
- Standard Llama family models
- Integrated networking saves $50K+/node
- Budget-constrained production
15. Sources & References
Hardware Specs & Reviews
- NVIDIA DGX Spark Hardware Overview
- DGX Spark In-Depth Review (LMSYS)
- NVIDIA H100 Official Page
- NVIDIA H200 Official Page
- Intel Gaudi 3 White Paper
Benchmarks
- NVIDIA: Llama 4 Scout & Maverick Inference
- Llama 4 in vLLM
- DeepSeek V3 H200 Benchmarking (Verda)
- Qwen3-235B on H100 (GPUStack)
- MLPerf Inference v5.1 Results
Sizing & Infrastructure
- LLM Inference Sizing (VMware)
- Lenovo LLM Sizing Guide
- Mastering LLM Inference (NVIDIA)
- LLM Quantization Guide (AI Multiple)