Research Report — v2.0 — March 2026

On-Premises Hardware Sizing Guide for LLM Inference

A comprehensive, actionable framework for sizing on-premises hardware for Large Language Model inference. Covers NVIDIA DGX Spark, H100, H200, and Intel Gaudi 3 with formulas, benchmarks, and decision matrices for 11 current-generation models.

In this guide: 4 hardware platforms, 11 LLM models benchmarked, 30+ comparison tables, 9 sizing formulas.

Key figures:
  • 141 GB HBM3e per GPU (H200)
  • 12,432 tok/s (8x H200 + Llama 4 Scout)
  • $4,699 DGX Spark entry point
  • 1T+ max model params (MoE)

1. Executive Summary

This guide provides a comprehensive, actionable framework for sizing on-premises hardware for Large Language Model (LLM) inference. It covers four major hardware platforms -- NVIDIA DGX Spark, NVIDIA H100, NVIDIA H200, and Intel Gaudi 3 -- and provides the formulas, benchmarks, and decision matrices needed to select the right hardware for your deployment.

All benchmark data has been updated to reflect the current generation of open-weight models as of March 2026, including Llama 4 (Scout/Maverick), Qwen 3.5, DeepSeek V3/R1, Kimi K2.5, GLM-5, Mistral Large 3, Mistral Small 4, and Phi-4. Most of these models use Mixture-of-Experts (MoE) architectures, which fundamentally changes sizing: total parameter counts are large (100B-1T+), but active parameters per token are much smaller (6B-40B), making them far more deployable than their headline sizes suggest.

Key Takeaways

Decision Factor | Recommendation
Budget-constrained entry point | NVIDIA DGX Spark ($4,699) for models up to 200B total params (MoE) or ~34B dense
Best price-performance for inference | Intel Gaudi 3 (~$15,625/accelerator) at ~50% the cost of H100
Maximum single-GPU model capacity | NVIDIA H200 (141 GB HBM3e) -- fits Llama 4 Scout (109B MoE) on one GPU (FP8)
Highest throughput at scale | NVIDIA H200 8-GPU (~12,400 tok/s on Llama 4 Scout, ~2,864 tok/s on DeepSeek V3 FP8)
Large MoE deployment (670B-1T+) | H200 8-GPU (single node, FP8) for DeepSeek V3 / Mistral Large 3 / GLM-5 / Kimi K2.5

2. Hardware Platform Specifications

2.1 Comparison Table

Specification | DGX Spark | H100 SXM | H100 PCIe | H200 SXM | H200 NVL | Gaudi 3 OAM | Gaudi 3 PCIe
Architecture | GB10 Grace Blackwell | Hopper | Hopper | Hopper+ | Hopper+ | Gaudi 3 | Gaudi 3
Process Node | 5nm / 4nm | 4nm | 4nm | 4nm | 4nm | 5nm | 5nm
Memory | 128 GB unified (LPDDR5x) | 80 GB HBM3 | 80 GB HBM2e | 141 GB HBM3e | 141 GB HBM3e | 128 GB HBM2e | 128 GB HBM2e
Memory Bandwidth | 273 GB/s | 3,350 GB/s | 2,000 GB/s | 4,800 GB/s | 4,800 GB/s | 3,700 GB/s | 3,700 GB/s
FP8 Compute | 1 PFLOP (FP4 w/ sparsity) | 3,958 TFLOPS | 2,000 TFLOPS | 3,958 TFLOPS | 3,958 TFLOPS | 1,835 TFLOPS | 1,835 TFLOPS
BF16 Compute | ~500 TFLOPS | 1,979 TFLOPS | 1,000 TFLOPS | 1,979 TFLOPS | 1,979 TFLOPS | 1,835 TFLOPS | 1,835 TFLOPS
Interconnect | ConnectX-7 | NVLink 4.0 (900 GB/s) | PCIe Gen5 | NVLink 4.0 (900 GB/s) | NVLink Bridge | 24x 200Gb RoCE | 24x 200Gb RoCE
TDP | 240W-500W+ | 700W | 350W | 700W | 600W | 900W | 600W
Form Factor | Desktop | SXM module | PCIe card | SXM module | PCIe card | OAM module | PCIe card
Price (per unit) | $4,699 | $35K-$40K | $25K-$30K | $30K-$40K | $30K-$35K | ~$15,625 | ~$15,625
8-GPU System | N/A (max 2) | ~$300K | ~$220K | ~$315K+ | ~$280K | ~$158K | ~$158K

2.2 NVIDIA DGX Spark

Core Hardware
  • Chip: GB10 Grace Blackwell Superchip
  • CPU: 20 cores (10x X925 + 10x A725)
  • Memory: 128 GB unified LPDDR5x
  • AI Perf: Up to 1 PFLOP (FP4), ~1,000 TOPS
  • Storage: Up to 4 TB NVMe SSD
  • Price: $4,699
Model Capacity
  • Max (single unit): ~200B params (quantized MoE)
  • Max (2 units): ~400B+ params (FP4 MoE)
  • CES 2026 Update: Up to 2.5x perf improvement

2.3-2.4 NVIDIA H100 & H200

SXM vs PCIe
SXM: Large MoE models requiring multi-GPU tensor parallelism, maximum throughput.
PCIe: Single-GPU inference, cost-sensitive deployments, existing PCIe infrastructure.
H200 Key Advantages
  • Memory: 141 GB HBM3e (76% more than H100)
  • Bandwidth: 4,800 GB/s (43% more than H100)
  • Inference: 37-45% higher throughput vs H100
  • Long context: up to 1.83-2.14x speedup on long-context workloads
  • Energy: same 700W TDP, ~50% better efficiency
2.5 Intel Gaudi 3
  • Architecture: 64 TPCs + GEMM engines
  • Memory: 128 GB HBM2e, 3,700 GB/s
  • Compute: 1,835 TFLOPS FP8/BF16
  • Networking: 24x 200Gb RoCE (saves ~$50K/node)
  • Advantage: ~50% lower cost than H100

3. Performance Benchmarks by Model Size

3.0 Current Model Landscape (March 2026)

The open-weight model landscape has shifted heavily toward Mixture-of-Experts (MoE) architectures.

Model | Total Params | Active Params | Architecture | Context | Use Cases
Phi-4 | 14B | 14B (dense) | Dense Transformer | 16K | Code, reasoning, edge
Qwen 3.5-27B | 27B | 27B (dense) | Dense Transformer | 262K | General purpose, long-context
Qwen 3.5-397B | 397B | 17B | MoE (512 experts) | 262K-1M | Flagship, multimodal
Llama 4 Scout | 109B | 17B | MoE (16 experts) | 10M | Long-context, multimodal
Llama 4 Maverick | 400B | 17B | MoE (128 experts) | 1M | Reasoning, code, agentic
Mistral Small 4 | 119B | 6B | MoE (128 experts) | 128K | Efficient inference, edge
Mistral Large 3 | 675B | 41B | MoE | 256K | Frontier, agentic
DeepSeek V3 | 671B | 37B | MoE (MLA) | 128K | General purpose, reasoning
DeepSeek R1 | 671B | 37B | MoE (reasoning) | 128K | Deep reasoning, STEM
Kimi K2.5 | 1,040B | 32B | MoE (384 experts, MLA) | 256K | Agentic, visual intelligence
GLM-5 | 744B | 40B | MoE (Sparse Attn) | 128K+ | Agentic coding, reasoning

3.1 DGX Spark Benchmarks

Model | Precision | Batch | Prefill (tok/s) | Decode (tok/s) | Framework
Phi-4 14B | FP8 | 1 | ~3,000 (est.) | ~40 (est.) | SGLang
Qwen 3.5-27B | FP8 | 1 | ~2,500 (est.) | ~25 (est.) | vLLM
Llama 4 Scout 109B | FP4 | 1 | ~6,000 (est.) | ~35 (est.) | TensorRT-LLM
Mistral Small 4 119B | FP4 | 1 | ~7,000 (est.) | ~45 (est.) | TensorRT-LLM
Qwen 3 14B | NVFP4 | -- | 5,929 | -- | TensorRT-LLM
DeepSeek-R1 14B (distilled) | FP8 | 8 | 2,074 | 83.5 | SGLang
Qwen 3 235B-A22B (2x Spark) | FP4 | -- | 23,477 | -- | TensorRT-LLM
Key Insight
DGX Spark excels at prefill but is limited on decode (273 GB/s bandwidth). Expect 2-50 tok/s decode. MoE models with low active params (6B-17B) run efficiently. CES 2026 updates delivered up to 2.5x improvements.

3.2-3.3 H100 & H200 Benchmarks

Model | GPUs | Precision | Throughput (tok/s) | Notes
Llama 4 Scout (17B active) | 1x H100 | INT4 | 120-150 | Single-GPU inference
Qwen 3.5-397B (17B active) | 4x H100 | FP8 | ~1,400 aggregate | GPUStack benchmark
DeepSeek V3 (37B active) | 8x H100 | AWQ INT4 | ~3,000 total | GitHub benchmarks
Llama 4 Scout (17B active) | 8x H200 | FP8 | 12,432 | ~1.5x vs H100
Qwen 3.5-397B (17B active) | 4x H200 | FP8 | ~4,600 | ~3.3x vs 4x H100
DeepSeek V3 (37B active) | 8x H200 | FP8 | 2,864 | Single node, FP8
Kimi K2.5 1T (32B active) | 8x H200 | INT4 | ~2,000-3,000 (est.) | Fits single node
GLM-5 744B (40B active) | 8x H200 | FP8 | ~1,215 output | Fits single node
Mistral Large 3 (41B active) | 8x H200 | FP8 | ~2,500-3,500 (est.) | Fits single node
Key Insight
8x H200 (1,128 GB) holds every model in this guide in a single node -- FP8 for models up to ~750B, INT4 for Kimi K2.5 at 1T. This is the primary advantage over H100: 8x H100 (640 GB) cannot hold 670B+ models without aggressive quantization or a second node.

3.4 Intel Gaudi 3 Benchmarks

Model | HPUs | Precision | Throughput (tok/s)
Llama 3.1 8B | 1 | FP8 | 20,705-24,535
Llama 3.1 70B | 8 | FP8 | 18,428-21,448
Llama 3.3 70B | 8 | FP8 | 18,714-21,473
Llama 4 Scout 109B | 8 | FP8 | ~10,000-14,000 (est.)
Key Insight
Gaudi 3 achieves 95-170% of H100 performance at ~50% hardware cost. Software ecosystem is expanding but less mature than NVIDIA.

4. Memory Requirements & Quantization Impact

4.1 Model Weight Memory

Model | Total Params | Active Params | FP16 | FP8 | INT4
Phi-4 | 14B | 14B | 28 GB | 14 GB | 7 GB
Qwen 3.5-27B | 27B | 27B | 54 GB | 27 GB | 13.5 GB
Llama 4 Scout | 109B | 17B | 218 GB | 109 GB | ~55 GB
Qwen 3.5-397B | 397B | 17B | 794 GB | 397 GB | ~199 GB
DeepSeek V3 | 671B | 37B | 1,342 GB | 671 GB | ~336 GB
Mistral Large 3 | 675B | 41B | 1,350 GB | 675 GB | ~338 GB
GLM-5 | 744B | 40B | 1,488 GB | 744 GB | ~372 GB
Kimi K2.5 | 1,040B | 32B | 2,080 GB | 1,040 GB | ~595 GB

4.2 Total VRAM Requirements

Formula: Total VRAM = Model Weights + KV Cache + Activations + Framework Overhead
Practical Rule
Add 30-50% to model weight size for KV cache, activations, and framework overhead.
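
As a quick sanity check, here is a minimal Python sketch of that rule. The function name and the 40% default overhead are illustrative choices within the 30-50% range above, not a fixed standard:

```python
def total_vram_gb(total_params_b: float, bytes_per_param: float,
                  overhead: float = 0.40) -> float:
    """Model weights (params x bytes/param) plus 30-50% overhead
    for KV cache, activations, and framework buffers."""
    weights_gb = total_params_b * bytes_per_param
    return weights_gb * (1 + overhead)

# DeepSeek V3 (671B) in FP8: ~939 GB -> an 8x H200 node (1,128 GB) with headroom
print(round(total_vram_gb(671, 1.0)))
# Llama 4 Scout (109B) in INT4 (0.5 bytes/param): ~76 GB -> fits a single 80 GB H100
print(round(total_vram_gb(109, 0.5)))
```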

4.3 Quantization Impact

Method | Bits | Memory Savings | Throughput Gain | Quality | Best For
FP16/BF16 | 16 | Baseline | Baseline | 100% | Maximum quality
FP8 | 8 | 50% | ~1.5-2.2x | ~99.9% | H100/H200 production
INT8 (W8A8) | 8 | 50% | ~1.5-2x | ~99.96% | General production
GPTQ-INT4 | 4 | 75% | ~2.7x | ~98.1% | Memory-constrained
AWQ-INT4 | 4 | 75% | ~2.7x | ~98.5% | Best INT4 quality
FP4/NVFP4 | 4 | 75% | ~3x | ~97% | DGX Spark / Blackwell

5. Concurrent User Sizing Methodology

Core Formula: Required Throughput (tok/s) = Concurrent Users x Avg Output Tokens / Target Response Time (s)
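
A small sketch of this formula in Python (the helper name is illustrative; the 1.3 headroom factor is the same one used in formula F5 later in this guide):

```python
def required_throughput(concurrent_users: int, avg_output_tokens: int,
                        target_response_s: float, headroom: float = 1.3) -> float:
    """Aggregate decode throughput needed to hit the response-time target."""
    return concurrent_users * avg_output_tokens / target_response_s * headroom

# Chat profile from 5.2 (~500 output tokens, ~10 s target), 50 concurrent users:
print(required_throughput(50, 500, 10))  # 3,250 tok/s -> roughly one 8x H100 node (~3,000 tok/s)
```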

5.2 Workload Profiles

Workload | Avg Input (tok) | Avg Output (tok) | Latency Target | Tokens/Request
Chat | 500-2K | 200-500 | 5-15s | ~500
Code completion | 200-1K | 50-200 | 1-3s | ~150
Summarization | 2K-8K | 200-1K | 10-30s | ~1,000
RAG | 1K-4K | 200-800 | 5-15s | ~800
Agentic | 500-2K | 500-2K | 15-60s | ~2,000
Batch | 1K-32K | 500-4K | Minutes | ~4,000

5.3 User Capacity by Hardware

Hardware | Model | tok/s | Chat Users | Code Users
1x DGX Spark | Phi-4 14B | ~40 | 0-1 | 1-2
1x H100 SXM | Llama 4 Scout (INT4) | ~120-150 | 2-3 | 5-8
8x H100 SXM | DeepSeek V3 (AWQ) | ~3,000 | 40-60 | 100-150
8x H200 SXM | Llama 4 Scout (FP8) | ~12,432 | 80-120 | 200-300
8x H200 SXM | DeepSeek V3 (FP8) | ~2,864 | 45-55 | 100-130
8x Gaudi 3 | Llama 3.3 70B | ~18K-21K | 35-50 | 80-120
Important
"Concurrent users" means actively waiting for a response. Typical active-to-total ratio is 1:10 to 1:20.

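A tiny illustration of how that ratio converts a total user base into the concurrent capacity you actually size for (the helper and the example population are illustrative; the ratios are the 1:10-1:20 range above):

```python
def concurrent_capacity(total_users: int, active_ratio: float) -> int:
    """Concurrent (actively waiting) users implied by an active-to-total ratio."""
    return max(1, round(total_users * active_ratio))

# A 1,000-user deployment needs capacity for ~50-100 concurrent requests,
# i.e. roughly one 8x H200 node on Llama 4 Scout per the table above.
print(concurrent_capacity(1000, 1 / 10))  # 100
print(concurrent_capacity(1000, 1 / 20))  # 50
```
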
6. KV Cache Memory Calculations

KV Cache per Token: KV_cache_per_token (bytes) = 2 x num_layers x num_kv_heads x head_dim x bytes_per_element
MLA Models
DeepSeek V3 and Kimi K2.5 use Multi-head Latent Attention, compressing KV cache by 70-90%.
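
The per-token formula extends to a full deployment by multiplying by sequence length and batch size (formula F2). The sketch below reproduces the Llama 4 Scout figures in table 6.3 using illustrative architecture values (48 layers, 8 KV heads, head_dim 128 -- assumptions chosen for the example, not published model internals):

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """Keys + values for every layer, KV head, and cached token (FP16 by default)."""
    kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem
    return kv_bytes / 1024**3

print(round(kv_cache_gb(48, 8, 128, 2_048, 1), 2))   # ~0.38 GB -- table 6.3 "2K, 1 user"
print(round(kv_cache_gb(48, 8, 128, 131_072, 8)))    # ~192 GB  -- table 6.3 "128K, 8 users"
```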

6.3 KV Cache by Context & Concurrency (Llama 4 Scout, FP16)

Context | 1 User | 8 Users | 32 Users | 64 Users | 128 Users
2K | 0.4 GB | 3 GB | 12 GB | 24 GB | 48 GB
8K | 1.5 GB | 12 GB | 48 GB | 96 GB | 192 GB
32K | 6 GB | 48 GB | 192 GB | 384 GB | 768 GB
128K | 24 GB | 192 GB | 768 GB | 1,536 GB | 3,072 GB

6.4 KV Cache Optimization

Technique | Savings | Quality Impact | Recommendation
FP8 KV Cache | 50% | Negligible | Strongly recommended on H100/H200
PagedAttention (vLLM) | 20-40% | None | Always use
MLA (DeepSeek/Kimi) | 70-90% | None (architectural) | Native to the model
Sparse Attention (GLM-5) | ~6x | Minimal | Native to the model
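
A hedged vLLM sketch showing how the first two rows are typically enabled (keyword names follow recent vLLM releases and may differ by version; the model ID and memory settings are illustrative, not a prescribed configuration):

```python
from vllm import LLM, SamplingParams

# PagedAttention is vLLM's default; kv_cache_dtype="fp8" halves KV cache memory.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # illustrative model ID
    tensor_parallel_size=8,          # one NVLink node
    kv_cache_dtype="fp8",            # ~50% KV cache savings
    max_model_len=131_072,           # cap context length to bound KV cache growth
    gpu_memory_utilization=0.90,     # leave headroom for activations/overhead
)
outputs = llm.generate(["Summarize MoE sizing trade-offs."],
                       SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```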

7. Latency Requirements & SLOs

Metric | Definition | Chat Target | Code Target
TTFT | Time to First Token | < 500ms | < 100ms
ITL | Inter-Token Latency | < 50ms (20+ tok/s) | < 30ms (33+ tok/s)
TPOT | Time Per Output Token | < 33ms (30+ tok/s) | < 20ms (50+ tok/s)
E2E | End-to-End Latency | < 10-15s | < 3s
tok/s | User Experience | Suitability
< 5 | Noticeably slow, frustrating | Batch only
5-10 | Readable but sluggish | Long-form
10-20 | Good streaming | Chat, RAG
20-40 | Excellent, responsive | Code, chat
40+ | Near-instantaneous | Real-time
Human Reading Speed
~250 words/min = ~6 tokens/second. Model should generate at least 6 tok/s for streaming chat.
Prefill (compute-bound): TTFT = (Input Tokens x Active Parameters x 2 FLOPs/param) / GPU Compute (FLOPS)
Decode (bandwidth-bound): TPOT = Model Size in Memory (bytes) / Memory Bandwidth (bytes/s)
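
A rough Python rendering of both rules. These are idealized roofline estimates that ignore scheduling, communication, and batching effects, so treat the outputs as best-case bounds:

```python
def ttft_s(input_tokens: int, active_params_b: float, gpu_tflops: float) -> float:
    """Prefill is compute-bound: ~2 FLOPs per active parameter per input token."""
    return input_tokens * active_params_b * 1e9 * 2 / (gpu_tflops * 1e12)

def tpot_s(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Decode is bandwidth-bound: each output token re-reads the resident weights."""
    return model_size_gb / bandwidth_gb_s

# Llama 4 Scout in FP8 (~109 GB resident, 17B active), 2K prompt, one H200:
print(f"TTFT ~{ttft_s(2048, 17, 3958):.3f} s")     # ~0.018 s best case
print(f"TPOT ~{tpot_s(109, 4800) * 1000:.1f} ms")  # ~22.7 ms -> ~44 tok/s ceiling
```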

8. Model-to-Hardware Mapping

Model | Min Hardware (FP16) | Recommended (FP8) | Budget (INT4)
Phi-4 14B | 1x H100 PCIe | 1x H100 / 1x Gaudi 3 | DGX Spark
Llama 4 Scout 109B | 4x H100 SXM | 2x H100 / 1x H200 | 1x H100 (INT4)
Qwen 3.5-397B | 16x H100 (2 nodes) | 8x H100 / 4x H200 | 4x H100 (INT4)
DeepSeek V3 671B | Multi-node H100 | 8x H200 (single node) | 8x H100 (AWQ INT4)
GLM-5 744B | Multi-node H100 | 8x H200 (FP8) | Not practical on H100
Kimi K2.5 1T | Multi-node | 8x H200 (INT4) | 8x H200 (INT4, tight)

8.3 DGX Spark Use Cases

Use Case | Models | Performance
Dev & prototyping | Llama 4 Scout, Qwen 3.5-27B, Phi-4 | 25-150 tok/s decode
Fine-tuning (LoRA) | Up to Qwen 3.5-27B, Phi-4 | 760-7,000 tok/s training
Local inference (1 user) | Phi-4, Mistral Small 4 (FP4) | 25-80 tok/s decode
Air-gapped environments | Any MoE up to ~200B (Q4) | Slow but functional

9. Multi-GPU Scaling Configurations

Strategy | When to Use | Communication | Overhead
Tensor Parallelism (TP) | Within a node (NVLink) | 900 GB/s | Low (5-15%)
Pipeline Parallelism (PP) | Across nodes | InfiniBand/RoCE | Medium (10-30%)
Data Parallelism | Independent requests | Minimal | None per-request
Expert Parallelism (EP) | MoE models | NVLink/InfiniBand | Model-dependent

9.2 Performance Scaling

Config | Model (precision) | Throughput | Memory | Investment
1x H100 | Llama 4 Scout (INT4) | ~120-150 tok/s | 80 GB | $35-40K
4x H100 | Qwen 3 235B (FP8) | ~1,400 tok/s | 320 GB | $140-160K
8x H100 | DeepSeek V3 (AWQ) | ~3,000 tok/s | 640 GB | $300K
4x H200 | Qwen 3.5-397B (FP8) | ~4,600 tok/s | 564 GB | $140-175K
8x H200 | DeepSeek V3 (FP8) | ~2,864 tok/s | 1,128 GB | $315K
8x H200 | Llama 4 Scout (FP8) | ~12,432 tok/s | 1,128 GB | $315K
8x Gaudi 3 | Llama 3.3 70B (FP8) | ~18K-21K tok/s | 1,024 GB | ~$158K
Recommendation
Always use NVLink (SXM) for tensor parallelism. PCIe is acceptable only for single-GPU deployments.
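
To see what the overhead ranges in 9.1 mean in practice, a back-of-the-envelope sketch (the 400 tok/s per-GPU baseline is purely hypothetical):

```python
def scaled_throughput(per_gpu_tok_s: float, n_gpus: int, overhead: float) -> float:
    """Ideal linear scaling discounted by the parallelism overhead from section 9."""
    return per_gpu_tok_s * n_gpus * (1 - overhead)

print(scaled_throughput(400, 8, 0.10))  # 2,880 tok/s -- TP within an NVLink node
print(scaled_throughput(400, 8, 0.25))  # 2,400 tok/s -- PP forced across nodes
```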

10. Power, Cooling & Data Center Requirements

Configuration | GPU Power (total) | System Total | Annual Energy Cost (@ $0.10/kWh)
1x DGX Spark | ~500W | ~500W | ~$440
8x H100 SXM (DGX H100) | 5,600W | ~10,200W | ~$8,935
8x H200 SXM (HGX H200) | 5,600W | ~10,200W | ~$8,935
8x Gaudi 3 OAM | 7,200W | ~10,500W | ~$9,198
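
The annual figures above follow directly from 24x7 operation at the quoted system draw; a one-line check (utilization and PUE are deliberately omitted here):

```python
def annual_energy_cost(system_watts: float, usd_per_kwh: float = 0.10) -> float:
    """24x7 draw for one year (8,760 hours), priced per kWh."""
    return system_watts / 1000 * 8760 * usd_per_kwh

print(round(annual_energy_cost(10_200)))  # ~8,935 -> matches the 8-GPU rows above
print(round(annual_energy_cost(500)))     # ~438   -> matches the DGX Spark row
```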
Power Range | Cooling Method | Notes
< 1 kW | Standard office HVAC | Desktop, no special cooling
1-5 kW | Standard rack air cooling | 42U rack, adequate airflow
5-10 kW | Enhanced air / rear-door heat exchanger | Hot/cold aisle recommended
10-20 kW | Direct liquid cooling recommended | 70-75% of heat removed via liquid
20+ kW | Direct liquid cooling mandatory | Supply 40°C / return 50°C

11. Total Cost of Ownership (TCO) Analysis

11.3 Three-Year TCO Comparison

Config | Model | Hardware | 3-Year OpEx | 3-Year TCO | Cost per tok/s
1x DGX Spark | Phi-4 14B | $4,699 | $63K | $67.7K | $1,693 (at 40 tok/s)
8x H100 SXM | DeepSeek V3 (AWQ) | $300K | $420K | $720K | $240 (at 3,000 tok/s)
8x H200 SXM | Llama 4 Scout (FP8) | $350K | $420K | $770K | $62 (at 12,432 tok/s)
8x Gaudi 3 | Llama 3.3 70B (FP8) | $158K | $370K | $528K | $25-29 (at 18K-21K tok/s)

11.4 Self-Hosting Break-Even

Break-even Rule
Self-hosting becomes cost-effective when monthly API spend exceeds $12,000-$19,000.
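
A simple amortization sketch behind that rule, using the 3-year TCO figures from 11.3; it ignores utilization, model-quality differences, and financing, so treat it as a first-pass screen rather than a full comparison:

```python
def monthly_self_host_cost(three_year_tco_usd: float) -> float:
    """3-year TCO spread evenly over 36 months."""
    return three_year_tco_usd / 36

def self_hosting_breaks_even(monthly_api_spend_usd: float,
                             three_year_tco_usd: float) -> bool:
    return monthly_api_spend_usd > monthly_self_host_cost(three_year_tco_usd)

print(round(monthly_self_host_cost(528_000)))     # ~14,667/month -- 8x Gaudi 3
print(round(monthly_self_host_cost(770_000)))     # ~21,389/month -- 8x H200
print(self_hosting_breaks_even(19_000, 528_000))  # True at the top of the API range
```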

12. Sizing Calculator & Formulas

F1: Model Weight Memory -- VRAM_weights (GB) = Total_Parameters (B) x Bytes_per_Parameter
F2: KV Cache Memory -- KV_cache (GB) = 2 x layers x kv_heads x head_dim x seq_len x batch x bytes_per_element / 1024^3
F3: Total GPU Memory -- Total = Weights + KV_cache + (0.1 x Weights) + (0.05 x Weights)
F4: Max Throughput (decode) -- Max_tok/s = Memory_Bandwidth / Model_Size_in_Memory
F5: Required Throughput -- Required_tok/s = (Users x Avg_Output / Target_Latency) x 1.3
F6: GPUs Needed -- GPUs_needed = ceil(Total_VRAM / GPU_Memory)
F7: Max Concurrent Users -- Max_Users = (Total_GPU_Mem - Model_Weights) / KV_cache_per_user
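
The formulas combine naturally into a single sizing pass. The sketch below strings F1, F3, F4, F5, F6, and F7 together; the per-user KV cache figure is an illustrative input (MLA models such as DeepSeek V3 and Kimi K2.5 need far less, per section 6), and F4 is reported as a single-stream decode ceiling, not aggregate throughput:

```python
import math

def size_deployment(total_params_b: float, bytes_per_param: float,
                    kv_gb_per_user: float, users: int, avg_output_tokens: int,
                    target_latency_s: float, gpu_mem_gb: float,
                    gpu_bandwidth_gb_s: float) -> dict:
    weights = total_params_b * bytes_per_param                      # F1
    total_vram = weights + kv_gb_per_user * users + 0.15 * weights  # F3 (10% + 5%)
    gpus = math.ceil(total_vram / gpu_mem_gb)                       # F6
    required = users * avg_output_tokens / target_latency_s * 1.3   # F5
    single_stream = gpu_bandwidth_gb_s * gpus / weights             # F4, per request
    max_users = (gpus * gpu_mem_gb - weights) / kv_gb_per_user      # F7
    return {"gpus": gpus,
            "required_tok_s": round(required),
            "single_stream_decode_tok_s": round(single_stream),
            "max_users_by_memory": int(max_users)}

# DeepSeek V3 in FP8 on H200-class GPUs (141 GB, 4,800 GB/s), 50 chat users,
# ~6 GB KV cache per user (illustrative figure):
print(size_deployment(671, 1.0, 6, 50, 500, 10, 141, 4800))
# -> {'gpus': 8, 'required_tok_s': 3250, 'single_stream_decode_tok_s': 57, 'max_users_by_memory': 76}
```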

12.3-12.4 Quick Sizing Tables

Concurrent Users | Llama 4 Scout (Min) | Llama 4 Scout (Recommended) | DeepSeek V3 (Min) | DeepSeek V3 (Recommended)
1-5 | 1x H200 | 2x H100 | 8x H200 | 8x H200
15-50 | 4x H100 | 8x H200 | 8x H200 | 2x 8-GPU H200
50-100 | 8x H200 | 8x H200 | 2x 8-GPU H200 | 3x 8-GPU H200
200-500 | 2x 8-GPU H200 | 4x 8-GPU H200 | 4x 8-GPU H200 | 8x 8-GPU H200

13. Workload-Specific Recommendations

Chat / Conversational AI
  • Latency: TTFT < 500ms, ITL < 50ms
  • Models: Llama 4 Scout, Qwen 3 235B, DeepSeek V3
  • Target: 20-40 tok/s per user
  • Best HW: H200 SXM
Code Generation
  • Latency: TTFT < 100ms, ITL < 30ms
  • Models: Phi-4 14B, Qwen 3.5-27B
  • Key: Latency-sensitive, high concurrency
  • Best HW: H100 SXM
RAG
  • Latency: TTFT < 1s, ITL < 50ms
  • Models: Qwen 3.5-27B, Llama 4 Scout
  • Key: Long input handling (4K-32K)
  • Best HW: H200 SXM (141 GB for KV cache)
Agentic / Tool-Use
  • Latency: E2E < 60s per step
  • Models: DeepSeek V3, Kimi K2.5, GLM-5
  • Key: Quality > speed
  • Best HW: 8x H200 SXM
Batch Processing
  • Priority: Minimize total processing time
  • Models: Any (Phi-4 to Kimi K2.5)
  • Optimization: Large batches, FP8, EAGLE
  • Best HW: 8x H200 or 8x Gaudi 3

14. Decision Framework

14.1 Budget-Based Selection

Budget | Recommendation
< $10K | DGX Spark -- Phi-4, Qwen 3.5-27B, Mistral Small 4 (dev/prototype)
$10K-$100K | 1-2x H100 PCIe or Gaudi 3 -- <10 users
$100K-$300K | 4-8x H100 SXM or 8x Gaudi 3 -- 10-50 users
$300K-$500K | 8x H200 SXM -- DeepSeek V3, GLM-5, Kimi K2.5
$500K+ | Multi-node H200 or Gaudi 3 -- 100+ users

14.2 Platform Scorecard

Criteria (1-5) | DGX Spark | H100 SXM | H200 SXM | Gaudi 3
Inference speed | 2 | 4 | 5 | 3.5
Memory capacity | 3 | 3 | 5 | 4
Price-performance | 2 | 3 | 4 | 5
Software ecosystem | 4 | 5 | 5 | 2.5
Ease of deployment | 5 | 3 | 3 | 2
Multi-GPU scaling | 1 | 5 | 5 | 3.5
Max model size | 3 | 4 | 5 | 4

14.3 When to Choose Each Platform

DGX Spark
  • Budget under $10K
  • Single-developer prototyping
  • Air-gapped / edge environments
  • Fine-tuning up to 27B (QLoRA)
  • No data center required
H100 SXM
  • Broadest software ecosystem
  • Phi-4 to Llama 4 Scout production
  • Battle-tested infrastructure
  • Multi-GPU tensor parallelism
H200 SXM
  • 141 GB single-GPU capacity
  • 670B-1T+ MoE models
  • Highest inference throughput
  • Long context (128K+ tokens)
Intel Gaudi 3
  • Price-performance priority
  • Standard Llama family models
  • Integrated networking saves $50K+/node
  • Budget-constrained production


Frequently Asked Questions

What hardware do I need to run DeepSeek V3 on-premises?
DeepSeek V3 has 671B total parameters. In FP8, the model weights require 671 GB. With overhead (30-50%), you need ~870-1,000 GB total. Recommended: 8x NVIDIA H200 SXM (1,128 GB). On H100, you need 16 GPUs across two nodes (FP8) or 8 GPUs with aggressive INT4 quantization.

Is the DGX Spark suitable for production inference?
DGX Spark is best for development, prototyping, and single-user inference. Its 273 GB/s memory bandwidth limits decode to 2-50 tok/s. For production with multiple concurrent users, you need datacenter GPUs (H100, H200, Gaudi 3) with 10-17x higher bandwidth.

What is the difference between total and active parameters in MoE models?
Total parameters include all expert networks; active parameters are the subset used per token. Example: Llama 4 Scout has 109B total but only 17B active (1 of 16 experts per token). All parameters must reside in VRAM (the memory requirement), but only active parameters affect compute and bandwidth per token.

When does self-hosting become cheaper than using an API?
Self-hosting generally becomes cost-effective when monthly API spend exceeds $12,000-$19,000, accounting for hardware, power, cooling, staff, and maintenance. At 10M+ tokens/day, self-hosting is significantly cheaper. Data privacy requirements may necessitate self-hosting regardless of cost.

How does Intel Gaudi 3 compare to the NVIDIA H100?
Gaudi 3 achieves 95-170% of H100 performance at ~50% of the hardware cost. For Llama 70B at 8-accelerator scale, Gaudi 3 delivers ~18K-21K tok/s vs the H100's ~22K tok/s. The trade-off is software ecosystem maturity -- NVIDIA has broader model support via vLLM, SGLang, and TensorRT-LLM.

Which quantization method should I use for production?
FP8 with TensorRT-LLM on H100/H200 is recommended. It provides 50% memory savings with ~99.9% quality retention. If memory-constrained, AWQ-INT4 offers 75% savings at ~98.5% quality. On DGX Spark, NVFP4 is optimal.

How many concurrent users can an 8x H200 node support?
For Llama 4 Scout (FP8): 80-120 concurrent chat users at ~12,432 tok/s. For DeepSeek V3 (FP8): 45-55 chat users at ~2,864 tok/s. "Concurrent" means actively waiting for a response -- with 1:10-1:20 active-to-total ratios, 50 concurrent users corresponds to 500-1,000 total users.

15. Sources & References

This guide provides estimated performance based on publicly available benchmarks and vendor specifications as of March 2026. Where exact benchmarks were unavailable, values are marked "(est.)" and derived from parameter count, architecture similarity, and known scaling relationships. Always conduct proof-of-concept benchmarking before finalizing hardware procurement.