Research Report — v2.0 — March 2026

On-Premises Hardware Sizing Guide for LLM Inference

A comprehensive, actionable framework for sizing on-premises hardware for Large Language Model inference. Covers NVIDIA DGX Spark, H100, H200, and Intel Gaudi 3 with formulas, benchmarks, and decision matrices for 11 current-generation models.

In this guide: 4 hardware platforms, 11 LLM models benchmarked, 30+ comparison tables, 9 sizing formulas.

Key figures:
  • 141 GB HBM3e per GPU (H200)
  • 12,432 tok/s (8x H200 + Llama 4 Scout)
  • $4,699 DGX Spark entry point
  • 1T+ max model params (MoE)

1. Executive Summary

This guide provides a comprehensive, actionable framework for sizing on-premises hardware for Large Language Model (LLM) inference. It covers four major hardware platforms -- NVIDIA DGX Spark, NVIDIA H100, NVIDIA H200, and Intel Gaudi 3 -- and provides the formulas, benchmarks, and decision matrices needed to select the right hardware for your deployment.

All benchmark data has been updated to reflect the current generation of open-weight models as of March 2026, including Llama 4 (Scout/Maverick), Qwen 3.5, DeepSeek V3/R1, Kimi K2.5, GLM-5, Mistral Large 3, Mistral Small 4, and Phi-4. Most of these models use Mixture-of-Experts (MoE) architectures, which fundamentally changes sizing: total parameter counts are large (100B-1T+), but active parameters per token are much smaller (6B-40B), making them far more deployable than their headline sizes suggest.

Key Takeaways

Decision Factor | Recommendation
Budget-constrained entry point | NVIDIA DGX Spark ($4,699) for models up to 200B total params (MoE) or ~34B dense
Best price-performance for inference | Intel Gaudi 3 (~$15,625/accelerator) at ~50% the cost of H100
Maximum single-GPU model capacity | NVIDIA H200 (141 GB HBM3e) -- fits Llama 4 Scout (109B MoE) on one GPU (FP8)
Highest throughput at scale | NVIDIA H200 8-GPU (~12,400 tok/s on Llama 4 Scout, ~2,864 tok/s on DeepSeek V3 FP8)
Large MoE deployment (670B-1T+) | H200 8-GPU (single node, FP8) for DeepSeek V3 / Mistral Large 3 / GLM-5 / Kimi K2.5

2. Hardware Platform Specifications

2.1 Comparison Table

Specification | DGX Spark | H100 SXM | H100 PCIe | H200 SXM | H200 NVL | Gaudi 3 OAM | Gaudi 3 PCIe
Architecture | GB10 Grace Blackwell | Hopper | Hopper | Hopper+ | Hopper+ | Gaudi 3 | Gaudi 3
Process Node | 5nm / 4nm | 4nm | 4nm | 4nm | 4nm | 5nm | 5nm
Memory | 128 GB unified (LPDDR5x) | 80 GB HBM3 | 80 GB HBM2e | 141 GB HBM3e | 141 GB HBM3e | 128 GB HBM2e | 128 GB HBM2e
Memory Bandwidth | 273 GB/s | 3,350 GB/s | 2,000 GB/s | 4,800 GB/s | 4,800 GB/s | 3,700 GB/s | 3,700 GB/s
FP8 Compute | 1 PFLOP (FP4 w/ sparsity) | 3,958 TFLOPS | 2,000 TFLOPS | 3,958 TFLOPS | 3,958 TFLOPS | 1,835 TFLOPS | 1,835 TFLOPS
BF16 Compute | ~500 TFLOPS | 1,979 TFLOPS | 1,000 TFLOPS | 1,979 TFLOPS | 1,979 TFLOPS | 1,835 TFLOPS | 1,835 TFLOPS
Interconnect | ConnectX-7 | NVLink 4.0 (900 GB/s) | PCIe Gen5 | NVLink 4.0 (900 GB/s) | NVLink Bridge | 24x 200Gb RoCE | 24x 200Gb RoCE
TDP | 240W-500W+ | 700W | 350W | 700W | 600W | 900W | 600W
Form Factor | Desktop | SXM module | PCIe card | SXM module | PCIe card | OAM module | PCIe card
Price (per unit) | $4,699 | $35K-$40K | $25K-$30K | $30K-$40K | $30K-$35K | ~$15,625 | ~$15,625
8-GPU System | N/A (max 2) | ~$300K | ~$220K | ~$315K+ | ~$280K | ~$158K | ~$158K

2.2 NVIDIA DGX Spark

Core Hardware
  • Chip: GB10 Grace Blackwell Superchip
  • CPU: 20 cores (10x X925 + 10x A725)
  • Memory: 128 GB unified LPDDR5x
  • AI Perf: Up to 1 PFLOP (FP4), ~1,000 TOPS
  • Storage: Up to 4 TB NVMe SSD
  • Price: $4,699
Model Capacity
  • Max (single unit): ~200B params (quantized MoE)
  • Max (2 units): ~400B+ params (FP4 MoE)
  • CES 2026 Update: Up to 2.5x perf improvement

2.3-2.4 NVIDIA H100 & H200

SXM vs PCIe
SXM: Large MoE models requiring multi-GPU tensor parallelism, maximum throughput.
PCIe: Single-GPU inference, cost-sensitive deployments, existing PCIe infrastructure.
H200 Key Advantages
  • Memory: 141 GB HBM3e (76% more than H100)
  • Bandwidth: 4,800 GB/s (43% more than H100)
  • Inference: 37-45% higher throughput vs H100
  • Long context: up to 1.83-2.14x speedup on long-context workloads
  • Energy: same 700W TDP, ~50% better efficiency
2.5 Intel Gaudi 3
  • Architecture: 64 TPCs + GEMM engines
  • Memory: 128 GB HBM2e, 3,700 GB/s
  • Compute: 1,835 TFLOPS FP8/BF16
  • Networking: 24x 200Gb RoCE (saves ~$50K/node)
  • Advantage: ~50% lower cost than H100

3. Performance Benchmarks by Model Size

3.0 Current Model Landscape (March 2026)

The open-weight model landscape has shifted heavily toward Mixture-of-Experts (MoE) architectures.

Model | Total Params | Active Params | Architecture | Context | Use Cases
Phi-4 | 14B | 14B (dense) | Dense Transformer | 16K | Code, reasoning, edge
Qwen 3.5-27B | 27B | 27B (dense) | Dense Transformer | 262K | General purpose, long-context
Qwen 3.5-397B | 397B | 17B | MoE (512 experts) | 262K-1M | Flagship, multimodal
Llama 4 Scout | 109B | 17B | MoE (16 experts) | 10M | Long-context, multimodal
Llama 4 Maverick | 400B | 17B | MoE (128 experts) | 1M | Reasoning, code, agentic
Mistral Small 4 | 119B | 6B | MoE (128 experts) | 128K | Efficient inference, edge
Mistral Large 3 | 675B | 41B | MoE | 256K | Frontier, agentic
DeepSeek V3 | 671B | 37B | MoE (MLA) | 128K | General purpose, reasoning
DeepSeek R1 | 671B | 37B | MoE (reasoning) | 128K | Deep reasoning, STEM
Kimi K2.5 | 1,040B | 32B | MoE (384 experts, MLA) | 256K | Agentic, visual intelligence
GLM-5 | 744B | 40B | MoE (Sparse Attn) | 128K+ | Agentic coding, reasoning

3.1 DGX Spark Benchmarks

Model | Precision | Batch | Prefill (tok/s) | Decode (tok/s) | Framework
Phi-4 14B | FP8 | 1 | ~3,000 (est.) | ~40 (est.) | SGLang
Qwen 3.5-27B | FP8 | 1 | ~2,500 (est.) | ~25 (est.) | vLLM
Llama 4 Scout 109B | FP4 | 1 | ~6,000 (est.) | ~35 (est.) | TensorRT-LLM
Mistral Small 4 119B | FP4 | 1 | ~7,000 (est.) | ~45 (est.) | TensorRT-LLM
Qwen 3 14B | NVFP4 | -- | 5,929 | -- | TensorRT-LLM
DeepSeek-R1 14B (distilled) | FP8 | 8 | 2,074 | 83.5 | SGLang
Qwen 3 235B-A22B (2x Spark) | FP4 | -- | 23,477 | -- | TensorRT-LLM
Key Insight
DGX Spark excels at prefill but is limited on decode (273 GB/s bandwidth). Expect 2-50 tok/s decode. MoE models with low active params (6B-17B) run efficiently. CES 2026 updates delivered up to 2.5x improvements.

3.2-3.3 H100 & H200 Benchmarks

Model | GPUs | Precision | Throughput (tok/s) | Notes
Llama 4 Scout (17B active) | 1x H100 | INT4 | 120-150 | Single-GPU inference
Qwen 3.5-397B (17B active) | 4x H100 | FP8 | ~1,400 aggregate | GPUStack benchmark
DeepSeek V3 (37B active) | 8x H100 | AWQ INT4 | ~3,000 total | GitHub benchmarks
Llama 4 Scout (17B active) | 8x H200 | FP8 | 12,432 | ~1.5x vs H100
Qwen 3.5-397B (17B active) | 4x H200 | FP8 | ~4,600 | ~3.3x vs 4x H100
DeepSeek V3 (37B active) | 8x H200 | FP8 | 2,864 | Single node, FP8
Kimi K2.5 1T (32B active) | 8x H200 | INT4 | ~2,000-3,000 (est.) | Fits single node
GLM-5 744B (40B active) | 8x H200 | FP8 | ~1,215 output | Fits single node
Mistral Large 3 (41B active) | 8x H200 | FP8 | ~2,500-3,500 (est.) | Fits single node
Key Insight
8x H200 (1,128 GB) holds every model in this guide in a single node -- FP8 for models up to ~750B, INT4 for Kimi K2.5 at 1T. This is the primary advantage over H100: 8x H100 (640 GB) cannot hold 670B+ models without aggressive quantization or a second node.

3.4 Intel Gaudi 3 Benchmarks

Model | HPUs | Precision | Throughput (tok/s)
Llama 3.1 8B | 1 | FP8 | 20,705-24,535
Llama 3.1 70B | 8 | FP8 | 18,428-21,448
Llama 3.3 70B | 8 | FP8 | 18,714-21,473
Llama 4 Scout 109B | 8 | FP8 | ~10,000-14,000 (est.)
Key Insight
Gaudi 3 achieves 95-170% of H100 performance at ~50% hardware cost. Software ecosystem is expanding but less mature than NVIDIA.

4. Memory Requirements & Quantization Impact

4.1 Model Weight Memory

Model | Total Params | Active Params | FP16 | FP8 | INT4
Phi-4 | 14B | 14B | 28 GB | 14 GB | 7 GB
Qwen 3.5-27B | 27B | 27B | 54 GB | 27 GB | 13.5 GB
Llama 4 Scout | 109B | 17B | 218 GB | 109 GB | ~55 GB
Qwen 3.5-397B | 397B | 17B | 794 GB | 397 GB | ~199 GB
DeepSeek V3 | 671B | 37B | 1,342 GB | 671 GB | ~336 GB
Mistral Large 3 | 675B | 41B | 1,350 GB | 675 GB | ~338 GB
GLM-5 | 744B | 40B | 1,488 GB | 744 GB | ~372 GB
Kimi K2.5 | 1,040B | 32B | 2,080 GB | 1,040 GB | ~595 GB

4.2 Total VRAM Requirements

Formula: Total VRAM = Model Weights + KV Cache + Activations + Framework Overhead
Practical Rule
Add 30-50% to model weight size for KV cache, activations, and framework overhead.
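
As a quick sanity check, here is a minimal Python sketch of that rule. The function name and the 40% default overhead are illustrative choices within the 30-50% range above, not a fixed standard:

```python
def total_vram_gb(total_params_b: float, bytes_per_param: float,
                  overhead: float = 0.40) -> float:
    """Model weights (params x bytes/param) plus 30-50% overhead
    for KV cache, activations, and framework buffers."""
    weights_gb = total_params_b * bytes_per_param
    return weights_gb * (1 + overhead)

# DeepSeek V3 (671B) in FP8: ~939 GB -> an 8x H200 node (1,128 GB) with headroom
print(round(total_vram_gb(671, 1.0)))
# Llama 4 Scout (109B) in INT4 (0.5 bytes/param): ~76 GB -> fits a single 80 GB H100
print(round(total_vram_gb(109, 0.5)))
```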

4.3 Quantization Impact

Method | Bits | Memory Savings | Throughput Gain | Quality | Best For
FP16/BF16 | 16 | Baseline | Baseline | 100% | Maximum quality
FP8 | 8 | 50% | ~1.5-2.2x | ~99.9% | H100/H200 production
INT8 (W8A8) | 8 | 50% | ~1.5-2x | ~99.96% | General production
GPTQ-INT4 | 4 | 75% | ~2.7x | ~98.1% | Memory-constrained
AWQ-INT4 | 4 | 75% | ~2.7x | ~98.5% | Best INT4 quality
FP4/NVFP4 | 4 | 75% | ~3x | ~97% | DGX Spark / Blackwell

5. Concurrent User Sizing Methodology

Core Formula: Required Throughput (tok/s) = Concurrent Users x Avg Output Tokens / Target Response Time (s)
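
A small sketch of this formula in Python (the helper name is illustrative; the 1.3 headroom factor is the same one used in formula F5 later in this guide):

```python
def required_throughput(concurrent_users: int, avg_output_tokens: int,
                        target_response_s: float, headroom: float = 1.3) -> float:
    """Aggregate decode throughput needed to hit the response-time target."""
    return concurrent_users * avg_output_tokens / target_response_s * headroom

# Chat profile from 5.2 (~500 output tokens, ~10 s target), 50 concurrent users:
print(required_throughput(50, 500, 10))  # 3,250 tok/s -> roughly one 8x H100 node (~3,000 tok/s)
```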

5.2 Workload Profiles

Workload | Avg Input (tok) | Avg Output (tok) | Latency Target | Tokens/Request
Chat | 500-2K | 200-500 | 5-15s | ~500
Code completion | 200-1K | 50-200 | 1-3s | ~150
Summarization | 2K-8K | 200-1K | 10-30s | ~1,000
RAG | 1K-4K | 200-800 | 5-15s | ~800
Agentic | 500-2K | 500-2K | 15-60s | ~2,000
Batch | 1K-32K | 500-4K | Minutes | ~4,000

5.3 User Capacity by Hardware

Hardware | Model | tok/s | Chat Users | Code Users
1x DGX Spark | Phi-4 14B | ~40 | 0-1 | 1-2
1x H100 SXM | Llama 4 Scout (INT4) | ~120-150 | 2-3 | 5-8
8x H100 SXM | DeepSeek V3 (AWQ) | ~3,000 | 40-60 | 100-150
8x H200 SXM | Llama 4 Scout (FP8) | ~12,432 | 80-120 | 200-300
8x H200 SXM | DeepSeek V3 (FP8) | ~2,864 | 45-55 | 100-130
8x Gaudi 3 | Llama 3.3 70B | ~18K-21K | 35-50 | 80-120
Important
"Concurrent users" means actively waiting for a response. Typical active-to-total ratio is 1:10 to 1:20.

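A tiny illustration of how that ratio converts a total user base into the concurrent capacity you actually size for (the helper and the example population are illustrative; the ratios are the 1:10-1:20 range above):

```python
def concurrent_capacity(total_users: int, active_ratio: float) -> int:
    """Concurrent (actively waiting) users implied by an active-to-total ratio."""
    return max(1, round(total_users * active_ratio))

# A 1,000-user deployment needs capacity for ~50-100 concurrent requests,
# i.e. roughly one 8x H200 node on Llama 4 Scout per the table above.
print(concurrent_capacity(1000, 1 / 10))  # 100
print(concurrent_capacity(1000, 1 / 20))  # 50
```
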
6. KV Cache Memory Calculations

KV Cache per Token: KV_cache_per_token (bytes) = 2 x num_layers x num_kv_heads x head_dim x bytes_per_element
MLA Models
DeepSeek V3 and Kimi K2.5 use Multi-head Latent Attention, compressing KV cache by 70-90%.
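
The per-token formula extends to a full deployment by multiplying by sequence length and batch size (formula F2). The sketch below reproduces the Llama 4 Scout figures in table 6.3 using illustrative architecture values (48 layers, 8 KV heads, head_dim 128 -- assumptions chosen for the example, not published model internals):

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """Keys + values for every layer, KV head, and cached token (FP16 by default)."""
    kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem
    return kv_bytes / 1024**3

print(round(kv_cache_gb(48, 8, 128, 2_048, 1), 2))   # ~0.38 GB -- table 6.3 "2K, 1 user"
print(round(kv_cache_gb(48, 8, 128, 131_072, 8)))    # ~192 GB  -- table 6.3 "128K, 8 users"
```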

6.3 KV Cache by Context & Concurrency (Llama 4 Scout, FP16)

Context | 1 User | 8 Users | 32 Users | 64 Users | 128 Users
2K | 0.4 GB | 3 GB | 12 GB | 24 GB | 48 GB
8K | 1.5 GB | 12 GB | 48 GB | 96 GB | 192 GB
32K | 6 GB | 48 GB | 192 GB | 384 GB | 768 GB
128K | 24 GB | 192 GB | 768 GB | 1,536 GB | 3,072 GB

6.4 KV Cache Optimization

Technique | Savings | Quality Impact | Recommendation
FP8 KV Cache | 50% | Negligible | Strongly recommended on H100/H200
PagedAttention (vLLM) | 20-40% | None | Always use
MLA (DeepSeek/Kimi) | 70-90% | None (architectural) | Native to the model
Sparse Attention (GLM-5) | ~6x | Minimal | Native to the model
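
A hedged vLLM sketch showing how the first two rows are typically enabled (keyword names follow recent vLLM releases and may differ by version; the model ID and memory settings are illustrative, not a prescribed configuration):

```python
from vllm import LLM, SamplingParams

# PagedAttention is vLLM's default; kv_cache_dtype="fp8" halves KV cache memory.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # illustrative model ID
    tensor_parallel_size=8,          # one NVLink node
    kv_cache_dtype="fp8",            # ~50% KV cache savings
    max_model_len=131_072,           # cap context length to bound KV cache growth
    gpu_memory_utilization=0.90,     # leave headroom for activations/overhead
)
outputs = llm.generate(["Summarize MoE sizing trade-offs."],
                       SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```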

7. Latency Requirements & SLOs

Metric | Definition | Chat Target | Code Target
TTFT | Time to First Token | < 500ms | < 100ms
ITL | Inter-Token Latency | < 50ms (20+ tok/s) | < 30ms (33+ tok/s)
TPOT | Time Per Output Token | < 33ms (30+ tok/s) | < 20ms (50+ tok/s)
E2E | End-to-End Latency | < 10-15s | < 3s
tok/s | User Experience | Suitability
< 5 | Noticeably slow, frustrating | Batch only
5-10 | Readable but sluggish | Long-form
10-20 | Good streaming | Chat, RAG
20-40 | Excellent, responsive | Code, chat
40+ | Near-instantaneous | Real-time
Human Reading Speed
~250 words/min = ~6 tokens/second. Model should generate at least 6 tok/s for streaming chat.
Prefill (compute-bound): TTFT = (Input Tokens x Active Parameters x 2 FLOPs/param) / GPU Compute (FLOPS)
Decode (bandwidth-bound): TPOT = Model Size in Memory (bytes) / Memory Bandwidth (bytes/s)
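
A rough Python rendering of both rules. These are idealized roofline estimates that ignore scheduling, communication, and batching effects, so treat the outputs as best-case bounds:

```python
def ttft_s(input_tokens: int, active_params_b: float, gpu_tflops: float) -> float:
    """Prefill is compute-bound: ~2 FLOPs per active parameter per input token."""
    return input_tokens * active_params_b * 1e9 * 2 / (gpu_tflops * 1e12)

def tpot_s(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Decode is bandwidth-bound: each output token re-reads the resident weights."""
    return model_size_gb / bandwidth_gb_s

# Llama 4 Scout in FP8 (~109 GB resident, 17B active), 2K prompt, one H200:
print(f"TTFT ~{ttft_s(2048, 17, 3958):.3f} s")     # ~0.018 s best case
print(f"TPOT ~{tpot_s(109, 4800) * 1000:.1f} ms")  # ~22.7 ms -> ~44 tok/s ceiling
```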

8. Model-to-Hardware Mapping

Model | Min Hardware (FP16) | Recommended (FP8) | Budget (INT4)
Phi-4 14B | 1x H100 PCIe | 1x H100 / 1x Gaudi 3 | DGX Spark
Llama 4 Scout 109B | 4x H100 SXM | 2x H100 / 1x H200 | 1x H100 (INT4)
Qwen 3.5-397B | 16x H100 (2 nodes) | 8x H100 / 4x H200 | 4x H100 (INT4)
DeepSeek V3 671B | Multi-node H100 | 8x H200 (single node) | 8x H100 (AWQ INT4)
GLM-5 744B | Multi-node H100 | 8x H200 (FP8) | Not practical on H100
Kimi K2.5 1T | Multi-node | 8x H200 (INT4) | 8x H200 (INT4, tight)

8.3 DGX Spark Use Cases

Use Case | Models | Performance
Dev & prototyping | Llama 4 Scout, Qwen 3.5-27B, Phi-4 | 25-150 tok/s decode
Fine-tuning (LoRA) | Up to Qwen 3.5-27B, Phi-4 | 760-7,000 tok/s training
Local inference (1 user) | Phi-4, Mistral Small 4 (FP4) | 25-80 tok/s decode
Air-gapped environments | Any MoE up to ~200B (Q4) | Slow but functional

9. Multi-GPU Scaling Configurations

Strategy | When to Use | Communication | Overhead
Tensor Parallelism (TP) | Within a node (NVLink) | 900 GB/s | Low (5-15%)
Pipeline Parallelism (PP) | Across nodes | InfiniBand/RoCE | Medium (10-30%)
Data Parallelism | Independent requests | Minimal | None per-request
Expert Parallelism (EP) | MoE models | NVLink/InfiniBand | Model-dependent

9.2 Performance Scaling

Config | Model (precision) | Throughput | Memory | Investment
1x H100 | Llama 4 Scout (INT4) | ~120-150 tok/s | 80 GB | $35-40K
4x H100 | Qwen 3 235B (FP8) | ~1,400 tok/s | 320 GB | $140-160K
8x H100 | DeepSeek V3 (AWQ) | ~3,000 tok/s | 640 GB | $300K
4x H200 | Qwen 3.5-397B (FP8) | ~4,600 tok/s | 564 GB | $140-175K
8x H200 | DeepSeek V3 (FP8) | ~2,864 tok/s | 1,128 GB | $315K
8x H200 | Llama 4 Scout (FP8) | ~12,432 tok/s | 1,128 GB | $315K
8x Gaudi 3 | Llama 3.3 70B (FP8) | ~18K-21K tok/s | 1,024 GB | ~$158K
Recommendation
Always use NVLink (SXM) for tensor parallelism. PCIe is acceptable only for single-GPU deployments.
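
To see what the overhead ranges in 9.1 mean in practice, a back-of-the-envelope sketch (the 400 tok/s per-GPU baseline is purely hypothetical):

```python
def scaled_throughput(per_gpu_tok_s: float, n_gpus: int, overhead: float) -> float:
    """Ideal linear scaling discounted by the parallelism overhead from section 9."""
    return per_gpu_tok_s * n_gpus * (1 - overhead)

print(scaled_throughput(400, 8, 0.10))  # 2,880 tok/s -- TP within an NVLink node
print(scaled_throughput(400, 8, 0.25))  # 2,400 tok/s -- PP forced across nodes
```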

10. Power, Cooling & Data Center Requirements

Configuration | GPU Power (total) | System Total | Annual Energy Cost (@ $0.10/kWh)
1x DGX Spark | ~500W | ~500W | ~$440
8x H100 SXM (DGX H100) | 5,600W | ~10,200W | ~$8,935
8x H200 SXM (HGX H200) | 5,600W | ~10,200W | ~$8,935
8x Gaudi 3 OAM | 7,200W | ~10,500W | ~$9,198
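
The annual figures above follow directly from 24x7 operation at the quoted system draw; a one-line check (utilization and PUE are deliberately omitted here):

```python
def annual_energy_cost(system_watts: float, usd_per_kwh: float = 0.10) -> float:
    """24x7 draw for one year (8,760 hours), priced per kWh."""
    return system_watts / 1000 * 8760 * usd_per_kwh

print(round(annual_energy_cost(10_200)))  # ~8,935 -> matches the 8-GPU rows above
print(round(annual_energy_cost(500)))     # ~438   -> matches the DGX Spark row
```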
Power Range | Cooling Method | Notes
< 1 kW | Standard office HVAC | Desktop, no special cooling
1-5 kW | Standard rack air cooling | 42U rack, adequate airflow
5-10 kW | Enhanced air / rear-door heat exchanger | Hot/cold aisle recommended
10-20 kW | Direct liquid cooling recommended | 70-75% of heat removed via liquid
20+ kW | Direct liquid cooling mandatory | Supply 40°C / return 50°C

11. Total Cost of Ownership (TCO) Analysis

11.3 Three-Year TCO Comparison

Config | Model | Hardware | 3-Year OpEx | 3-Year TCO | Cost per tok/s
1x DGX Spark | Phi-4 14B | $4,699 | $63K | $67.7K | $1,693 (at 40 tok/s)
8x H100 SXM | DeepSeek V3 (AWQ) | $300K | $420K | $720K | $240 (at 3,000 tok/s)
8x H200 SXM | Llama 4 Scout (FP8) | $350K | $420K | $770K | $62 (at 12,432 tok/s)
8x Gaudi 3 | Llama 3.3 70B (FP8) | $158K | $370K | $528K | $25-29 (at 18K-21K tok/s)

11.4 Self-Hosting Break-Even

Break-even Rule
Self-hosting becomes cost-effective when monthly API spend exceeds $12,000-$19,000.
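
A simple amortization sketch behind that rule, using the 3-year TCO figures from 11.3; it ignores utilization, model-quality differences, and financing, so treat it as a first-pass screen rather than a full comparison:

```python
def monthly_self_host_cost(three_year_tco_usd: float) -> float:
    """3-year TCO spread evenly over 36 months."""
    return three_year_tco_usd / 36

def self_hosting_breaks_even(monthly_api_spend_usd: float,
                             three_year_tco_usd: float) -> bool:
    return monthly_api_spend_usd > monthly_self_host_cost(three_year_tco_usd)

print(round(monthly_self_host_cost(528_000)))     # ~14,667/month -- 8x Gaudi 3
print(round(monthly_self_host_cost(770_000)))     # ~21,389/month -- 8x H200
print(self_hosting_breaks_even(19_000, 528_000))  # True at the top of the API range
```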

12. Sizing Calculator & Formulas

F1: Model Weight Memory -- VRAM_weights (GB) = Total_Parameters (B) x Bytes_per_Parameter
F2: KV Cache Memory -- KV_cache (GB) = 2 x layers x kv_heads x head_dim x seq_len x batch x bytes_per_element / 1024^3
F3: Total GPU Memory -- Total = Weights + KV_cache + (0.1 x Weights) + (0.05 x Weights)
F4: Max Throughput (decode) -- Max_tok/s = Memory_Bandwidth / Model_Size_in_Memory
F5: Required Throughput -- Required_tok/s = (Users x Avg_Output / Target_Latency) x 1.3
F6: GPUs Needed -- GPUs_needed = ceil(Total_VRAM / GPU_Memory)
F7: Max Concurrent Users -- Max_Users = (Total_GPU_Mem - Model_Weights) / KV_cache_per_user
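
The formulas combine naturally into a single sizing pass. The sketch below strings F1, F3, F4, F5, F6, and F7 together; the per-user KV cache figure is an illustrative input (MLA models such as DeepSeek V3 and Kimi K2.5 need far less, per section 6), and F4 is reported as a single-stream decode ceiling, not aggregate throughput:

```python
import math

def size_deployment(total_params_b: float, bytes_per_param: float,
                    kv_gb_per_user: float, users: int, avg_output_tokens: int,
                    target_latency_s: float, gpu_mem_gb: float,
                    gpu_bandwidth_gb_s: float) -> dict:
    weights = total_params_b * bytes_per_param                      # F1
    total_vram = weights + kv_gb_per_user * users + 0.15 * weights  # F3 (10% + 5%)
    gpus = math.ceil(total_vram / gpu_mem_gb)                       # F6
    required = users * avg_output_tokens / target_latency_s * 1.3   # F5
    single_stream = gpu_bandwidth_gb_s * gpus / weights             # F4, per request
    max_users = (gpus * gpu_mem_gb - weights) / kv_gb_per_user      # F7
    return {"gpus": gpus,
            "required_tok_s": round(required),
            "single_stream_decode_tok_s": round(single_stream),
            "max_users_by_memory": int(max_users)}

# DeepSeek V3 in FP8 on H200-class GPUs (141 GB, 4,800 GB/s), 50 chat users,
# ~6 GB KV cache per user (illustrative figure):
print(size_deployment(671, 1.0, 6, 50, 500, 10, 141, 4800))
# -> {'gpus': 8, 'required_tok_s': 3250, 'single_stream_decode_tok_s': 57, 'max_users_by_memory': 76}
```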

12.3-12.4 Quick Sizing Tables

Concurrent Users | Llama 4 Scout (Min) | Llama 4 Scout (Recommended) | DeepSeek V3 (Min) | DeepSeek V3 (Recommended)
1-5 | 1x H200 | 2x H100 | 8x H200 | 8x H200
15-50 | 4x H100 | 8x H200 | 8x H200 | 2x 8-GPU H200
50-100 | 8x H200 | 8x H200 | 2x 8-GPU H200 | 3x 8-GPU H200
200-500 | 2x 8-GPU H200 | 4x 8-GPU H200 | 4x 8-GPU H200 | 8x 8-GPU H200

13. Workload-Specific Recommendations

Chat / Conversational AI
  • Latency: TTFT < 500ms, ITL < 50ms
  • Models: Llama 4 Scout, Qwen 3 235B, DeepSeek V3
  • Target: 20-40 tok/s per user
  • Best HW: H200 SXM
Code Generation
  • Latency: TTFT < 100ms, ITL < 30ms
  • Models: Phi-4 14B, Qwen 3.5-27B
  • Key: Latency-sensitive, high concurrency
  • Best HW: H100 SXM
RAG
  • Latency: TTFT < 1s, ITL < 50ms
  • Models: Qwen 3.5-27B, Llama 4 Scout
  • Key: Long input handling (4K-32K)
  • Best HW: H200 SXM (141 GB for KV cache)
Agentic / Tool-Use
  • Latency: E2E < 60s per step
  • Models: DeepSeek V3, Kimi K2.5, GLM-5
  • Key: Quality > speed
  • Best HW: 8x H200 SXM
Batch Processing
  • Priority: Minimize total processing time
  • Models: Any (Phi-4 to Kimi K2.5)
  • Optimization: Large batches, FP8, EAGLE
  • Best HW: 8x H200 or 8x Gaudi 3

14. Decision Framework

14.1 Budget-Based Selection

Budget | Recommendation
< $10K | DGX Spark -- Phi-4, Qwen 3.5-27B, Mistral Small 4 (dev/prototype)
$10K-$100K | 1-2x H100 PCIe or Gaudi 3 -- <10 users
$100K-$300K | 4-8x H100 SXM or 8x Gaudi 3 -- 10-50 users
$300K-$500K | 8x H200 SXM -- DeepSeek V3, GLM-5, Kimi K2.5
$500K+ | Multi-node H200 or Gaudi 3 -- 100+ users

14.2 Platform Scorecard

Criteria (1-5) | DGX Spark | H100 SXM | H200 SXM | Gaudi 3
Inference speed | 2 | 4 | 5 | 3.5
Memory capacity | 3 | 3 | 5 | 4
Price-performance | 2 | 3 | 4 | 5
Software ecosystem | 4 | 5 | 5 | 2.5
Ease of deployment | 5 | 3 | 3 | 2
Multi-GPU scaling | 1 | 5 | 5 | 3.5
Max model size | 3 | 4 | 5 | 4

14.3 When to Choose Each Platform

DGX Spark
  • Budget under $10K
  • Single-developer prototyping
  • Air-gapped / edge environments
  • Fine-tuning up to 27B (QLoRA)
  • No data center required
H100 SXM
  • Broadest software ecosystem
  • Phi-4 to Llama 4 Scout production
  • Battle-tested infrastructure
  • Multi-GPU tensor parallelism
H200 SXM
  • 141 GB single-GPU capacity
  • 670B-1T+ MoE models
  • Highest inference throughput
  • Long context (128K+ tokens)
Intel Gaudi 3
  • Price-performance priority
  • Standard Llama family models
  • Integrated networking saves $50K+/node
  • Budget-constrained production


Frequently Asked Questions

What hardware do I need to run DeepSeek V3 on-premises?
DeepSeek V3 has 671B total parameters. In FP8, the model weights require 671 GB. With overhead (30-50%), you need ~870-1,000 GB total. Recommended: 8x NVIDIA H200 SXM (1,128 GB). On H100, you need 16 GPUs across two nodes (FP8) or 8 GPUs with aggressive INT4 quantization.

Is the DGX Spark suitable for production inference?
DGX Spark is best for development, prototyping, and single-user inference. Its 273 GB/s memory bandwidth limits decode to 2-50 tok/s. For production with multiple concurrent users, you need datacenter GPUs (H100, H200, Gaudi 3) with 10-17x higher bandwidth.

What is the difference between total and active parameters in MoE models?
Total parameters include all expert networks; active parameters are the subset used per token. Example: Llama 4 Scout has 109B total but only 17B active (1 of 16 experts per token). All parameters must reside in VRAM (the memory requirement), but only active parameters affect compute and bandwidth per token.

When does self-hosting become cheaper than using an API?
Self-hosting generally becomes cost-effective when monthly API spend exceeds $12,000-$19,000, accounting for hardware, power, cooling, staff, and maintenance. At 10M+ tokens/day, self-hosting is significantly cheaper. Data privacy requirements may necessitate self-hosting regardless of cost.

How does Intel Gaudi 3 compare to the NVIDIA H100?
Gaudi 3 achieves 95-170% of H100 performance at ~50% of the hardware cost. For Llama 70B at 8-accelerator scale, Gaudi 3 delivers ~18K-21K tok/s vs the H100's ~22K tok/s. The trade-off is software ecosystem maturity -- NVIDIA has broader model support via vLLM, SGLang, and TensorRT-LLM.

Which quantization method should I use for production?
FP8 with TensorRT-LLM on H100/H200 is recommended. It provides 50% memory savings with ~99.9% quality retention. If memory-constrained, AWQ-INT4 offers 75% savings at ~98.5% quality. On DGX Spark, NVFP4 is optimal.

How many concurrent users can an 8x H200 node support?
For Llama 4 Scout (FP8): 80-120 concurrent chat users at ~12,432 tok/s. For DeepSeek V3 (FP8): 45-55 chat users at ~2,864 tok/s. "Concurrent" means actively waiting for a response -- with 1:10-1:20 active-to-total ratios, 50 concurrent users corresponds to 500-1,000 total users.

15. Sources & References

This guide provides estimated performance based on publicly available benchmarks and vendor specifications as of March 2026. Where exact benchmarks were unavailable, values are marked "(est.)" and derived from parameter count, architecture similarity, and known scaling relationships. Always conduct proof-of-concept benchmarking before finalizing hardware procurement.