How to Deploy an LLM On-Premise in 2026
A step-by-step engineering guide to deploying a large language model on hardware you own and control: VRAM and GPU sizing math, model selection, vLLM vs NVIDIA NIM, multi-GPU scaling, quantization, air-gapped setup, CPU inference on Intel Xeon, and total cost of ownership versus the cloud.
On-Premise LLM Deployment in 2026: What This Guide Covers
Deploying a large language model on-premise means running the full inference stack — model weights, serving engine, and API — on hardware you own and control, inside your own network boundary, with no dependency on a third-party API. For enterprises in regulated industries, that control is increasingly non-negotiable: by 2025, roughly 71% of AI infrastructure ran outside the public cloud, a shift driven heavily by financial-services data-residency requirements and the arrival of enforceable AI regulation.
The good news for platform teams is that the open-weight model ecosystem has matured to the point where self-hosted models rival frontier hosted APIs on most enterprise tasks, and the serving software — vLLM, NVIDIA NIM, SGLang, TensorRT-LLM — is production-hardened. The hard part is no longer "can we run it" but "how do we size it correctly and operate it reliably." This guide walks the full decision path in order: when on-prem makes sense, how to choose a model, how to compute exact VRAM requirements, how to select GPUs, how to pick and configure a serving stack, how to scale across GPUs and nodes, how to quantize, how to plan capacity from real demand, how to deploy in an air-gapped enclave, what it actually costs versus cloud, and how to run it in production.
A note on numbers: hardware specs and formulas in this guide are stable, but model versions and software defaults drift monthly. Where we name models, we use families and tiers rather than chasing point releases; where benchmarks are version-specific, we say so. For deeper dives, see our companion Hardware Sizing Guide, LLM Selection Guide, and Best AI Tools for Air-Gapped Environments.
On-Prem vs Cloud: When to Run LLMs in Your Own Data Center
The decision between on-premise and a hosted API turns on four axes: utilization, volume, data sovereignty, and latency.
On-premise wins when inference demand is sustained, predictable, and high-volume. The economics are unforgiving of idle GPUs but generous to busy ones — against hyperscaler on-demand pricing, an owned cluster typically breaks even somewhere above roughly 50–83% sustained GPU utilization, and a fully-utilized owned cluster delivers token costs 8–18x lower than equivalent cloud over a multi-year horizon. It also wins outright when data sovereignty is a legal mandate: under GDPR Article 46, EU financial institutions cannot freely route customer data through US-hosted LLM APIs, and the EU AI Act's general-purpose-AI obligations — enforceable since August 2025 — carry fines up to €35 million or 7% of global turnover. For regulated finance, healthcare, government, and defense, the deployment location is decided before any cost spreadsheet is opened.
Cloud and hosted APIs win when demand is spiky or unpredictable, when volume is low (small and mid-size workloads below ~10M tokens/month outside small-model cases), when you need frontier closed models, or when you must scale fast without capital expenditure. Token prices on hosted APIs also fell roughly 80% from 2025 to 2026, which structurally erodes the on-prem cost advantage over time and should be modeled, not assumed away.
Step 1: Choose the Right Open-Weight Model
License first: what you can legally productize
For enterprise on-prem, the license gates everything — pick the model your legal team can clear before you benchmark quality. Three tiers matter:
Fully permissive: no monthly-active-user caps, no naming obligations, explicit patent grant (Apache). Covers gpt-oss-120b/20b, all Qwen3 models, Mistral Small 3.x and Mixtral (Apache 2.0), and DeepSeek V3/R1 plus Phi-4 (MIT). MIT is the most permissive — DeepSeek even permits downstream distillation.
The Llama 4 Community License adds a clause requiring a separate Meta license if your products exceeded 700 million monthly active users in the calendar month before the model's release date, plus "Built with Llama" attribution and a "Llama-" model-name prefix. It is not OSI-approved open source.
Google's Gemma license permits commercial use after accepting its Terms of Use and Prohibited Use Policy, but is not Apache/MIT and carries redistribution restrictions.
MoE vs dense: the distinction that decides serveability
Mixture-of-Experts (MoE) models — gpt-oss, Qwen 3.5 / 3.6, DeepSeek V3.2 / V4, Kimi K2 Thinking, GLM-5, MiniMax M2.7, Gemma 4 26B-A4B, Llama 4 — activate only a fraction of their parameters per token, which lowers per-token compute and raises throughput. But the critical sizing insight is that VRAM must hold all total parameters (every expert is resident in GPU memory), while only the active parameters drive compute. A 671B MoE with 37B active still needs roughly 700 GB at FP8, and the 2026 frontier open-weights (DeepSeek V4, Kimi K2, GLM-5) are 700B–1.6T total — they require multi-GPU and usually multi-node serving (see Step 6: Scaling). Dense models (Qwen3.6-27B, Phi-4, Gemma 3, Mistral Small) are simpler to serve, have more predictable latency, and are easier to fully fit and quantize on a single GPU — which is why they are often the better choice for constrained single-node on-prem. MoE models are also especially CPU-friendly because so few parameters activate per token — see Option B: Run Inference on Intel Xeon CPUs.
Table 1 — Open-Weight Model Landscape for Enterprise On-Prem (May 2026)
| Model | Total Params | Active Params | Arch | Context | License | On-prem note |
|---|---|---|---|---|---|---|
| Frontier open-weights — multi-GPU / multi-node | ||||||
| DeepSeek V4-Pro | ~1.6T | ~49B | MoE | 1M | MIT | Most permissive frontier; needs a multi-node cluster (8+ GPUs at FP8/INT4) |
| Kimi K2 Thinking | ~1T | ~32B | MoE (reasoning) | 256K | Modified MIT | Top agentic / coding scores (SWE-bench Pro leader); multi-node |
| GLM-5 | ~744B | ~40B | MoE | 200K | MIT | Strong permissive frontier; multi-GPU |
| DeepSeek V3.2 | 671B | ~37B | MoE (MLA, 256+1 shared) | 128K | MIT | Most permissive; distillation allowed; MLA shrinks KV cache ~28x |
| DeepSeek R1 | 671B | 37B | MoE (reasoning) | 128K | MIT | Distilled 1.5–70B variants run on a single GPU |
| MiniMax M2.7 | ~230B | ~10B | MoE | 200K+ | Modified MIT | Long-context agentic; open weights |
| Qwen3.5-397B-A17B | 397B | 17B | MoE (GDN + sparse) | 262K (→1M) | Apache 2.0 | Largest open Qwen flagship; fully permissive |
| Llama 4 Maverick | 400B | 17B | MoE (128 exp), multimodal | 1M | Llama 4 Community | 700M-MAU clause; "Built with Llama" |
| Deployable single-node (1–2x 80GB GPUs) | ||||||
| gpt-oss-120b | 116.8B | 5.1B | MoE (128 exp/4 active) | 131K | Apache 2.0 | Fully permissive; single 80GB GPU via MXFP4 |
| Llama 4 Scout | 109B | 17B | MoE (16 exp), multimodal | up to 10M | Llama 4 Community | Fits 1–2x 80GB; 700M-MAU clause |
| Qwen3-Coder-Next | 80B | 3B | MoE | 256K | Apache 2.0 | Coding / agents; very low active-param footprint |
| Devstral 2 (Mistral) | 123B | 123B | Dense | 256K | Apache 2.0 | Coding-tuned dense; predictable latency |
| Small / single-GPU / edge & CPU | ||||||
| Qwen3.6-27B (dense) | 27B | 27B | Dense | 262K | Apache 2.0 | Single-GPU general / RAG; long context |
| Gemma 3 27B | 27B | 27B | Dense, multimodal | 128K | Gemma (use-restricted) | Commercial OK after terms; not OSI |
| Mistral Small 3.x 24B | 24B | 24B | Dense, multimodal | 128K | Apache 2.0 | Strong single-GPU mid-size pick |
| gpt-oss-20b | 20.9B | 3.6B | MoE (32 exp/4 active) | 131K | Apache 2.0 | Runs in 16GB; ideal for edge / Xeon CPU (Option B) |
| Phi-4 14B | 14B | 14B | Dense | 128K | MIT | Strong math; synthetic-data trained |
Total/active parameter counts marked with "~" are approximate where providers have not published exact figures for the newest frontier releases — verify specs and license terms against the model card before sizing. Benchmark scores and rankings for these models are tracked live on our LLM Benchmark Repository.
Table 1b — Full Variant Lineups: Qwen 3.5, Qwen 3.6 & Gemma 4 (May 2026)
These three families ship a complete size ladder from sub-1B edge models to 397B-parameter MoE flagships, all under permissive licenses — making them the most common starting point for on-prem standardization. The small Gemma 4 and Qwen variants are also the best fit for Intel Xeon CPU inference (Option B); the Gemma 4 26B-A4B MoE is the exact model benchmarked there.
| Variant | Total Params | Active Params | Arch / Modality | Context | License | Best-fit deployment |
|---|---|---|---|---|---|---|
| Qwen 3.6 — open weights (Apr 2026; multimodal, hybrid-thinking) | ||||||
| Qwen3.6-35B-A3B | 35B | ~3B | MoE / text+vision+code | 262K (→1M) | Apache 2.0 | Flagship open MoE; ~21 GB at Q4, ~120 tok/s on one RTX 4090 |
| Qwen3.6-27B | 27B | 27B | Dense / text+vision+code | 262K (→1M) | Apache 2.0 | Flagship-level coding; ~16.8 GB at Q4 on a single consumer GPU |
| Qwen3-Coder-Next | 80B | 3B | MoE / code+agents | 256K | Apache 2.0 | Coding/agent specialist; very low active footprint |
| Qwen 3.5 — full family, 0.8B–397B (Feb 2026; multimodal, GDN + MoE, 262K native) | ||||||
| Qwen3.5-397B-A17B | 397B | 17B | MoE / multimodal | 262K (→1M) | Apache 2.0 | Frontier flagship; multi-node cluster |
| Qwen3.5-122B-A10B | 122B | 10B | MoE / multimodal | 262K | Apache 2.0 | High-end; 2–4x 80 GB GPUs |
| Qwen3.5-35B-A3B | 35B | 3B | MoE / multimodal | 262K | Apache 2.0 | Single-node; throughput-friendly (low active params) |
| Qwen3.5-27B | 27B | 27B | Dense / multimodal | 262K | Apache 2.0 | Single 48–80 GB GPU; predictable latency |
| Qwen3.5-9B | 9B | 9B | Dense / multimodal | 262K | Apache 2.0 | Single 24 GB GPU; punches above its size |
| Qwen3.5-4B | 4B | 4B | Dense / multimodal | 262K | Apache 2.0 | Lightweight agents; edge / Xeon CPU |
| Qwen3.5-2B | 2B | 2B | Dense / multimodal | 262K | Apache 2.0 | Phones, tablets, embedded |
| Qwen3.5-0.8B | 0.8B | 0.8B | Dense / multimodal | 262K | Apache 2.0 | <2 GB VRAM at full precision; micro-edge |
| Gemma 4 — four variants (Apr 2026; multimodal text+image, audio on small) | ||||||
| Gemma 4 31B | 30.7B | 30.7B | Dense / multimodal | 256K | Apache 2.0 | Flagship dense; reportedly rivals far larger models |
| Gemma 4 26B-A4B | 26B | 3.8B | MoE (8 of 128 exp) / multimodal | 256K | Apache 2.0 | MoE; 3.69x faster than 31B dense on Intel Xeon CPU (Option B) |
| Gemma 4 E4B | ~4.5B eff. | ~4.5B eff. | Dense (edge) / multimodal | 128K | Apache 2.0 | Edge-optimized; laptops, workstations, Xeon CPU |
| Gemma 4 E2B | ~2.3B eff. (~5.1B w/ PLE) | ~2.3B eff. | Dense (edge) / multimodal | 128K | Apache 2.0 | Fits ~2 GB at Q4; runs on a Raspberry Pi |
Qwen "Plus" / "Max" tiers (e.g. Qwen3.5-Plus, Qwen 3.7 Max) are hosted, closed-weight Alibaba Cloud endpoints and are not deployable on-prem — only the numbered open-weight variants above ship downloadable weights. Gemma 4 ships under the permissive Apache 2.0 license — a notable change from the use-restricted custom Gemma license used through Gemma 3. Gemma 4 "E" sizes (E2B / E4B) use effective-parameter counts (per-layer embeddings / MatFormer), so on-disk size differs from the effective figure.
Model selection by use case
Table 2 — Use-Case Model Selection (on-prem)
| Use case | Recommended models | Why |
|---|---|---|
| General chat / assistant | Qwen3.6-27B, Gemma 4 31B, Mistral Small 3.x 24B, Llama 4 Scout (if MAU < 700M) | Strong general quality, single-node serveable, permissive (except Llama) |
| RAG / grounded enterprise | Qwen3.6-27B, Gemma 4 31B / 26B-A4B, Phi-4 14B, DeepSeek V3.2 (if cluster available) | Dense, predictable latency, long context, easy to fully fit/quantize |
| Coding | Kimi K2 Thinking, Qwen3-Coder-Next, gpt-oss-120b, DeepSeek V3.2, Devstral 2 | Leading SWE-bench Pro / agentic-coding scores, strong tool use |
| Reasoning / agentic | DeepSeek V4 / R1, Kimi K2 Thinking, GLM-5, Qwen 3.6 (thinking mode), gpt-oss-120b | RL-trained chain-of-thought, configurable reasoning effort |
| Edge / CPU-constrained | Gemma 4 E2B / E4B, gpt-oss-20b (16GB), Qwen3.5-2B / 4B, Phi-4 | Small footprint, on-device / Intel Xeon CPU inference (Option B) |
For RAG specifically, model choice is only half the equation — retrieval quality dominates grounded accuracy. Pair a dense long-context model with a disciplined ingestion pipeline; see Blockify Data Ingestion for how to structure source data before it reaches the model. For a fuller decision tree across every family, see the LLM Selection Guide.
Step 2: Do the VRAM Math (Weights + KV Cache + Overhead)
GPU memory for inference splits into four buckets: model weights, KV cache, activations, and framework/CUDA overhead. Weights and KV cache dominate. Get this math right and the rest of the deployment falls into place; get it wrong and you will either over-buy hardware or hit out-of-memory failures in production.
Model weights
The weights formula is exact:
Table 3 — Bytes per parameter by precision
| Precision | Bytes/param | VRAM per 1B params (weights) | Notes |
|---|---|---|---|
| FP32 | 4 | ~4 GB | Full precision; rarely used for inference |
| FP16 / BF16 | 2 | ~2 GB | Standard inference precision |
| FP8 | 1 | ~1 GB | Native DeepSeek-V3 training/inference precision |
| INT8 | 1 | ~1 GB | 8-bit quantization |
| INT4 / 4-bit | 0.5 | ~0.5 GB | Aggressive quantization (GPTQ/AWQ/GGUF Q4) |
KV cache (the long-context tax)
During decoding the model caches the Key and Value tensors of every prior token so it does not recompute attention each step. NVIDIA's formulas are:
The leading 2 accounts for the separate Key and Value tensors, and hidden_size = num_heads × head_dim. KV cache scales linearly with both context length and batch size while weights stay fixed — so at long context or high concurrency the KV cache can rival or exceed weight memory and becomes the binding constraint. Two corrections keep modern models from matching the naive formula's worst case:
- GQA (Grouped-Query Attention): Replace num_heads with the smaller num_kv_heads. Llama 3 70B has 64 query heads but only 8 KV heads — an 8x KV-cache reduction versus full multi-head attention.
- MLA (Multi-head Latent Attention), DeepSeek-V3: Stores a 512-dim latent per token instead of the full KV, roughly 28x smaller, cutting a ~213.5 GB max cache down to ~7.6 GB.
Activations and framework overhead
Add a runtime multiplier on top of weights. A practical rule of thumb: total VRAM ≈ weights × 1.3–1.5 for moderate concurrency and context, rising to × 1.5–2.0 for long context or high concurrency. Modal's compact sizing formula folds this in:
Worked per-model VRAM tables
Table 4 — Worked VRAM examples (weights + ~15–20% overhead unless noted)
| Model | Params (total / active) | Config (layers / hidden / KV heads) | FP16 total | INT8 | INT4 / 4-bit |
|---|---|---|---|---|---|
| Mistral 7B | 7B / 7B | 32 / 4096 / 8 (GQA) | ~18 GB | ~9 GB | ~5 GB |
| Llama 3.1 8B | 8B / 8B | 32 / 4096 / 8 (GQA) | ~20 GB | ~10 GB | ~6 GB |
| Llama 2 13B | 13B / 13B | 40 / 5120 / 40 (MHA) | ~26 GB | ~14 GB | ~8 GB |
| Llama 3.3 70B | 70B / 70B | 80 / 8192 / 8 (GQA) | ~168 GB | ~84 GB | ~46 GB |
| DeepSeek V3.2 (MoE) | 671B / 37B | 61 / 7168 / MLA (d_c=512) | ~1,543 GB | ~671 GB (FP8) | ~386 GB |
Step 3: Select Your GPUs (H100 / H200 / A100 / L40S / Blackwell / RTX)
The two axes that decide inference: capacity and bandwidth
Two GPU properties govern LLM serving. VRAM capacity gates which model and context length fit at all. Memory bandwidth governs decode/token-generation latency, because decode is memory-bandwidth-bound: every new token streams all model weights from HBM once per forward pass. This is why the H200 — which has compute identical to the H100 but 43% more bandwidth (4.8 TB/s vs 3.35 TB/s) — generates tokens roughly 43% faster in the small-batch (memory-bound) regime, despite no compute uplift.
Data-center and workstation GPU comparison
Table 5 — NVIDIA Data-Center / Pro GPU Specs for LLM Inference (2025–2026)
| GPU (variant) | Arch / Tensor Gen | VRAM | Mem Bandwidth | FP8 (dense / sparse) TFLOPS | FP16/BF16 (dense / sparse) | FP4 (dense / sparse) | NVLink/GPU | TDP |
|---|---|---|---|---|---|---|---|---|
| A100 SXM (40GB) | Ampere / 3rd | 40GB HBM2e | 1,555 GB/s | N/A (no FP8) | 312 / 624 | N/A | NVLink3 600 GB/s | 400W |
| A100 SXM (80GB) | Ampere / 3rd | 80GB HBM2e | ~2,039 GB/s | N/A (no FP8) | 312 / 624 | N/A | NVLink3 600 GB/s | 400W |
| H100 SXM5 (80GB) | Hopper / 4th | 80GB HBM3 | 3,350 GB/s | 1,979 / 3,958 | 989 / 1,979 | N/A | NVLink4 900 GB/s | 700W |
| H100 PCIe (80GB) | Hopper / 4th | 80GB HBM2e | 2,000 GB/s | ~1,513 / ~3,026 | ~756 / ~1,513 | N/A | Bridge 600 GB/s | 350W |
| H200 SXM (141GB) | Hopper / 4th | 141GB HBM3e | 4,800 GB/s | 1,979 / 3,958 | 989 / 1,979 | N/A | NVLink4 900 GB/s | 700W |
| L4 (24GB) | Ada / 4th | 24GB GDDR6 | ~300 GB/s | ~242 / ~485 | ~121 / ~242 | N/A | None (PCIe) | 72W |
| L40S (48GB) | Ada / 4th | 48GB GDDR6 ECC | 864 GB/s | 733 / 1,466 | 366 / 733 | N/A | None (PCIe) | 300W |
| RTX 6000 Ada (48GB) | Ada / 4th | 48GB GDDR6 ECC | 960 GB/s | ~728 / ~1,457 | ~364 / ~728 | N/A | None | 300W |
| B200 SXM (192GB) | Blackwell / 5th | 192GB HBM3e | 8,000 GB/s | 4,500 / 9,000 | 2,250 / 4,500 | 9,000 / 18,000 | NVLink5 1,800 GB/s | 1,000W |
| GB200 (= 2x B200 + Grace) | Blackwell / 5th | 2x192GB HBM3e | 2x 8,000 GB/s | 2x 4,500 dense | 2x 2,250 dense | 2x 9,000 dense | NVLink5 1,800 GB/s | ~2,700W |
| RTX PRO 6000 Blackwell (96GB) | Blackwell / 5th | 96GB GDDR7 ECC | 1,800 GB/s | ~2,000 (AI TOPS class) | — | ~4,000 AI TOPS | None | 600W (WS) / 300W (Server) |
Organize procurement by generation, because precision support tracks it: Ampere (A100) tops out at INT8/FP16 with no FP8; Hopper (H100/H200) adds FP8; Blackwell (B200/GB200, RTX PRO 6000) adds native FP4/NVFP4, which roughly doubles throughput and halves memory versus FP8 and is the 2025–2026 cost-per-token frontier. Note that L40S, L4, and the RTX cards have no NVLink — they scale only over PCIe, which makes them better suited to pipeline parallelism than tensor parallelism (see Step 6).
Table 6 — Approximate Model-Size Fit by VRAM (weights-only, +20–40% for KV/runtime)
| Model size | FP16 weights (~2GB/1B) | INT4 (~0.5GB/1B) | Single-GPU fit (FP16) | Single-GPU fit (INT4) |
|---|---|---|---|---|
| 7B | ~14 GB | ~4 GB | Any 24GB+ (L40S/A100/H100 easily) | Any 8GB+ |
| 13B | ~26 GB | ~7 GB | 48GB+ (L40S / RTX 6000 Ada / A100-80 / H100) | 24GB (L4 tight) |
| 34B | ~68 GB | ~17 GB | 80GB+ (A100-80 / H100); tight | 48GB (L40S / RTX 6000 Ada) |
| 70B | ~140 GB | ~35–40 GB | 141GB H200 single GPU; else 2x H100 (TP) | 48GB tight / 80GB comfortably |
| 180B (Falcon-class) | ~360 GB | ~90 GB | Multi-GPU only | 96GB RTX PRO 6000 / B200 192GB |
| Trillion-param (MoE) | Rack-scale | Rack-scale | GB200 NVL72 (72-GPU NVLink domain) | GB200 NVL72 |
Consumer GPUs (RTX 4090 / 5090): where they fit and where they stop
For 7B–13B single-GPU inference, consumer cards are genuinely competitive and cost a fraction of data-center GPUs — an RTX 4090 can match or beat an A100 on small models. The ceilings are firm: a single 24GB 4090 tops out around 32B at Q4; a 32GB 5090 fits 32B at Q8 and 70B only at aggressive Q2/Q3 with tiny context; comfortable 70B-Q4 needs dual GPUs (48GB combined). Neither consumer card has NVLink, so multi-GPU communication runs over PCIe, achieving roughly 85–90% of NVLink-linked throughput with about a 30% loss versus a monolithic 80GB card. For any model needing 40–80GB+ of VRAM there is no consumer alternative — data-center cards are required.
Table 7 — Consumer vs Data-Center Spec Comparison
| GPU | VRAM | Bandwidth | NVLink | Native FP4 | TDP | MSRP |
|---|---|---|---|---|---|---|
| RTX 4090 | 24GB GDDR6X | ~1.0 TB/s | No | No | 450W | $1,599 |
| RTX 5090 | 32GB GDDR7 | 1.79 TB/s | No | Yes (MXFP4) | 575W | $1,999 |
| A100 | 40/80GB HBM2e | ~2.0 TB/s | Yes | No | 400W | data-center |
| H100 SXM | 80GB HBM3 | ~3.35 TB/s | Yes (NVLink/NVSwitch) | No (FP8) | 700W | data-center |
| H200 | 141GB HBM3e | ~4.8 TB/s | Yes | No (FP8) | 700W | data-center |
Option B: Run Inference on Intel Xeon CPUs with AirgapAI Edge (No GPU)
GPUs are the default path, but they are not the only one. AirgapAI Edge runs LLM inference entirely on Intel Xeon CPUs — no GPU required — using Intel AMX (Advanced Matrix Extensions) acceleration with the OpenVINO Model Server (and llama.cpp built with AMX kernels). For teams with no GPUs, constrained power and cooling, or an existing Xeon fleet to reuse, this turns on-prem LLM serving into a software problem rather than a hardware procurement project. Learn more on the AirgapAI product page, and see the crossover math in Edge AI vs Cloud Economics.
Why MoE makes CPU inference viable
On the same box, the 26B-A4B MoE model ran 3.69x faster than a 31B dense model (8.77 tok/s at INT4) — because the MoE activates only ~4B parameters per token, dramatically easing the CPU memory-bandwidth bottleneck that throttles dense models. CPU inference is fundamentally memory-bandwidth bound: streaming fewer active weights per token is exactly what a CPU needs to stay interactive.
Table B1 — AirgapAI Edge on a Single Half-Socket Intel Xeon 6 (Granite Rapids, 48 cores, 768 GiB)
| Model / Precision | Active Params | Single-stream decode | Aggregate @ 16 concurrent | Cost per page (16-way) | Throughput per box |
|---|---|---|---|---|---|
| Gemma-class 26B-A4B MoE (INT8 / Q8_0) | ~4B of 26B | ~32 tok/s (~3x reading speed) | ~105 tok/s (16/16 success) | ~$0.044 per page | up to ~4,100 pages/day |
| 31B dense model (INT4) | 31B (all) | 8.77 tok/s | — | — | — (MoE 3.69x faster) |
Economics, fully on-prem / in-VPC: at 16-way concurrency on a 600-token-in to 2,000-token-out workload, AirgapAI Edge costs roughly $0.044 per page, about $181/day per box, processing up to ~4,100 pages/day per box — with no GPU, no data egress, and no per-token API fee.
Why CPU inference is fast enough in 2026
AMX-INT8 kernels deliver roughly 2x the throughput of AMX-BF16 on Granite Rapids, turning the tile-matrix unit into the inference workhorse.
INT8/INT4 weight quantization plus an 8-bit (u8) KV cache shrink the memory footprint and the bandwidth the CPU must stream per token.
A free 1.4–2x speedup on RAG and grounded tasks where output echoes input — no draft model required.
Multi-token-prediction / speculative decoding yields up to 2–3x on MoE models, compounding the AMX and quantization gains.
When to choose AirgapAI Edge over GPUs
AirgapAI Edge is fully offline / air-gapped — OpenVINO runs from a local model IR with no telemetry — and pairs with Blockify for on-prem RAG ingestion, so the entire retrieval-and-generation pipeline stays inside your boundary on CPU hardware you already own.
Step 4: Pick a Serving Stack — vLLM vs NVIDIA NIM
The serving engine is what turns model weights into a production API. The two leading choices for on-prem are vLLM (open-source, maximum flexibility) and NVIDIA NIM (enterprise-packaged, vendor-supported). They share an OpenAI-compatible API surface, so application code rarely changes when you switch.
vLLM: the open-source default
vLLM is a high-throughput, memory-efficient inference engine originally from UC Berkeley (2023), built around two innovations:
- PagedAttention applies OS-style virtual-memory paging to the KV cache. Each sequence's KV cache is addressed through a logical block table mapping to non-contiguous physical blocks (default block size 16 tokens), eliminating the contiguous-allocation fragmentation that wasted 60–80% of KV memory in naive serving — reducing waste to under 4%.
- Continuous (in-flight) batching schedules at the per-token level: when a request finishes it immediately frees its KV blocks and the next queued request is admitted on the following step, keeping the GPU near 100% utilized. vLLM cites up to ~4x more tokens/sec versus naive Hugging Face generation.
You launch it with vllm serve <model>, which listens on 0.0.0.0:8000 and exposes /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models, plus /health and a Prometheus /metrics endpoint. Connect with the standard OpenAI Python client by setting base_url='http://localhost:8000/v1'; require auth with --api-key or the VLLM_API_KEY environment variable.
# Minimal single-GPU vLLM launch
vllm serve /models/qwen3-32b --gpu-memory-utilization 0.9 --max-model-len 32768 --api-key "$VLLM_API_KEY"Table 8 — Most-used vllm serve engine arguments
| Flag | Purpose | Default / typical |
|---|---|---|
--tensor-parallel-size | Shard model across GPUs in a node | = GPUs per node |
--pipeline-parallel-size | Split layers across nodes | = number of nodes |
--gpu-memory-utilization | Fraction of VRAM for weights+activations+KV | 0.9 |
--max-model-len | Max context length | model default |
--max-num-batched-tokens | Per-step token budget (controls chunked prefill) | version/model dependent |
--max-num-seqs | Max concurrent sequences | version dependent |
--block-size | KV cache block size (tokens) | 16 |
--kv-cache-dtype | KV cache precision (e.g. fp8) | auto |
--quantization | Weight quantization method | none |
--host / --port | Bind address / port | 0.0.0.0 / 8000 |
The current V1 engine (default since vLLM v0.8.0) is a core rewrite delivering up to 1.7x higher throughput than V0, with FlashAttention 3, piecewise CUDA graphs, and near-zero-overhead prefix caching (under 1% throughput drop even at a 0% cache-hit rate, so it is on by default).
NVIDIA NIM: the enterprise-packaged option
NVIDIA NIM (NVIDIA Inference Microservices) packages a model, an optimized inference engine, and an OpenAI-compatible API server into a single prebuilt Docker container that runs on NVIDIA GPUs anywhere. It auto-selects among TensorRT-LLM, vLLM, and SGLang backends and applies performance-tuned settings; the NIM LLM 2.0 line moved to a "one container, one backend" philosophy built on vLLM for predictable behavior. The default serving port is 8000, with native OpenAI endpoints plus a /metrics observability endpoint.
Prerequisites for the latest NIM LLM (2026): NVIDIA driver 580+ with CUDA 13.0+ (older NIMs accept CUDA 12.1+), Docker ≥ 19.03, and the NVIDIA Container Toolkit; the CUDA Toolkit does not need to be on the host, only the driver. A typical single-node run:
docker run --runtime=nvidia --gpus all --shm-size=16GB -v ~/.cache/nim:/opt/nim/.cache -u $(id -u) -p 8000:8000 <nim-llm-container>Table 9 — NIM offering tiers
| Tier | Purpose | Notable attributes |
|---|---|---|
| NIM Day 0 | Rapid access to newly released models | Earliest availability, less hardening |
| NIM Turbo | Validated performance | Performance-optimized, validated profiles |
| NIM Certified | Enterprise production | CVE patching, OSRB open-source review compliance, AI Enterprise support |
Table 10 — NIM licensing / access tiers
| Tier | Cost | Limits / terms |
|---|---|---|
| Developer Program (free) | $0 | Up to 2 nodes / 16 GPUs; 1,000 inference credits at signup (up to 5,000 on request); research/dev/test only |
| AI Enterprise 90-day eval | $0 for 90 days | Free evaluation license for production validation |
| AI Enterprise (production) | ~$4,500 per GPU/year or ~$1 per GPU/hour (cloud) | Per-GPU pricing (not per-NIM); same price regardless of GPU size; includes support + Certified NIMs |
The AI Enterprise list price (~$4,500/GPU/year) is a starting figure subject to volume and term discounts — confirm with NVIDIA sales.
vLLM vs NIM vs the rest of the field
The honest tradeoff: vLLM gives maximum flexibility, zero license cost, and the fastest access to new open models, at the price of you owning integration, hardening, and support. NIM gives a turnkey container with vendor SLAs, proactive security patching, and validated performance profiles, at the price of NVIDIA AI Enterprise licensing and tighter version coupling. Raw throughput between the top GPU engines is narrow — within roughly 15% — and flips by workload.
Table 11 — On-Prem LLM Serving Engine Comparison (2026)
| Engine | Core Tech | OpenAI-Compatible | Quantization | Throughput Tier | Ease of Setup | Enterprise Support | Best-Fit Use Case |
|---|---|---|---|---|---|---|---|
| vLLM | PagedAttention + continuous batching | Yes | GPTQ, AWQ, FP8 | Highest (100+ QPS) | Moderate | Community / commercial via vendors | General-purpose production multi-user GPU serving |
| NVIDIA NIM | Prebuilt optimized containers | Yes | FP8 + TRT-LLM quant | High | Easy (turnkey) | Yes — NVIDIA AI Enterprise (SLAs, security patches) | Enterprises needing vendor support, stability, security SLAs |
| TensorRT-LLM | Compiled CUDA kernels + KV reuse | Yes (via Triton/serve) | FP8, paged+quantized KV | Highest latency-optimized (NVIDIA-only) | Hard (long compile) | Via NVIDIA AI Enterprise | Latency-sensitive, high-volume, NVIDIA-standardized fleets |
| SGLang | RadixAttention (radix-tree KV reuse) | Yes | FP8, AWQ | Very high on shared-context | Moderate | Community | Agents, RAG, structured generation, high prefix reuse |
| Hugging Face TGI v3 | Chunked prefill + prefix caching | Yes | GPTQ, AWQ, EETQ | High | Moderate | Community (upstream in maintenance mode 2026) | HF-ecosystem teams, long chat histories |
| Ollama | Wraps llama.cpp; auto model mgmt | Yes | GGUF (Q2–Q8) | Medium (10–50 QPS) | Easiest (one command) | Community | Local dev, prototyping |
| llama.cpp | C/C++ GGUF runtime | Yes (server mode) | GGUF (Q2–Q8) | Low-medium (5–30 QPS) | Easy (binary + GGUF) | Community | CPU-only servers, edge, embedded |
Table 12 — Single H100 SXM5 80GB Benchmark, Llama-3.3-70B-Instruct FP8 (~512 in / ~256 out)
| Metric | Concurrency | vLLM v0.18.0 | TensorRT-LLM v1.2.0 | SGLang v0.5.9 |
|---|---|---|---|---|
| Throughput (output tok/s) | 1 req | 120 | 130 | 125 |
| Throughput (output tok/s) | 10 req | 650 | 710 | 680 |
| Throughput (output tok/s) | 50 req | 1,850 | 2,100 | 1,920 |
| Throughput (output tok/s) | 100 req | 2,400 | 2,780 | 2,460 |
| TTFT p50 (ms) | 100 req | 740 | 680 | 710 |
| TTFT p95 (ms) | 100 req | 1,450 | 1,280 | 1,380 |
| Peak VRAM @100 req (GB) | 100 req | 78 | 79 | 78 |
| Cold start | first load | ~62 s | ~28 min (compile) | ~58 s |
The decisive operational figure is the cold start: TensorRT-LLM's ~28-minute first-time engine compile (subsequent reloads ~90s) makes it painful for rapid model iteration, whereas vLLM and SGLang start in about a minute. A common, sound pattern is to develop and prototype on Ollama or llama.cpp, then serve production on vLLM or NIM. For the broader tool landscape, see Best Local AI Tools for Enterprise.
The AI Strategy Blueprint
The executive playbook for aligning AI strategy with infrastructure decisions — covering model selection, deployment architecture, security, and the ROI frameworks behind on-premise and edge AI investments.
Step 5: Understand Throughput and Latency (Tokens/sec, TTFT, ITL)
Four metrics define serving performance:
- Throughput — total output tokens/sec across all concurrent requests.
- TTFT (Time To First Token) — latency from request to first token, dominated by prefill of the input prompt.
- ITL (Inter-Token Latency), a.k.a. TPOT — time between successive output tokens during decode. Per-request decode speed = 1000 / ITL tokens/sec.
- Goodput — throughput that meets your SLOs.
The mechanism that explains everything: prefill is compute-bound, decode is memory-bandwidth-bound. Aggregate system throughput and per-request latency move in opposite directions as concurrency rises — continuous batching keeps the GPU busy and lifts total tokens/sec, but each individual request's ITL grows because the GPU time-slices decode across more sequences.
Concrete anchors: a single H100 running Llama 3.1 8B in vLLM peaks around 12,500 tokens/sec aggregate, with sub-80ms TTFT at low concurrency and ITL of ~11–21ms. For 70B, a single H200 reaches >3,800 tok/s/GPU at FP8 (up to 6.7x faster than A100), and 8x H100 in MLPerf delivered 24,525 tok/s total (~3,066 per GPU). The H100-vs-A100 gap widens sharply with concurrency: at 16 concurrent requests the H100 produced the first token roughly 16x faster than the A100.
Table 13 — Batch size / concurrency effect on throughput (Llama, H200 FP8, TP=1)
| Model | Batch size | Input/Output tokens | Throughput (tok/s) | Takeaway |
|---|---|---|---|---|
| Llama-13B | 1024 | 128/128 | 11,819 | Large batch maximizes aggregate throughput |
| Llama-13B | 128 | 128/2048 | 4,750 | Long output lowers per-batch tok/s |
| Llama-70B | 512 | 128/128 | 3,014 | Peak 70B aggregate at large batch |
| Llama-70B | 64 | 2048/128 | 341 | Long input (prefill) crushes throughput |
| Llama-70B | 32 | 2048/128 | 303 | Smaller batch + long prompt = lowest tok/s |
Long prompts collapse throughput because prefill cost dominates — note how 70B falls from 3,014 tok/s (128 input) to ~303 tok/s (2,048 input).
Table 14 — Latency SLA targets (MLPerf Inference v5.1, Llama 3.1 8B scenarios)
| Scenario | TTFT limit | TPOT / ITL limit | Approx reading speed |
|---|---|---|---|
| Server | ≤ 2 s | ≤ 100 ms | ~480 words/min |
| Interactive | ≤ 0.5 s | ≤ 30 ms | ~1,600 words/min |
| Practical interactive bar (vLLM/H100, up to 70B) | < 200 ms | < 30 ms (8B ITL ~11–21 ms observed) | Fluid streaming |
Step 6: Scale Across GPUs — Tensor, Pipeline, and Expert Parallelism
When a model exceeds one GPU, you shard it. There are three primary strategies, and matching them to your interconnect is what separates a fast cluster from a slow one.
Tensor parallelism (TP) shards each layer's weight matrices across GPUs (the Megatron column-parallel to row-parallel pattern), producing exactly two all-reduce collectives per transformer layer in the forward pass. Llama-3-70B's 80 layers means 160 all-reduce synchronization points per forward pass — so TP is bandwidth-bound and effectively requires NVLink/NVSwitch. On 4x L40 without NVLink, communication can exceed 50% of prefill cost.
Pipeline parallelism (PP) splits the model by layers across stages and only passes activations at stage boundaries, so it tolerates slower inter-node links (InfiniBand or even Ethernet) far better than TP. Expert parallelism (EP) shards MoE experts across GPUs, using all-to-all dispatch/combine; it pairs with data-parallel attention for large MoE models like DeepSeek-V3/R1.
The vLLM decision rule is clean: TP inside a node, PP across nodes, with tensor_parallel_size = GPUs per node and pipeline_parallel_size = number of nodes. For 2 nodes x 8 GPUs: --tensor-parallel-size 8 --pipeline-parallel-size 2. The critical exception: if GPUs lack NVLink (e.g. L40S) or the GPU count does not evenly divide the model, use pipeline parallelism instead of tensor parallelism.
Table 15 — vLLM Parallelism Strategy Selection
| Scenario | Recommended config | Example flags |
|---|---|---|
| Model fits on 1 GPU | Single GPU, no distribution | (none) |
| Single node, multiple GPUs, NVLink present | Tensor parallel = GPU count | --tensor-parallel-size 4 |
| Multi-node, multiple GPUs | TP = GPUs per node, PP = number of nodes | --tensor-parallel-size 8 --pipeline-parallel-size 2 |
| Single node, no NVLink (e.g. L40S) or uneven split | TP=1, PP = GPU count | --tensor-parallel-size 1 --pipeline-parallel-size 8 |
| Large MoE (DeepSeek-V3/R1, Mixtral) | DP attention + EP experts | --tensor-parallel-size 1 --data-parallel-size 8 --enable-expert-parallel |
Table 16 — NVLink / NVSwitch bandwidth by GPU generation
| GPU / Generation | NVLink gen | Per-GPU bandwidth (bidirectional) |
|---|---|---|
| A100 (Ampere) | NVLink 3 | 600 GB/s |
| H100 (Hopper) | NVLink 4 | 900 GB/s |
| Blackwell (B200/GB200) | NVLink 5 | 1,800 GB/s |
| Rubin (announced) | NVLink 6 | 3,600 GB/s |
[send] via NET/IB/GDRDMA (good) versus [send] via NET/Socket (slow fallback). Container requirements for TP: run with --ipc=host --shm-size=16G -v /dev/shm:/dev/shm; on Kubernetes mount a /dev/shm emptyDir and grant IPC_LOCK — a missing /dev/shm is a common cause of hangs and OOMKilled pods. Single-node multi-GPU uses native multiprocessing; multi-node currently requires Ray.Step 7: Quantize for Memory and Throughput (FP8 / INT8 / INT4 / FP4)
Quantization is the highest-leverage lever for fitting a model on fewer GPUs. The precision ladder runs FP32 to FP16/BF16 to FP8 to INT8 to INT4/FP4, halving memory roughly at each step beyond FP16.
The accuracy results are encouraging. Red Hat/Neural Magic's study spanning over 500,000 evaluations on the Llama-3.1 family found FP8 (W8A8-FP) effectively lossless across all model scales, INT8 (W8A8-INT) showing a surprisingly low 1–3% degradation per task, and even INT4 weight-only (W4A16) "more competitive than expected, rivaling 8-bit." The reason FP8 is near-lossless while INT8 needs calibration: FP8's exponential value spacing handles outlier activations gracefully, whereas INT8's uniform spacing needs SmoothQuant-style calibration.
Table 17 — Quantization format comparison (vs FP16/BF16 baseline)
| Format | Bits (W/A) | Memory vs FP16 | Accuracy vs BF16 | Throughput note | Best for |
|---|---|---|---|---|---|
| FP16 / BF16 | 16 / 16 | 1x (baseline) | Baseline | Baseline | Max accuracy, fine-tune |
| FP8 W8A8 (E4M3) | 8 / 8 | ~2x smaller | Effectively lossless (all scales) | ~33% faster tok/s on H100 | High-throughput continuous batching |
| INT8 W8A8 (SmoothQuant) | 8 / 8 | ~2x smaller | 1–3% drop per task | Strong on Ampere/Turing (no FP8 HW) | High-throughput on pre-Ada GPUs |
| INT4 W4A16 (AWQ) | 4 / 16 | ~4x smaller | Competitive, rivals 8-bit | Marlin kernel ~741 tok/s (~10.9x vs no-Marlin) | Latency / low-batch sync serving |
| INT4 W4A16 (GPTQ) | 4 / 16 | ~4x smaller | Slightly below AWQ | Marlin-accelerated on Ampere+ | Latency / low-batch sync serving |
| GGUF Q4_K_M (llama.cpp) | ~4.5 / mixed | ~4x smaller | ~6.74 ppl vs 6.56 BF16 | CPU/mixed | CPU / Apple Silicon / edge |
| bitsandbytes NF4 / INT8 | 4 or 8 / 16 | ~4x / ~2x | NF4 ~6.66 ppl | On-the-fly (no prequant) | Experimentation, QLoRA |
| NVFP4 (Blackwell) | 4 / 4 | ~4x smaller | Near-FP8 with calibration | ~2x math throughput vs FP8 | Blackwell high-throughput serving |
AWQ (activation-aware) slightly edges GPTQ (Hessian-based) on perplexity, and the Marlin kernel makes both fast on Ampere+. Hardware support is the practical constraint:
Table 18 — Engine x GPU-architecture support matrix (2026, version-sensitive)
| Format | Ampere SM8.0/8.6 | Ada SM8.9 | Hopper SM9.0 | Blackwell SM100/103/120 | vLLM | TensorRT-LLM |
|---|---|---|---|---|---|---|
| FP8 W8A8 | No | Yes | Yes | Yes | Yes (Ada/Hopper+) | Yes |
| INT8 W8A8 | Yes | Yes | Yes | No (CC≥10.0 unsupported in vLLM) | Yes | Yes (SmoothQuant) |
| INT4 W4A16 AWQ | Yes | Yes | Yes | Yes | Yes (AutoAWQ + Marlin) | Yes |
| INT4 W4A16 GPTQ | Yes | Yes | Yes | Yes | Yes (GPTQModel + Marlin) | Yes |
| NVFP4 / MXFP4 | No | No | No | Yes | Yes (NVIDIA ModelOpt) | Yes (Blackwell only) |
| GGUF | Yes | Yes | Yes | Yes | Yes | No (llama.cpp) |
| FP8 KV cache | Yes | Yes | Yes | Yes | Yes | Yes |
Get Chapter 1 Free + AI Academy Access
Download the first chapter of The AI Strategy Blueprint and get instant access to our AI Academy — covering infrastructure planning, model selection, and on-premise deployment frameworks.
Step 8: Manage the KV Cache and Long Context
The KV cache caches Keys and Values from prior tokens to avoid O(n²) recomputation each decode step, and at long context it is the primary memory bottleneck, frequently exceeding weight memory. Per-token cost = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element, scaled by tokens × batch.
Table 19 — KV Cache per-token cost and VRAM (Llama 3.1 70B, single sequence)
| Precision | Bytes/element | Per-token KV cost | KV cache at 32K ctx | KV cache at 128K ctx |
|---|---|---|---|---|
| BF16/FP16 | 2 bytes | ~0.31 MB (310 KB) | ~10 GB | ~42.9 GB |
| FP8 (e4m3/e5m2) | 1 byte | ~0.155 MB | ~5 GB | ~21.5 GB |
| NVFP4 / 4-bit (Blackwell) | 0.5 bytes | ~0.078 MB | ~2.7 GB | ~10.7 GB |
Every additional 1,000 tokens of context adds ~310 MB for a 70B-class model at BF16, and FP8 KV-cache quantization halves the footprint. Two techniques tame this:
- GQA/MQA shrink the cache by the ratio of query heads to KV heads. Llama 3.1 70B's 8 KV heads (versus 64 query heads) give an 8x reduction — which can mean 2 GPUs instead of 4 at 128K context.
- Automatic prefix caching (vLLM --enable-prefix-caching, on by default) hashes complete 16-token KV blocks (SHA-256) and reuses them across requests sharing a prefix — system prompts, tool definitions, few-shot examples — with LRU eviction and a cache_salt for multi-tenant isolation.
Table 20 — PagedAttention vs prior serving systems (vLLM paper)
| Metric | Prior systems | vLLM PagedAttention |
|---|---|---|
| KV-cache memory waste | 60%–80% (fragmentation + over-reservation) | under 4% (last partial block only) |
| Throughput vs HF Transformers | 1x | 14x–24x |
| Throughput vs TGI (1 completion) | 1x | 2.2x–2.5x |
Step 9: Tune Batching and Speculative Decoding
Continuous (in-flight) batching is the single biggest throughput lever: rather than padding to a fixed batch, the engine evicts finished requests and admits queued ones every step. The vLLM V1 scheduler can mix prefill and decode in the same step, prioritizing decode then filling the remaining token budget with (chunked) prefill.
Chunked prefill splits a long prompt's prefill across steps so one long request cannot stall all others — the technique introduced by Sarathi-Serve. The tuning tradeoff: a smaller max_num_batched_tokens (e.g. 2048) gives better ITL because fewer prefill tokens stall decodes; a higher value gives better TTFT and throughput.
Speculative decoding drafts k tokens cheaply, then verifies them in one target-model forward pass, accepting the longest valid prefix. vLLM supports n-gram/prompt-lookup, draft-model, EAGLE/EAGLE-3, and Medusa/MTP.
Table 21 — Speculative decoding methods in vLLM
| Method | Proposer | Key config | Notes |
|---|---|---|---|
| n-gram / prompt-lookup | Match trailing n-gram, propose following k tokens | method=ngram, num_speculative_tokens, prompt_lookup_max | Best when output echoes input (RAG, code edit) |
| Draft model | Small separate LLM | model=<draft>, num_speculative_tokens=5 | Needs a quality draft sharing the target vocab |
| EAGLE / EAGLE-3 | Lightweight MLP replacing target transformer stack | method=eagle3, draft_tensor_parallel_size=1 | Top performer; draft runs without TP even if target uses TP |
| Medusa / MTP | Auxiliary heads predict next k tokens | draft_tensor_parallel_size=1 | No separate draft model |
Capacity Planning: A Worked Sizing Example, End to End
This is where the math becomes a purchase order. The flow: define demand to compute memory to compute per-token timing to convert to GPU count to apply SLO-driven utilization ceilings and headroom.
Worked GPU-count example
Suppose peak demand is 1,000 requests/sec, average service time 40 ms, target GPU utilization 70%. Per-H100 service rate = 0.70 / 0.040 = 17.5 RPS. GPU count = ceil(1000 / 17.5) = 60 H100 instances. But the SLO sets the utilization ceiling, because P99 TTFT degrades nonlinearly with concurrency:
Table 22 — P99 TTFT degradation vs concurrency (70B FP8 on H100 SXM5, 512-token prompts)
| Concurrent Requests | P50 TTFT | P99 TTFT | P99/P50 |
|---|---|---|---|
| 8 | 45ms | 90ms | 2.0x |
| 16 | 52ms | 160ms | 3.1x |
| 32 | 68ms | 280ms | 4.1x |
| 64 | 95ms | 480ms | 5.1x |
Table 23 — SLO target to max GPU utilization ceiling, and resulting fleet (1,000 RPS, 40ms service, H100 spot)
| TTFT P99 target | Max GPU utilization | Instances (ceil) | Monthly cost (spot) |
|---|---|---|---|
| 200ms | 55% | 73 | $88,826 |
| 300ms | 63% | 64 | $77,875 |
| 400ms | 70% | 60 | $73,008 |
| 500ms | 75% | 54 | $65,707 |
Air-Gapped and Secure Deployment
For classified, defense, and the most sensitive regulated workloads, air-gapping is the deployment model — and it is an architecture, not a configuration flag. Every runtime dependency must be pre-staged inside the enclave: a signed model registry, GPU inference workers, a local vector DB with a local embedding model, a container-registry mirror, OS/language package mirrors, on-prem observability, and internal PKI. True air-gap means no NAT, no DNS to external hostnames, no public CA chain, and no route by which a packet can leave. The single most common way an "air-gapped" RAG stack secretly breaks the gap is calling a remote embedding API — the embedding model must run inside the enclave alongside the LLM. For a fuller treatment, see Best AI for Air-Gapped Environments.
The workflow is two-phase. On a connected staging host, pre-download models and containers; verify SHA-256/signatures; physically transfer across the gap; then run isolated. For NVIDIA NIM, the connected host sets NGC_API_KEY and LOCAL_NIM_CACHE, runs download-to-cache -p <profile-hash>, copies the cache to AIR_GAP_NIM_CACHE, and the disconnected host mounts it at /opt/nim/.cache and runs the container without NGC_API_KEY or HF_TOKEN — omitting the keys prevents any model-download, registry, or telemetry call. For open-source vLLM, use huggingface-cli/snapshot_download on the connected host, serve a local directory path (not a hub repo ID), and set HF_HUB_OFFLINE=1 so the tokenizer resolves locally.
Table 24 — Telemetry / phone-home kill switches by component (air-gap hardening)
| Component | Variable / mechanism | Effect |
|---|---|---|
| Hugging Face Hub | HF_HUB_OFFLINE=1 | No HTTP to the Hub; cache-only; skips cached-file version check |
| Transformers | TRANSFORMERS_OFFLINE=1 | Loads strictly from local cache |
| HF ecosystem | HF_HUB_DISABLE_TELEMETRY=1 (or DO_NOT_TRACK=1) | Disables usage telemetry across transformers/datasets/diffusers/gradio |
| HF auth | HF_HUB_DISABLE_IMPLICIT_TOKEN=1 | Stops auto-attaching token to read requests |
| vLLM | VLLM_NO_USAGE_STATS=1 / VLLM_DO_NOT_TRACK=1 / ~/.config/vllm/do_not_track | Disables default-on anonymous usage stats |
| NVIDIA NIM (air-gap run) | Omit NGC_API_KEY and HF_TOKEN | Runs from mounted cache with no registry/Hub callouts |
Mirror every container image through a frozen local registry (Harbor, or oc-mirror on OpenShift) and version-pin scanned PyPI/npm/apt snapshots. Updates arrive as signed tarballs (manifests + images + Helm charts) physically walked across the gap, integrity- and signature-verified before staging, on a slow cadence — monthly (healthcare) to quarterly (defense). Use the customer's internal PKI with mTLS between gateway and workers; there is no route to a public CA.
Table 25 — Compliance frameworks for on-prem / air-gapped LLM
| Framework | Key figure / control set | Air-gap relevance |
|---|---|---|
| FedRAMP High | 421 controls | Eliminates boundary-defense & external-monitoring control categories (no boundary) |
| DoD Impact Levels | IL4 = CUI, IL5 = CUI+mission-critical, IL6 = classified to SECRET | Air-gap required/expected at IL5–IL6 |
| CMMC 2.0 Level 2 | NIST SP 800-171 (110 controls) | Eases MP, SC, AC families; avoids 32 CFR Part 170 FedRAMP-Moderate cloud rule on-prem |
| CMMC 2.0 Level 3 | NIST 800-171 + 800-172 enhanced | Highest CUI tier; air-gap simplifies enhanced SC/AC |
| HIPAA | Not required; BAA + "minimum necessary" | Air-gap + HITRUST CSF attestation common for PHI |
| SCIF / classified | Encrypted drives, cleared installers, cross-domain media updates | No external connectivity; physical update channel only |
Total Cost of Ownership: On-Prem vs Cloud
The GPU sticker is only about 35% of five-year TCO — power, cooling, networking, redundancy, and staff make up the rest.
Table 26 — On-Prem GPU Server CAPEX (full system, Lenovo Press 2026, priced Jan 15 2026)
| Config | GPU Setup | GPU Memory | Price (USD) |
|---|---|---|---|
| A | 8x H100 | 80 GB | $250,141.80 |
| B | 8x H200 | 141 GB | $277,897.75 |
| C | 8x B200 | 192 GB | $338,495.75 |
| D | 8x B300 | 288 GB | $461,567.50 |
| E | 4x L40S | 48 GB | $52,390.50 |
An 8x H100 server pulls ~10 kW at full load (~$10,500/yr electricity at $0.12/kWh), with cooling adding ~30%. Staff is typically the single largest line item, exceeding hardware depreciation over three years:
Table 27 — 3-Year TCO of One 8x H100 SXM5 Server (Spheron cost model, 2026)
| Cost Category | Annual | 3-Year Total |
|---|---|---|
| Hardware depreciation | $116,000–150,000 | $350,000–450,000 |
| Power (~10 kW @ $0.12/kWh) | $10,500–10,700 | $31,500–32,100 |
| Cooling (~30% of power) | $3,150–3,210 | $9,450–9,630 |
| Datacenter / colocation | $12,000–24,000 | $36,000–72,000 |
| Networking (InfiniBand) | ~$10,000 | ~$30,000 |
| Storage (NVMe, object) | $5,000–8,000 | $15,000–24,000 |
| Staff (0.5 FTE engineer) | $75,000–100,000 | $225,000–300,000 |
| Maintenance / spares | $5,000–10,000 | $15,000–30,000 |
| TOTAL | ~$236,650–315,910 | ~$711,950–947,730 |
Table 28 — Break-Even Time, On-Prem 8x H100 vs Azure (Lenovo 2026)
| Cloud Pricing Tier | Rate ($/hr, 8-GPU server) | On-Prem Break-Even |
|---|---|---|
| Azure on-demand | $98.32 | ~3.7 months |
| Azure 1-year reserved | $62.92 | ~6 months |
| Azure 5-year reserved | $39.32 | ~10.4 months |
Table 29 — Per-Token Cost: On-Prem vs Cloud/API (Lenovo 2026)
| Model / Config | Throughput | On-Prem $/1M tokens | Cloud/API $/1M tokens | On-Prem advantage |
|---|---|---|---|---|
| Llama-70B, 8x H100 | 30,576 tok/s | $0.11 | $0.89 (Azure H100) | 8x |
| Llama-3.1-405B, 8x B300 | 1,360 tok/s | $4.74 | $29.09 (AWS) | 84% cheaper |
| GPT-5-mini-equivalent open model, 8x H100 | n/a | $0.11 | ~$2.00 (GPT-5 mini API) | ~18x |
Production Operations: Observability, Autoscaling, and Go-Live
Four pillars carry an on-prem LLM from "it runs" to "it runs reliably": observability, autoscaling, health/lifecycle, and go-live readiness.
vLLM exposes Prometheus metrics at /metrics. Monitor the golden signals: latency histograms (time_to_first_token, inter_token_latency, e2e_request_latency, request_queue_time), saturation gauges (num_requests_running, num_requests_waiting, kv_cache_usage_perc), and throughput/health counters (generation_tokens, num_preemptions, prefix-cache hit rate). Triage rule: if num_requests_waiting > 0 consistently, requests are queuing and TTFT is rising — add capacity; if num_requests_waiting == 0 but TTFT is still high, the bottleneck is prefill compute, not scaling. Healthy steady state is zero requests waiting with KV cache below 90%.
Standard Kubernetes HPA on CPU/memory is wrong for GPU inference — the GPU saturates while CPU stays low. Use KEDA scaling on queue depth (num_requests_waiting) per replica via a Prometheus trigger. A reference ScaledObject: threshold ~5 pending, minReplicaCount 1, maxReplicaCount 3, pollingInterval 15s, cooldownPeriod 360s. Model-weight load is the dominant pod-startup cost; a shared weights cache on an NFS-backed PVC cuts startup "from minutes to seconds," making reactive autoscaling feasible.
vLLM's /health confirms only that the engine process is alive — it does not verify the GPU can run a forward pass. Set Kubernetes readinessProbe (initialDelaySeconds 120) and livenessProbe (initialDelaySeconds 180) with high initial delays because model load takes minutes, and drain active streams gracefully on deploy. Version model weights, tokenizer, prompt templates, and inference config together with commit hashes; ship via stable deployment IDs with shadow traffic and canary rollout that auto-rolls-back on TTFT/TPS regression.
Before launch, run a saturation sweep with GuideLLM or genai-perf across realistic input/output lengths to find the knee and set P95/P99 SLOs from observed data. Token-aware rate limits, client retries with jitter, and idempotency keys round out the production posture. The full pre-launch checklist follows below.
Printable On-Prem LLM Requirements Checklist
- Model license cleared by legal (Apache 2.0 / MIT preferred; verify Llama 700M-MAU clause; review Gemma terms)
- Model selected by use case (chat / RAG / coding / reasoning / edge)
- MoE vs dense decision recorded (VRAM bills total params, compute bills active)
- Weights VRAM computed (params × bytes/param × 1.2)
- KV cache budgeted at target context AND concurrency (GQA/MLA-aware)
- Quantization chosen (W4A16 for latency/low-batch; W8A8/FP8 for throughput)
- Max concurrent requests per GPU derived from leftover VRAM
- GPU model selected on capacity AND bandwidth (not just VRAM) — or Intel Xeon + AirgapAI Edge for no-GPU CPU inference
- Precision support verified (FP8 needs Ada/Hopper+; FP4 needs Blackwell; AMX-INT8 on Xeon)
- NVLink present if using tensor parallelism; else plan pipeline parallelism
- InfiniBand/RoCE ≥100 Gbps + GPUDirect RDMA for multi-node TP
- Engine chosen (vLLM / NIM / SGLang / TensorRT-LLM) with rationale
- OpenAI-compatible endpoint + API-key auth configured
- --gpu-memory-utilization, --max-model-len, --max-num-seqs tuned
- Continuous batching + prefix caching confirmed on; speculative decoding benchmarked under real load
- --ipc=host --shm-size=16G / /dev/shm + IPC_LOCK set for multi-GPU
- Demand model built (concurrent users, RPS, in/out tokens)
- GPU count derived two ways (tokens/sec and queueing)
- SLO-driven utilization ceiling applied; scale trigger = concurrency, not CPU
- Peak-to-average headroom added
- All dependencies pre-staged inside enclave (incl. local embedding model)
- Two-phase download/verify/transfer workflow documented; SHA-256 verified
- Telemetry kill switches set (HF_HUB_OFFLINE, VLLM_NO_USAGE_STATS, NIM keys omitted)
- Private registry mirror frozen; packages version-pinned and scanned
- Internal PKI + mTLS; on-prem observability; signed-bundle update cadence defined
- Compliance mapping documented (FedRAMP / CMMC / HIPAA / IL level)
- Prometheus /metrics scraped; Grafana dashboards on golden signals
- Alerts on P95 TTFT regression, queue depth, KV%, preemptions, error rate
- KEDA autoscaling on queue depth validated under load
- Liveness + GPU-level readiness probes; graceful drain on deploy
- Load tested with GuideLLM/genai-perf; P95/P99 SLOs set from data
- Token-aware rate limits; client retries with jitter; idempotency keys
- Model artifacts versioned together; canary + auto-rollback; DR runbooks drilled
Put the Sizing Math to Work
An on-prem deployment is one chapter of a defensible enterprise AI program. Build the strategy behind the infrastructure, then turn this guide into a tailored deployment roadmap.
Frequently Asked Questions
Sources & References
Serving Engines (vLLM, NIM, TensorRT-LLM, SGLang)
- vLLM V1: A Major Upgrade to vLLM's Core Architecture
- vLLM: OpenAI-Compatible / Online Serving
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
- Overview -- NVIDIA NIM for Large Language Models
- Pricing -- NVIDIA AI Enterprise Licensing Guide
- vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (2026)
Sizing, VRAM & KV Cache
- NVIDIA: Mastering LLM Techniques -- Inference Optimization
- VMware: LLM Inference Sizing and Performance Guidance
- Lenovo Press: LLM Sizing Guide (LP2130)
- Modal: How much VRAM do I need for LLM inference?
- Kwon et al.: Efficient Memory Management for LLM Serving with PagedAttention (arXiv 2309.06180)
- DeepSeek-AI: DeepSeek-V3 Technical Report (arXiv 2412.19437)
GPUs, Quantization & Parallelism
- Spheron: NVIDIA H100 vs H200 -- Specs, FP8 Throughput & Cloud Pricing (2026)
- Spheron: GPU Requirements Cheat Sheet 2026
- Red Hat / Neural Magic: 'Give Me BF16 or Give Me Death'? (arXiv 2411.02355)
- vLLM: Quantization
- vLLM: Parallelism and Scaling
- Flash Communication: Reducing Tensor Parallelization Bottleneck (arXiv 2412.04964)
Air-Gap, TCO & Operations
- NVIDIA: Air-Gap Deployment for NIM LLMs
- TrueFoundry: Air-Gapped AI -- Deploying Enterprise LLMs in Regulated Industries
- Hugging Face Hub: Environment Variables Reference
- Spheron: On-Premise vs GPU Cloud -- 2026 Cost and Break-Even Analysis
- Lenovo Press: On-Premise vs Cloud Generative AI TCO (2026, LP2368)
- A Cost-Benefit Analysis of On-Premise LLM Deployment (arXiv 2509.18101)
- vLLM: Production Metrics
- vLLM: Autoscaling with KEDA (production-stack docs)