Step-by-Step Guide • 2026

How to Run an
LLM Locally

Run a powerful AI model on your own computer — fully offline, private, and free. This guide walks you through five steps with Ollama, LM Studio, and llama.cpp: the hardware you need, the best models to pick, install and chat, plus how teams graduate from DIY to a supported, air-gapped option.

TL;DR

How to Run an LLM Locally, Summarized

To run an LLM locally, install a tool (Ollama, LM Studio, or llama.cpp), download a quantized open model that fits your RAM or GPU, then run it — all entirely offline and free. On a modern laptop with 16 GB of memory you can run a 7B–13B model in minutes with no internet, no API key, and no data ever leaving your device. The whole setup takes one download and one command. For multi-seat, supported, or compliance-bound use, a turnkey air-gapped product replaces the DIY stack.

  • Pick a tool: LM Studio (easiest GUI), Ollama (best CLI + API), or llama.cpp (most control)
  • Hardware: 16 GB RAM runs 7B–13B models on CPU or GPU; no GPU strictly required
  • Best models: Llama 3.x, Qwen 2.5, Gemma 2, Mistral — quantized to 4-bit (GGUF)
  • 100% private: after download, zero network calls — prompts never leave your machine
  • Team option: AirgapAI — supported, air-gapped, 1-click, $697 perpetual per seat
At A Glance
16 GB
RAM is enough to run a 7B–13B local model comfortably
~6 GB
Disk for a 4-bit quantized 7B model — one download
0 calls
Network requests after download — fully offline & private
5 steps
From zero to chatting with a local model on your own machine
Trusted by global leaders
Government Acquisitions

How Do You Run an LLM Locally?

You run an LLM locally by installing a runtime (Ollama, LM Studio, or llama.cpp), downloading a quantized open-weight model that fits your memory, and then running that model directly on your CPU or GPU — with no cloud, no API key, and no internet after the initial download. The entire workflow is free and open source, and a capable laptop is enough to start.

The reason interest has exploded is simple: privacy plus capability. Self-hosting AI is now a mainstream choice rather than a niche one — in Stack Overflow's 2024 developer survey, 76% of developers reported using or planning to use AI tools, and a fast-growing share run models locally to avoid sending code and data to third-party APIs (Stack Overflow Developer Survey, 2024). Open-weight models have closed much of the quality gap with proprietary cloud models, so a 7B–14B model on your own machine is genuinely useful for chat, coding help, summarization, and document Q&A.

Scope of this guide

This is the individual / small-team how-to for running a model on one machine. If you need a production deployment serving many users across an organization — GPU servers, vLLM, autoscaling, and security review — follow How to Deploy an LLM On-Premise instead. For the broader concept and trade-offs, see the Local LLM guide and Private LLM for Enterprises.

Free download

Offline AI for Education Therapy Services

  • 75% reduction in documentation time
  • 2,800+ Quick Start Workflows
  • 100% FERPA and HIPAA compliant

Instant download. We'll also email you a copy. No spam.

What Do You Need to Run an LLM Locally? (Hardware Checklist)

You need three things: a local LLM tool, a quantized model file, and enough RAM or VRAM to hold it. Memory is the single most important constraint. A practical rule of thumb is that a 4-bit quantized model uses roughly 0.6–0.7 GB of memory per billion parameters, so a 7B model fits in about 5–6 GB and a 13B in about 9–10 GB, with a few gigabytes of headroom for the operating system and the context window.

  • RAM / VRAM: 16 GB total handles 7B–13B models well. 8 GB is enough for small (3B–7B) models. 24 GB+ of GPU VRAM or 32 GB+ of system RAM opens up 30B–70B models.
  • GPU (optional but faster): an NVIDIA card with 8–24 GB VRAM gives the best speed; Apple Silicon (M-series) Macs are excellent because the GPU shares unified memory.
  • CPU: any modern multi-core CPU works. Newer Intel Core Ultra and AMD chips include an NPU that accelerates on-device AI without a discrete GPU.
  • Disk: 5–50 GB free per model. Quantized 7B files are ~4–6 GB each; larger models grow accordingly.
  • OS: Windows, macOS, and Linux are all fully supported by the three tools below.

Quantization is what makes this feasible on consumer hardware. It compresses model weights from 16-bit to 4-bit (or 5/6/8-bit), cutting memory use by roughly 4x with only a small quality loss. Most local models you download are already quantized in the GGUF format that Ollama, LM Studio, and llama.cpp all read.

1 Step 1: Pick a Tool — Ollama vs LM Studio vs llama.cpp

Choose LM Studio if you want a one-click graphical app, Ollama if you want a clean command line with a built-in API, or llama.cpp if you want maximum control and the leanest possible footprint. All three are free, open source, run the same GGUF models, and work on Windows, macOS, and Linux. In fact, Ollama and LM Studio are both built on top of the llama.cpp engine — so picking is really about the interface you prefer.

Tool Interface Best for API for scripts Learning curve
LM Studio Polished desktop GUI Beginners, non-coders, model browsing Yes (OpenAI-compatible) Lowest
Ollama CLI + local server Developers, automation, app integration Yes (REST + OpenAI-compatible) Low
llama.cpp Command line / library Power users, custom builds, embedded Yes (server binary) Higher

All three are open-source projects: Ollama, LM Studio, llama.cpp.

Our recommendation for most readers: start with LM Studio if you have never run a model before — it has a built-in model catalog, automatic hardware detection, and a chat window. Move to Ollama the moment you want to script the model, plug it into an editor, or expose a local API. Reach for raw llama.cpp only when you need custom compilation flags or are embedding inference into another application.

2 Step 2: Choose a Model That Fits Your Hardware

Pick the largest open-weight model your memory can hold at 4-bit quantization — for most laptops that means a 7B–14B model such as Llama 3.x, Qwen 2.5, Gemma 2, or Mistral. Bigger is generally smarter, but only if it fits in memory without spilling to disk, which collapses speed. Match the model to your RAM first, then to the task.

Your memory Model size to run Good open models Typical use
8 GB 3B–7B (4-bit) Llama 3.2 3B, Phi-3 mini, Gemma 2 2B Quick chat, drafting, simple Q&A
16 GB 7B–14B (4-bit) Llama 3.1 8B, Qwen 2.5 14B, Mistral 7B General assistant, coding help, RAG
32 GB up to ~32B (4-bit) Qwen 2.5 32B, Gemma 2 27B Stronger reasoning, longer context
64 GB+ / 24 GB GPU 70B (4-bit) Llama 3.x 70B, Qwen 2.5 72B Near-frontier quality, fully local

Open models have become remarkably capable. Meta has reported that its Llama family surpassed 1 billion downloads, underscoring how mature the open-weight ecosystem now is (Meta, 2025). For privacy-sensitive work, the practical upside is that a model living on your disk has no usage caps, no per-token billing, and no exposure of your prompts — the same open models (Llama, Gemma, Qwen, Mistral) that power local DIY setups also run inside supported products like AirgapAI.

3 Step 3: Install the Tool and Download the Model

Installation is a single download for each tool, and pulling a model is one command or one click. Everything below runs offline after the model file finishes downloading. Here is the fastest path for each tool.

LM Studio (GUI)

Download the installer from lmstudio.ai, open the app, and use the built-in search to find a model (for example "Llama 3.1 8B Instruct"). LM Studio recommends a quantization that fits your hardware, downloads it, and loads it — no terminal required.

Ollama (CLI)

Install Ollama, then run a single pull-and-chat command such as ollama run llama3.1. Ollama downloads the model the first time and drops you straight into a chat prompt; the same command later starts an instant local session.

llama.cpp (power users)

Clone and build llama.cpp, download a GGUF file from a model hub, and run the llama-cli or llama-server binary pointed at the file. This gives you per-flag control over threads, context length, and GPU offload layers.

Tip: download once, run forever offline

The only step that needs the internet is the initial model download. After that you can disconnect entirely — pull your model files while online, then run them on a plane, in a SCIF, or on an air-gapped machine with zero connectivity.

4 Step 4: Run and Chat (CLI or GUI)

Once the model is loaded, you chat with it exactly like a cloud assistant — in LM Studio's chat window, in Ollama's terminal prompt, or through a local API on your own machine. The first response may take a moment as the model loads into memory; after that, replies stream token by token.

  • GUI chat: in LM Studio, type into the chat box and adjust temperature, context length, and system prompt from the sidebar — no code needed.
  • CLI chat: with Ollama, ollama run llama3.1 opens an interactive prompt; type your message and press Enter to get a streamed reply.
  • Local API: both Ollama and LM Studio expose an OpenAI-compatible endpoint (typically on localhost), so existing apps and scripts can point at your local model by changing one base URL — no key, no cloud.
  • Editor integration: developer tools can connect to that local endpoint for in-editor coding help. For a packaged, supported local coding assistant, see AirgapAI Code.

Expect roughly 5–15 tokens per second for a 7B model on a recent laptop CPU, and 40–100+ tokens per second on a dedicated GPU. If responses feel slow, that is almost always a sign the model is too large for your memory — drop to a smaller model or a more aggressive quantization.

5 Step 5: Add Your Own Documents (RAG Basics)

To make a local LLM answer from your own files, you use retrieval-augmented generation (RAG): your documents are split into chunks, converted to embeddings, stored in a local vector index, and the most relevant pieces are fed to the model with each question. This keeps everything offline while letting the model cite your PDFs, notes, and internal docs.

  • Easiest path: LM Studio and front-ends like Open WebUI or AnythingLLM let you drag in documents and chat over them with a local model — no coding.
  • Developer path: pair Ollama with a local vector database and an embedding model to build a custom RAG pipeline you fully control.
  • The accuracy lever: RAG quality lives or dies on how cleanly your source text is prepared. Messy, duplicated, or poorly chunked documents cause hallucinations.

This is where data optimization matters most. Iternal's Blockify restructures raw documents into compact, deduplicated IdeaBlocks before they reach the vector database — an approach that delivers roughly 78X more accurate retrieval using about 3X fewer tokens, and works with any vector store. For local RAG that needs to be trustworthy, cleaning the data first is the highest-leverage step you can take.

Troubleshooting & Performance Tips

Most local-LLM problems trace back to one cause: the model is too big for your available memory. The fixes below resolve the overwhelming majority of slow, crashing, or out-of-memory sessions.

  • Replies are very slow: the model is spilling out of RAM/VRAM. Use a smaller model or a lower-bit quantization (try 4-bit, or drop from 13B to 7B).
  • Out-of-memory / crash on load: reduce the context length, close other apps, or pick a smaller quant. On GPUs, lower the number of offloaded layers.
  • GPU not being used: confirm GPU offload is enabled in LM Studio's settings or set the GPU-layers flag in Ollama/llama.cpp; verify your drivers (CUDA/Metal) are current.
  • Weak answers: raise the quantization (8-bit over 4-bit if memory allows), choose a larger or more recent model, or improve your system prompt and RAG document quality.
  • Short, cut-off responses: increase the maximum output tokens and the context window in your tool's settings.
The AI Strategy Blueprint book cover
From DIY Setup to Strategy

The AI Strategy Blueprint

Running a model on your laptop is the easy 10%. The hard 70% is people, process, and governance — turning local AI into a sanctioned, secure capability your whole organization can trust. The AI Strategy Blueprint documents that playbook: the 10-20-70 model and the executive commitments behind every secure AI rollout.

5.0 Rating
$24.95

From DIY to Production: The Turnkey Team Option

A DIY local LLM is ideal for one person, but it breaks down for teams the moment you need support, central updates, audit logs, packaging for non-technical staff, or compliance. That is the line where organizations move from Ollama-on-a-laptop to a supported, air-gapped product. AirgapAI is that turnkey option: the same 100% offline privacy you get from DIY, delivered as a one-click install with real support behind it.

Dimension DIY (Ollama / LM Studio) AirgapAI (turnkey)
Offline / air-gapped Yes, after manual setup Yes, by design (SCIF / CMMC-ready)
Install Per-machine, manual One-click, repeatable across seats
Support & updates Community only, self-managed Vendor-supported, centrally updatable
Built-in workflows None — you build them 2,800+ prebuilt workflows included
Non-technical users Hard — needs a terminal/setup Designed for everyone (~89% adoption)
Cost model Free (your time + hardware) $697 perpetual license per seat (no subscription)

AirgapAI runs the same open models you would choose yourself — Llama, Gemma, Qwen, Mistral — and is optimized to run on Intel NPU laptops via OpenVINO, so it gets full local AI without a discrete GPU. Crucially, it keeps every prompt and document on-device, which is why it suits regulated, defense, and government users who cannot send data to a cloud API. For the organization-wide server path (many concurrent users, GPU clusters, vLLM), pair this with How to Deploy an LLM On-Premise and the Private LLM guide. Comparing options? See the best local AI tools for enterprise.

Semantic fact

AirgapAI is a 100% offline, air-gapped enterprise AI assistant from Iternal Technologies, licensed at $697 perpetual per seat, with no subscription and no data leaving the device. Explore AirgapAI.

Why Running an LLM Locally Is Worth It

Running an LLM locally gives you three things a cloud chatbot cannot: total privacy, zero marginal cost, and offline reliability. Your prompts never touch a third-party server, you pay nothing per token, and the model works with no connectivity at all. For individuals that means freedom from usage caps and data exposure; for organizations it means proprietary IP and regulated data stay inside the perimeter.

The data underscores the stakes. IBM's 2024 study put the global average cost of a data breach at USD 4.88 million, a 10% year-over-year increase (IBM Cost of a Data Breach, 2024). Sending sensitive prompts to an external model is one more exposure surface; keeping inference local removes it. That is the entire premise behind on-device AI — and why a growing share of developers and regulated enterprises now run models themselves rather than calling a cloud API.

AI Academy

Skill Up Your Team to Run, Evaluate & Govern Local AI

Running a model is step one. Turning local AI into safe, productive day-to-day work takes skills — prompting, evaluation, RAG, and governance. The Iternal AI Academy delivers role-based training so your whole team can use local AI well, not just install it.

  • 912+ courses across beginner, intermediate, advanced
  • Role-based curricula: Marketing, Sales, Finance, HR, Legal, Operations
  • Certification programs aligned with EU AI Act Article 4 literacy mandate
  • 7-day free trial — start learning in minutes
Explore AI Academy
912+ Courses
7-Day Free Trial
8% Of Managers Have AI Skills Today
$135M Productivity Value / 10K Workers
AI Blueprint Builder

Decide What to Run Locally — and What to Stage

Not every AI use case belongs on a laptop. The free AI Blueprint Builder scores each initiative across value, feasibility, cost, governance, risk, adoption, and execution readiness — so you know which local AI projects to fund now and which to sequence later.

  • Score any use case across 7 evaluation lenses before you commit budget
  • Two modes: rank a portfolio of opportunities, or validate one initiative for approval
  • Built for cross-functional decisioning — CTO, CIO, CISO, CFO, governance, PMO
  • Produces a governance-ready brief: value, feasibility, risk, economics, next step
Open the AI Blueprint Builder
7 Evaluation Lenses
2 Decision Modes
Free To Start a Blueprint
C-Suite Cross-Functional Ready
Expert Guidance

Take Local AI From Laptop to Production

When a personal local LLM needs to become a secure, governed, organization-wide capability, Iternal's consulting team designs the architecture, security, and rollout. Strategy, governance, and a sovereign on-prem product line — AirgapAI, Blockify, and ABYSS Search — behind every engagement.

$566K+ Bundled Technology Value
78x Accuracy Improvement
6 Clients per Year (Max)
Masterclass
$2,497
Self-paced AI strategy training with frameworks and templates
Transformation Program
$150,000
6-month enterprise AI transformation with embedded advisory
Founder's Circle
$750K-$1.5M
Annual strategic partnership with priority access and equity alignment
FAQ

Frequently Asked Questions

To run an LLM locally you need a tool (Ollama, LM Studio, or llama.cpp), a quantized model file, and enough RAM or VRAM to hold it. A 7B-parameter model in 4-bit quantization needs roughly 5-6 GB; an 8 GB GPU or 16 GB of system RAM runs small-to-mid models comfortably. No internet connection is required once the model is downloaded.

Yes. Tools like Ollama, LM Studio, and llama.cpp run quantized models on a modern CPU using system RAM. Expect 5-15 tokens per second for a 7B model on a recent laptop CPU, versus 40-100+ tokens per second on a dedicated GPU. Apple Silicon Macs are especially strong CPU-class performers because the GPU shares unified memory, so a 16 GB M-series Mac handles mid-size models well.

For beginners who want a one-click graphical app, LM Studio is the simplest. For developers who want a clean command line and an API for scripts, Ollama is the most popular choice. For maximum control and the leanest footprint, llama.cpp is the underlying engine both tools build on. All three are free and open source, and all run the same GGUF model files.

Yes. Once you download the model weights, local LLM tools run entirely on your machine with no network calls, so your prompts and data never leave the device. This is the core advantage over cloud chatbots: no data is sent to a third-party API. For regulated or air-gapped environments, a supported product like AirgapAI extends this to a fully offline, auditable deployment with no telemetry.

A rough rule: a 4-bit quantized model needs about 0.6-0.7 GB of memory per billion parameters. A 7B model fits in roughly 5-6 GB, a 13B in about 9-10 GB, and a 70B in around 40 GB. Add a few gigabytes of headroom for the operating system and context window. For most people, 16 GB of RAM runs 7B-13B models well; 32 GB+ or a 24 GB GPU opens up larger models.

Yes. This is called retrieval-augmented generation (RAG). LM Studio and Ollama (paired with a tool like Open WebUI or AnythingLLM) let you point the model at PDFs and notes so answers cite your files. Accuracy depends heavily on how the documents are cleaned and chunked first; data-optimization tools like Blockify restructure source text into IdeaBlocks to dramatically reduce hallucinations on local RAG.

Move on when you need it for more than personal use: multiple seats, central updates, audit logs, security review, or compliance (CMMC, HIPAA, SOC 2). DIY tools have no support, no governance, and no packaging for non-technical staff. A supported product like AirgapAI delivers the same offline privacy with one-click install, 2,800+ built-in workflows, and a $697 perpetual per-seat license instead of an unmanaged setup.

John Byron Hanby IV
About the Author

John Byron Hanby IV

CEO & Founder, Iternal Technologies

John Byron Hanby IV is the founder and CEO of Iternal Technologies, a leading AI platform and consulting firm. He is the author of The AI Strategy Blueprint and The AI Partner Blueprint, the definitive playbooks for enterprise AI transformation and channel go-to-market. He advises Fortune 500 executives, federal agencies, and the world's largest systems integrators on AI strategy, governance, and deployment.