Step-by-Step Guide • 2026

How to Run an
LLM Locally

Run a powerful AI model on your own computer — fully offline, private, and free. This guide walks you through five steps with Ollama, LM Studio, and llama.cpp: the hardware you need, the best models to pick, install and chat, plus how teams graduate from DIY to a supported, air-gapped option.

By John Byron Hanby IV

CEO & Founder, Iternal Technologies • Updated June 2026 • 11 min read

Start the 5-Step Setup

TL;DR

How to Run an LLM Locally, Summarized

To run an LLM locally, install a tool (Ollama, LM Studio, or llama.cpp), download a quantized open model that fits your RAM or GPU, then run it — all entirely offline and free. On a modern laptop with 16 GB of memory you can run a 7B–13B model in minutes with no internet, no API key, and no data ever leaving your device. The whole setup takes one download and one command. For multi-seat, supported, or compliance-bound use, a turnkey air-gapped product replaces the DIY stack.

Pick a tool: LM Studio (easiest GUI), Ollama (best CLI + API), or llama.cpp (most control)
Hardware: 16 GB RAM runs 7B–13B models on CPU or GPU; no GPU strictly required
Best models: Llama 3.x, Qwen 2.5, Gemma 2, Mistral — quantized to 4-bit (GGUF)
100% private: after download, zero network calls — prompts never leave your machine
Team option: AirgapAI — supported, air-gapped, 1-click, $697 perpetual per seat

At A Glance

16 GB

RAM is enough to run a 7B–13B local model comfortably

~6 GB

Disk for a 4-bit quantized 7B model — one download

0 calls

Network requests after download — fully offline & private

5 steps

From zero to chatting with a local model on your own machine

Table of Contents

How to run an LLM locally (overview)
What you need (hardware checklist)
Step 1: Pick a tool (Ollama vs LM Studio vs llama.cpp)
Step 2: Choose a model that fits your hardware
Step 3: Install and download the model
Step 4: Run and chat (CLI / GUI)
Step 5: Add your own documents (RAG basics)
Troubleshooting & performance tips
From DIY to production: the turnkey team option
Skill up your team
Frequently Asked Questions

Trusted by global leaders

How Do You Run an LLM Locally?

You run an LLM locally by installing a runtime (Ollama, LM Studio, or llama.cpp), downloading a quantized open-weight model that fits your memory, and then running that model directly on your CPU or GPU — with no cloud, no API key, and no internet after the initial download. The entire workflow is free and open source, and a capable laptop is enough to start.

The reason interest has exploded is simple: privacy plus capability. Self-hosting AI is now a mainstream choice rather than a niche one — in Stack Overflow's 2024 developer survey, 76% of developers reported using or planning to use AI tools, and a fast-growing share run models locally to avoid sending code and data to third-party APIs (Stack Overflow Developer Survey, 2024). Open-weight models have closed much of the quality gap with proprietary cloud models, so a 7B–14B model on your own machine is genuinely useful for chat, coding help, summarization, and document Q&A.

Scope of this guide

This is the individual / small-team how-to for running a model on one machine. If you need a production deployment serving many users across an organization — GPU servers, vLLM, autoscaling, and security review — follow How to Deploy an LLM On-Premise instead. For the broader concept and trade-offs, see the Local LLM guide and Private LLM for Enterprises.

Free download

Offline AI for Education Therapy Services

75% reduction in documentation time
2,800+ Quick Start Workflows
100% FERPA and HIPAA compliant

Instant download. We'll also email you a copy. No spam.

What Do You Need to Run an LLM Locally? (Hardware Checklist)

You need three things: a local LLM tool, a quantized model file, and enough RAM or VRAM to hold it. Memory is the single most important constraint. A practical rule of thumb is that a 4-bit quantized model uses roughly 0.6–0.7 GB of memory per billion parameters, so a 7B model fits in about 5–6 GB and a 13B in about 9–10 GB, with a few gigabytes of headroom for the operating system and the context window.

RAM / VRAM: 16 GB total handles 7B–13B models well. 8 GB is enough for small (3B–7B) models. 24 GB+ of GPU VRAM or 32 GB+ of system RAM opens up 30B–70B models.
GPU (optional but faster): an NVIDIA card with 8–24 GB VRAM gives the best speed; Apple Silicon (M-series) Macs are excellent because the GPU shares unified memory.
CPU: any modern multi-core CPU works. Newer Intel Core Ultra and AMD chips include an NPU that accelerates on-device AI without a discrete GPU.
Disk: 5–50 GB free per model. Quantized 7B files are ~4–6 GB each; larger models grow accordingly.
OS: Windows, macOS, and Linux are all fully supported by the three tools below.

Quantization is what makes this feasible on consumer hardware. It compresses model weights from 16-bit to 4-bit (or 5/6/8-bit), cutting memory use by roughly 4x with only a small quality loss. Most local models you download are already quantized in the GGUF format that Ollama, LM Studio, and llama.cpp all read.

1 Step 1: Pick a Tool — Ollama vs LM Studio vs llama.cpp

Choose LM Studio if you want a one-click graphical app, Ollama if you want a clean command line with a built-in API, or llama.cpp if you want maximum control and the leanest possible footprint. All three are free, open source, run the same GGUF models, and work on Windows, macOS, and Linux. In fact, Ollama and LM Studio are both built on top of the llama.cpp engine — so picking is really about the interface you prefer.

Tool	Interface	Best for	API for scripts	Learning curve
LM Studio	Polished desktop GUI	Beginners, non-coders, model browsing	Yes (OpenAI-compatible)	Lowest
Ollama	CLI + local server	Developers, automation, app integration	Yes (REST + OpenAI-compatible)	Low
llama.cpp	Command line / library	Power users, custom builds, embedded	Yes (server binary)	Higher

All three are open-source projects: Ollama, LM Studio, llama.cpp.

Our recommendation for most readers: start with LM Studio if you have never run a model before — it has a built-in model catalog, automatic hardware detection, and a chat window. Move to Ollama the moment you want to script the model, plug it into an editor, or expose a local API. Reach for raw llama.cpp only when you need custom compilation flags or are embedding inference into another application.

2 Step 2: Choose a Model That Fits Your Hardware

Pick the largest open-weight model your memory can hold at 4-bit quantization — for most laptops that means a 7B–14B model such as Llama 3.x, Qwen 2.5, Gemma 2, or Mistral. Bigger is generally smarter, but only if it fits in memory without spilling to disk, which collapses speed. Match the model to your RAM first, then to the task.

Your memory	Model size to run	Good open models	Typical use
8 GB	3B–7B (4-bit)	Llama 3.2 3B, Phi-3 mini, Gemma 2 2B	Quick chat, drafting, simple Q&A
16 GB	7B–14B (4-bit)	Llama 3.1 8B, Qwen 2.5 14B, Mistral 7B	General assistant, coding help, RAG
32 GB	up to ~32B (4-bit)	Qwen 2.5 32B, Gemma 2 27B	Stronger reasoning, longer context
64 GB+ / 24 GB GPU	70B (4-bit)	Llama 3.x 70B, Qwen 2.5 72B	Near-frontier quality, fully local

Open models have become remarkably capable. Meta has reported that its Llama family surpassed 1 billion downloads, underscoring how mature the open-weight ecosystem now is (Meta, 2025). For privacy-sensitive work, the practical upside is that a model living on your disk has no usage caps, no per-token billing, and no exposure of your prompts — the same open models (Llama, Gemma, Qwen, Mistral) that power local DIY setups also run inside supported products like AirgapAI.

3 Step 3: Install the Tool and Download the Model

Installation is a single download for each tool, and pulling a model is one command or one click. Everything below runs offline after the model file finishes downloading. Here is the fastest path for each tool.

LM Studio (GUI)

Download the installer from lmstudio.ai, open the app, and use the built-in search to find a model (for example "Llama 3.1 8B Instruct"). LM Studio recommends a quantization that fits your hardware, downloads it, and loads it — no terminal required.

Ollama (CLI)

Install Ollama, then run a single pull-and-chat command such as ollama run llama3.1. Ollama downloads the model the first time and drops you straight into a chat prompt; the same command later starts an instant local session.

llama.cpp (power users)

Clone and build llama.cpp, download a GGUF file from a model hub, and run the llama-cli or llama-server binary pointed at the file. This gives you per-flag control over threads, context length, and GPU offload layers.

Tip: download once, run forever offline

The only step that needs the internet is the initial model download. After that you can disconnect entirely — pull your model files while online, then run them on a plane, in a SCIF, or on an air-gapped machine with zero connectivity.

4 Step 4: Run and Chat (CLI or GUI)

Once the model is loaded, you chat with it exactly like a cloud assistant — in LM Studio's chat window, in Ollama's terminal prompt, or through a local API on your own machine. The first response may take a moment as the model loads into memory; after that, replies stream token by token.

GUI chat: in LM Studio, type into the chat box and adjust temperature, context length, and system prompt from the sidebar — no code needed.
CLI chat: with Ollama, ollama run llama3.1 opens an interactive prompt; type your message and press Enter to get a streamed reply.
Local API: both Ollama and LM Studio expose an OpenAI-compatible endpoint (typically on localhost), so existing apps and scripts can point at your local model by changing one base URL — no key, no cloud.
Editor integration: developer tools can connect to that local endpoint for in-editor coding help. For a packaged, supported local coding assistant, see AirgapAI Code.

Expect roughly 5–15 tokens per second for a 7B model on a recent laptop CPU, and 40–100+ tokens per second on a dedicated GPU. If responses feel slow, that is almost always a sign the model is too large for your memory — drop to a smaller model or a more aggressive quantization.

5 Step 5: Add Your Own Documents (RAG Basics)

To make a local LLM answer from your own files, you use retrieval-augmented generation (RAG): your documents are split into chunks, converted to embeddings, stored in a local vector index, and the most relevant pieces are fed to the model with each question. This keeps everything offline while letting the model cite your PDFs, notes, and internal docs.

Easiest path: LM Studio and front-ends like Open WebUI or AnythingLLM let you drag in documents and chat over them with a local model — no coding.
Developer path: pair Ollama with a local vector database and an embedding model to build a custom RAG pipeline you fully control.
The accuracy lever: RAG quality lives or dies on how cleanly your source text is prepared. Messy, duplicated, or poorly chunked documents cause hallucinations.

This is where data optimization matters most. Iternal's Blockify restructures raw documents into compact, deduplicated IdeaBlocks before they reach the vector database — an approach that delivers roughly 78X more accurate retrieval using about 3X fewer tokens, and works with any vector store. For local RAG that needs to be trustworthy, cleaning the data first is the highest-leverage step you can take.

Troubleshooting & Performance Tips

Most local-LLM problems trace back to one cause: the model is too big for your available memory. The fixes below resolve the overwhelming majority of slow, crashing, or out-of-memory sessions.

Replies are very slow: the model is spilling out of RAM/VRAM. Use a smaller model or a lower-bit quantization (try 4-bit, or drop from 13B to 7B).
Out-of-memory / crash on load: reduce the context length, close other apps, or pick a smaller quant. On GPUs, lower the number of offloaded layers.
GPU not being used: confirm GPU offload is enabled in LM Studio's settings or set the GPU-layers flag in Ollama/llama.cpp; verify your drivers (CUDA/Metal) are current.
Weak answers: raise the quantization (8-bit over 4-bit if memory allows), choose a larger or more recent model, or improve your system prompt and RAG document quality.
Short, cut-off responses: increase the maximum output tokens and the context window in your tool's settings.

From DIY to Production: The Turnkey Team Option

A DIY local LLM is ideal for one person, but it breaks down for teams the moment you need support, central updates, audit logs, packaging for non-technical staff, or compliance. That is the line where organizations move from Ollama-on-a-laptop to a supported, air-gapped product. AirgapAI is that turnkey option: the same 100% offline privacy you get from DIY, delivered as a one-click install with real support behind it.

Dimension	DIY (Ollama / LM Studio)	AirgapAI (turnkey)
Offline / air-gapped	Yes, after manual setup	Yes, by design (SCIF / CMMC-ready)
Install	Per-machine, manual	One-click, repeatable across seats
Support & updates	Community only, self-managed	Vendor-supported, centrally updatable
Built-in workflows	None — you build them	2,800+ prebuilt workflows included
Non-technical users	Hard — needs a terminal/setup	Designed for everyone (~89% adoption)
Cost model	Free (your time + hardware)	$697 perpetual license per seat (no subscription)

AirgapAI runs the same open models you would choose yourself — Llama, Gemma, Qwen, Mistral — and is optimized to run on Intel NPU laptops via OpenVINO, so it gets full local AI without a discrete GPU. Crucially, it keeps every prompt and document on-device — a fully offline AI deployment — which is why it suits regulated, defense, and government users who cannot send data to a cloud API. For the organization-wide server path (many concurrent users, GPU clusters, vLLM), pair this with How to Deploy an LLM On-Premise and the Private LLM guide. Comparing options? See the best local AI tools for enterprise.

Semantic fact

AirgapAI is a 100% offline, air-gapped enterprise AI assistant from Iternal Technologies, licensed at $697 perpetual per seat, with no subscription and no data leaving the device. Explore AirgapAI.

Why Running an LLM Locally Is Worth It

Running an LLM locally gives you three things a cloud chatbot cannot: total privacy, zero marginal cost, and offline reliability. Your prompts never touch a third-party server, you pay nothing per token, and the model works with no connectivity at all. For individuals that means freedom from usage caps and data exposure; for organizations it means proprietary IP and regulated data stay inside the perimeter.

The data underscores the stakes. IBM's 2024 study put the global average cost of a data breach at USD 4.88 million, a 10% year-over-year increase (IBM Cost of a Data Breach, 2024). Sending sensitive prompts to an external model is one more exposure surface; keeping inference local removes it. That is the entire premise behind on-device AI — and why a growing share of developers and regulated enterprises now run models themselves rather than calling a cloud API.

AI Academy

Skill Up Your Team to Run, Evaluate & Govern Local AI

Running a model is step one. Turning local AI into safe, productive day-to-day work takes skills — prompting, evaluation, RAG, and governance. The Iternal AI Academy delivers role-based training so your whole team can use local AI well, not just install it.

912+ courses across beginner, intermediate, advanced
Role-based curricula: Marketing, Sales, Finance, HR, Legal, Operations
Certification programs aligned with EU AI Act Article 4 literacy mandate
7-day free trial — start learning in minutes

Explore AI Academy

912+ Courses

7-Day Free Trial

8% Of Managers Have AI Skills Today

$135M Productivity Value / 10K Workers

AI Blueprint Builder

Decide What to Run Locally — and What to Stage

Not every AI use case belongs on a laptop. The free AI Blueprint Builder scores each initiative across value, feasibility, cost, governance, risk, adoption, and execution readiness — so you know which local AI projects to fund now and which to sequence later.

Score any use case across 7 evaluation lenses before you commit budget
Two modes: rank a portfolio of opportunities, or validate one initiative for approval
Built for cross-functional decisioning — CTO, CIO, CISO, CFO, governance, PMO
Produces a governance-ready brief: value, feasibility, risk, economics, next step

Open the AI Blueprint Builder

7 Evaluation Lenses

2 Decision Modes

Free To Start a Blueprint

C-Suite Cross-Functional Ready

Expert Guidance

Take Local AI From Laptop to Production

When a personal local LLM needs to become a secure, governed, organization-wide capability, Iternal's consulting team designs the architecture, security, and rollout. Strategy, governance, and a sovereign on-prem product line — AirgapAI, Blockify, and ABYSS Search — behind every engagement.

$566K+ Bundled Technology Value

78x Accuracy Improvement

6 Clients per Year (Max)

Masterclass

$2,497

Self-paced AI strategy training with frameworks and templates

Frequently Asked Questions

What do I need to run an LLM locally?

To run an LLM locally you need a tool (Ollama, LM Studio, or llama.cpp), a quantized model file, and enough RAM or VRAM to hold it. A 7B-parameter model in 4-bit quantization needs roughly 5-6 GB; an 8 GB GPU or 16 GB of system RAM runs small-to-mid models comfortably. No internet connection is required once the model is downloaded.

Can I run an LLM locally without a GPU?

Yes. Tools like Ollama, LM Studio, and llama.cpp run quantized models on a modern CPU using system RAM. Expect 5-15 tokens per second for a 7B model on a recent laptop CPU, versus 40-100+ tokens per second on a dedicated GPU. Apple Silicon Macs are especially strong CPU-class performers because the GPU shares unified memory, so a 16 GB M-series Mac handles mid-size models well.

Which is the best tool to run a local LLM in 2026?

For beginners who want a one-click graphical app, LM Studio is the simplest. For developers who want a clean command line and an API for scripts, Ollama is the most popular choice. For maximum control and the leanest footprint, llama.cpp is the underlying engine both tools build on. All three are free and open source, and all run the same GGUF model files.

Is running a local LLM actually private and offline?

Yes. Once you download the model weights, local LLM tools run entirely on your machine with no network calls, so your prompts and data never leave the device. This is the core advantage over cloud chatbots: no data is sent to a third-party API. For regulated or air-gapped environments, a supported product like AirgapAI extends this to a fully offline, auditable deployment with no telemetry.

How much RAM do I need to run a local LLM?

A rough rule: a 4-bit quantized model needs about 0.6-0.7 GB of memory per billion parameters. A 7B model fits in roughly 5-6 GB, a 13B in about 9-10 GB, and a 70B in around 40 GB. Add a few gigabytes of headroom for the operating system and context window. For most people, 16 GB of RAM runs 7B-13B models well; 32 GB+ or a 24 GB GPU opens up larger models.

Can I chat with my own documents using a local LLM?

Yes. This is called retrieval-augmented generation (RAG). LM Studio and Ollama (paired with a tool like Open WebUI or AnythingLLM) let you point the model at PDFs and notes so answers cite your files. Accuracy depends heavily on how the documents are cleaned and chunked first; data-optimization tools like Blockify restructure source text into IdeaBlocks to dramatically reduce hallucinations on local RAG.

When should a team move from a DIY local LLM to a turnkey product?

Move on when you need it for more than personal use: multiple seats, central updates, audit logs, security review, or compliance (CMMC, HIPAA, SOC 2). DIY tools have no support, no governance, and no packaging for non-technical staff. A supported product like AirgapAI delivers the same offline privacy with one-click install, 2,800+ built-in workflows, and a $697 perpetual per-seat license instead of an unmanaged setup.

About the Author

John Byron Hanby IV

CEO & Founder, Iternal Technologies

John Byron Hanby IV is the founder and CEO of Iternal Technologies, a leading AI platform and consulting firm. He is the author of The AI Strategy Blueprint and The AI Partner Blueprint, the definitive playbooks for enterprise AI transformation and channel go-to-market. He advises Fortune 500 executives, federal agencies, and the world's largest systems integrators on AI strategy, governance, and deployment.

G Grokipedia LinkedIn X Leadership Team

How to Run an LLM Locally

How to Run an LLM Locally, Summarized

How Do You Run an LLM Locally?

Offline AI for Education Therapy Services

Your download is ready.

What Do You Need to Run an LLM Locally? (Hardware Checklist)

1 Step 1: Pick a Tool — Ollama vs LM Studio vs llama.cpp

2 Step 2: Choose a Model That Fits Your Hardware

3 Step 3: Install the Tool and Download the Model

LM Studio (GUI)

Ollama (CLI)

llama.cpp (power users)

4 Step 4: Run and Chat (CLI or GUI)

5 Step 5: Add Your Own Documents (RAG Basics)

Troubleshooting & Performance Tips

The AI Strategy Blueprint

From DIY to Production: The Turnkey Team Option

Why Running an LLM Locally Is Worth It

Skill Up Your Team to Run, Evaluate & Govern Local AI

Decide What to Run Locally — and What to Stage

Take Local AI From Laptop to Production

More from The AI Strategy Blueprint

Local LLM: The Complete Guide

Private LLM for Enterprises

How to Deploy an LLM On-Premise

AirgapAI: Offline AI Assistant

Best Local AI Tools for Enterprise

AI Training & AI Academy

Frequently Asked Questions

John Byron Hanby IV

How to Run an
LLM Locally