How Do You Run an LLM Locally?
You run an LLM locally by installing a runtime (Ollama, LM Studio, or llama.cpp), downloading a quantized open-weight model that fits your memory, and then running that model directly on your CPU or GPU — with no cloud, no API key, and no internet after the initial download. The entire workflow is free and open source, and a capable laptop is enough to start.
The reason interest has exploded is simple: privacy plus capability. Self-hosting AI is now a mainstream choice rather than a niche one — in Stack Overflow's 2024 developer survey, 76% of developers reported using or planning to use AI tools, and a fast-growing share run models locally to avoid sending code and data to third-party APIs (Stack Overflow Developer Survey, 2024). Open-weight models have closed much of the quality gap with proprietary cloud models, so a 7B–14B model on your own machine is genuinely useful for chat, coding help, summarization, and document Q&A.
This is the individual / small-team how-to for running a model on one machine. If you need a production deployment serving many users across an organization — GPU servers, vLLM, autoscaling, and security review — follow How to Deploy an LLM On-Premise instead. For the broader concept and trade-offs, see the Local LLM guide and Private LLM for Enterprises.
Offline AI for Education Therapy Services
- 75% reduction in documentation time
- 2,800+ Quick Start Workflows
- 100% FERPA and HIPAA compliant
Instant download. We'll also email you a copy. No spam.
What Do You Need to Run an LLM Locally? (Hardware Checklist)
You need three things: a local LLM tool, a quantized model file, and enough RAM or VRAM to hold it. Memory is the single most important constraint. A practical rule of thumb is that a 4-bit quantized model uses roughly 0.6–0.7 GB of memory per billion parameters, so a 7B model fits in about 5–6 GB and a 13B in about 9–10 GB, with a few gigabytes of headroom for the operating system and the context window.
- RAM / VRAM: 16 GB total handles 7B–13B models well. 8 GB is enough for small (3B–7B) models. 24 GB+ of GPU VRAM or 32 GB+ of system RAM opens up 30B–70B models.
- GPU (optional but faster): an NVIDIA card with 8–24 GB VRAM gives the best speed; Apple Silicon (M-series) Macs are excellent because the GPU shares unified memory.
- CPU: any modern multi-core CPU works. Newer Intel Core Ultra and AMD chips include an NPU that accelerates on-device AI without a discrete GPU.
- Disk: 5–50 GB free per model. Quantized 7B files are ~4–6 GB each; larger models grow accordingly.
- OS: Windows, macOS, and Linux are all fully supported by the three tools below.
Quantization is what makes this feasible on consumer hardware. It compresses model weights from 16-bit to 4-bit (or 5/6/8-bit), cutting memory use by roughly 4x with only a small quality loss. Most local models you download are already quantized in the GGUF format that Ollama, LM Studio, and llama.cpp all read.
1 Step 1: Pick a Tool — Ollama vs LM Studio vs llama.cpp
Choose LM Studio if you want a one-click graphical app, Ollama if you want a clean command line with a built-in API, or llama.cpp if you want maximum control and the leanest possible footprint. All three are free, open source, run the same GGUF models, and work on Windows, macOS, and Linux. In fact, Ollama and LM Studio are both built on top of the llama.cpp engine — so picking is really about the interface you prefer.
| Tool | Interface | Best for | API for scripts | Learning curve |
|---|---|---|---|---|
| LM Studio | Polished desktop GUI | Beginners, non-coders, model browsing | Yes (OpenAI-compatible) | Lowest |
| Ollama | CLI + local server | Developers, automation, app integration | Yes (REST + OpenAI-compatible) | Low |
| llama.cpp | Command line / library | Power users, custom builds, embedded | Yes (server binary) | Higher |
All three are open-source projects: Ollama, LM Studio, llama.cpp.
Our recommendation for most readers: start with LM Studio if you have never run a model before — it has a built-in model catalog, automatic hardware detection, and a chat window. Move to Ollama the moment you want to script the model, plug it into an editor, or expose a local API. Reach for raw llama.cpp only when you need custom compilation flags or are embedding inference into another application.
2 Step 2: Choose a Model That Fits Your Hardware
Pick the largest open-weight model your memory can hold at 4-bit quantization — for most laptops that means a 7B–14B model such as Llama 3.x, Qwen 2.5, Gemma 2, or Mistral. Bigger is generally smarter, but only if it fits in memory without spilling to disk, which collapses speed. Match the model to your RAM first, then to the task.
| Your memory | Model size to run | Good open models | Typical use |
|---|---|---|---|
| 8 GB | 3B–7B (4-bit) | Llama 3.2 3B, Phi-3 mini, Gemma 2 2B | Quick chat, drafting, simple Q&A |
| 16 GB | 7B–14B (4-bit) | Llama 3.1 8B, Qwen 2.5 14B, Mistral 7B | General assistant, coding help, RAG |
| 32 GB | up to ~32B (4-bit) | Qwen 2.5 32B, Gemma 2 27B | Stronger reasoning, longer context |
| 64 GB+ / 24 GB GPU | 70B (4-bit) | Llama 3.x 70B, Qwen 2.5 72B | Near-frontier quality, fully local |
Open models have become remarkably capable. Meta has reported that its Llama family surpassed 1 billion downloads, underscoring how mature the open-weight ecosystem now is (Meta, 2025). For privacy-sensitive work, the practical upside is that a model living on your disk has no usage caps, no per-token billing, and no exposure of your prompts — the same open models (Llama, Gemma, Qwen, Mistral) that power local DIY setups also run inside supported products like AirgapAI.
3 Step 3: Install the Tool and Download the Model
Installation is a single download for each tool, and pulling a model is one command or one click. Everything below runs offline after the model file finishes downloading. Here is the fastest path for each tool.
LM Studio (GUI)
Download the installer from lmstudio.ai, open the app, and use the built-in search to find a model (for example "Llama 3.1 8B Instruct"). LM Studio recommends a quantization that fits your hardware, downloads it, and loads it — no terminal required.
Ollama (CLI)
Install Ollama, then run a single pull-and-chat command such as ollama run llama3.1.
Ollama downloads the model the first time and drops you straight into a chat prompt; the same
command later starts an instant local session.
llama.cpp (power users)
Clone and build llama.cpp, download a GGUF file from a model hub, and run the llama-cli
or llama-server binary pointed at the file. This gives you per-flag control over
threads, context length, and GPU offload layers.
The only step that needs the internet is the initial model download. After that you can disconnect entirely — pull your model files while online, then run them on a plane, in a SCIF, or on an air-gapped machine with zero connectivity.
4 Step 4: Run and Chat (CLI or GUI)
Once the model is loaded, you chat with it exactly like a cloud assistant — in LM Studio's chat window, in Ollama's terminal prompt, or through a local API on your own machine. The first response may take a moment as the model loads into memory; after that, replies stream token by token.
- GUI chat: in LM Studio, type into the chat box and adjust temperature, context length, and system prompt from the sidebar — no code needed.
-
CLI chat: with Ollama,
ollama run llama3.1opens an interactive prompt; type your message and press Enter to get a streamed reply. - Local API: both Ollama and LM Studio expose an OpenAI-compatible endpoint (typically on localhost), so existing apps and scripts can point at your local model by changing one base URL — no key, no cloud.
- Editor integration: developer tools can connect to that local endpoint for in-editor coding help. For a packaged, supported local coding assistant, see AirgapAI Code.
Expect roughly 5–15 tokens per second for a 7B model on a recent laptop CPU, and 40–100+ tokens per second on a dedicated GPU. If responses feel slow, that is almost always a sign the model is too large for your memory — drop to a smaller model or a more aggressive quantization.
5 Step 5: Add Your Own Documents (RAG Basics)
To make a local LLM answer from your own files, you use retrieval-augmented generation (RAG): your documents are split into chunks, converted to embeddings, stored in a local vector index, and the most relevant pieces are fed to the model with each question. This keeps everything offline while letting the model cite your PDFs, notes, and internal docs.
- Easiest path: LM Studio and front-ends like Open WebUI or AnythingLLM let you drag in documents and chat over them with a local model — no coding.
- Developer path: pair Ollama with a local vector database and an embedding model to build a custom RAG pipeline you fully control.
- The accuracy lever: RAG quality lives or dies on how cleanly your source text is prepared. Messy, duplicated, or poorly chunked documents cause hallucinations.
This is where data optimization matters most. Iternal's Blockify restructures raw documents into compact, deduplicated IdeaBlocks before they reach the vector database — an approach that delivers roughly 78X more accurate retrieval using about 3X fewer tokens, and works with any vector store. For local RAG that needs to be trustworthy, cleaning the data first is the highest-leverage step you can take.
Troubleshooting & Performance Tips
Most local-LLM problems trace back to one cause: the model is too big for your available memory. The fixes below resolve the overwhelming majority of slow, crashing, or out-of-memory sessions.
- Replies are very slow: the model is spilling out of RAM/VRAM. Use a smaller model or a lower-bit quantization (try 4-bit, or drop from 13B to 7B).
- Out-of-memory / crash on load: reduce the context length, close other apps, or pick a smaller quant. On GPUs, lower the number of offloaded layers.
- GPU not being used: confirm GPU offload is enabled in LM Studio's settings or set the GPU-layers flag in Ollama/llama.cpp; verify your drivers (CUDA/Metal) are current.
- Weak answers: raise the quantization (8-bit over 4-bit if memory allows), choose a larger or more recent model, or improve your system prompt and RAG document quality.
- Short, cut-off responses: increase the maximum output tokens and the context window in your tool's settings.