Running LLMs Locally in 2026: A Step-by-Step Setup Guide for Ollama, llama.cpp, and vLLM
A hands-on guide to running LLMs locally in 2026: install Ollama, verify the API, then build llama.cpp and serve with vLLM, with the VRAM and bandwidth math behind each step.

Running a genuinely capable model on hardware you own is no longer a compromise. In 2026 you can run Google's multimodal Gemma 4 12B on a 16GB laptop, Qwen3.6-27B (a dense coding flagship that outscores last generation's much larger models) on a single consumer card, and a 70B-class model on a Mac with enough unified memory. This guide walks through three concrete setups, in order of increasing control: a five-minute Ollama install, a llama.cpp build for when you need to tune exactly how a model fits, and a vLLM server for when the model needs to back an application. By the end of each section you will have a model loaded, a request answered, and a way to verify it actually worked, not just installed.
If you understand the open-weights tradeoff and have read how quantization shrinks a model, this is the deployment half of the same story.
Before you start: the two numbers that decide everything
Local inference is governed by VRAM capacity and memory bandwidth. Capacity is whether the model fits. Bandwidth is how fast it generates. You need both numbers before picking a model, so check yours now.
Find your VRAM. On a machine with an NVIDIA GPU:
nvidia-smi --query-gpu=name,memory.total --format=csv
name, memory.total [MiB]
NVIDIA GeForce RTX 4090, 24576 MiB
On a Mac, unified memory is your "VRAM": check Apple menu → About This Mac, or run sysctl hw.memsize and divide by 1e9 for GB.
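The Mac arithmetic can be wrapped in a small shell helper. This is only a sketch: bytes_to_gb is a hypothetical name rather than a standard tool, and the sysctl call shown in the comment applies to macOS only.

```shell
# Convert a byte count to decimal gigabytes (1 GB = 1e9 bytes).
bytes_to_gb() {
  awk -v bytes="$1" 'BEGIN { printf "%.1f\n", bytes / 1e9 }'
}

# On macOS: bytes_to_gb "$(sysctl -n hw.memsize)"
bytes_to_gb 137438953472   # a 128 GiB machine reports this many bytes: prints 137.4
```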
Why bandwidth matters. Generating one token is memory-bandwidth bound: the GPU reads the model's weights out of memory to produce each token, and the arithmetic is trivial next to the data movement. So the speed ceiling is roughly:
tokens/sec ≈ memory_bandwidth / bytes_read_per_token
For a dense model, bytes-read-per-token is close to the model's size in memory, because you stream essentially all the weights for every token. A 16GB model on a 1.79 TB/s RTX 5090 has a ceiling near 110 tok/s before efficiency losses; the same model on an Apple M4 Max at 546 GB/s tops out around a third of that. Neither machine is more or less "intelligent": both read the same weights, but the 5090 moves them about three times faster.
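The ceiling is easy to compute for your own hardware. A minimal sketch, assuming bandwidth in GB/s and in-memory model size in GB; decode_ceiling is a hypothetical helper name, and the result is an upper bound before efficiency losses, not a benchmark.

```shell
# Decode-speed ceiling in tokens/sec: memory bandwidth / bytes read per token.
decode_ceiling() {
  awk -v bandwidth="$1" -v size="$2" 'BEGIN { printf "%.0f\n", bandwidth / size }'
}

decode_ceiling 1790 16   # RTX 5090, 16GB dense model: prints 112
decode_ceiling 546 16    # M4 Max, same model: prints 34
```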
Mixture-of-experts models break the dense assumption in your favor. Qwen3.6-35B-A3B has 35B total parameters but activates only about 3B per token, so while you must hold all the experts in memory (a capacity cost), each token reads only the active subset (a bandwidth saving). That is why it generates roughly three to five times faster than the dense Qwen3.6-27B on identical hardware despite being larger overall.
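The capacity-versus-bandwidth split can be put in numbers. The sketch below assumes the bytes read per token scale with the active share of parameters; read_per_token_gb is a hypothetical helper, and the weight sizes are the approximate Q4 figures used in this guide. It ignores KV cache reads and kernel overhead, which is one reason measured speedups (three to five times in the paragraph above) are smaller than the raw ratio suggests.

```shell
# GB read per token: resident weight size scaled by the active share of parameters.
read_per_token_gb() {
  awk -v weights="$1" -v total="$2" -v active="$3" \
    'BEGIN { printf "%.1f\n", weights * active / total }'
}

read_per_token_gb 21 35 3     # Qwen3.6-35B-A3B, ~21 GB resident: prints 1.8
read_per_token_gb 16 27 27    # dense Qwen3.6-27B, ~16 GB resident: prints 16.0
```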
Will your model fit? At 4-bit, the common local sweet spot:
| Model | Total params | ~Q4 weight size | Comfortable on |
|---|---|---|---|
| Gemma 4 12B (dense, multimodal) | 12B | ~7-8 GB | a 16GB laptop |
| gpt-oss-20b (MoE, MXFP4) | 21B / ~3.6B active | ~12-13 GB | a 16GB card |
| Qwen3.6-27B (dense) | 27B | ~15-17 GB | a 16-24GB card |
| Qwen3.6-35B-A3B (MoE) | 35B / ~3B active | ~20-22 GB | a 24GB card |
| 70B-class (dense) | 70B | ~40 GB | dual 24GB, a 48GB card, or a big Mac |
| gpt-oss-120b (MoE, MXFP4) | ~120B | ~60 GB | one 80GB GPU or a 128GB Mac |
The GGUF file size is a floor, not the full requirement: add KV cache for context length and runtime overhead, and leave a few GB of headroom. If your card is at the bottom edge of a row, drop to the row above.
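That headroom rule can be written as a one-line check. This is a rough sketch: fits_in_vram is a hypothetical helper, and the KV cache and headroom figures passed in below are illustrative assumptions, not measurements.

```shell
# Rough fit check: weights + KV cache + headroom must not exceed VRAM (all in GB).
fits_in_vram() {
  awk -v vram="$1" -v weights="$2" -v kv="$3" -v spare="$4" 'BEGIN {
    need = weights + kv + spare
    print (need <= vram ? "fits" : "too tight")
  }'
}

fits_in_vram 16 8 1 2    # Gemma 4 12B on a 16GB laptop: prints fits
fits_in_vram 16 17 1 2   # Qwen3.6-27B on a 16GB card: prints too tight
```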
The rest of this guide has you set up Gemma 4 12B if you have 16GB or less, or Qwen3.6-27B if you have 24GB+. Substitute any model from the table; the commands are identical.
Path 1: Ollama, the five-minute setup
Ollama wraps llama.cpp behind a one-command interface. Start here if you want a model running in the next five minutes.
Step 1: Install
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS via Homebrew (alternative)
brew install ollama
On Windows, download the installer from ollama.com and run it. Verify the install before moving on:
ollama --version
ollama version is 0.x.x
If this fails, the install script did not put ollama on your PATH; restart your terminal or check the install log for errors before continuing.
Step 2: Pull and run a model
ollama run gemma4:12b
The first run downloads the weights (several GB, so this takes a few minutes depending on your connection), then drops you into an interactive chat:
>>> Send a message (/? for help)
Type a prompt and confirm you get a response:
>>> What is the capital of France?
Paris.
If you see a response, the model is loaded and running entirely on your hardware. Exit with /bye.
Step 3: Verify the API is reachable
Ollama also exposes an OpenAI-compatible endpoint at http://localhost:11434, automatically, with no extra config. Confirm it from a second terminal while a model is loaded:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:12b",
"messages": [{"role": "user", "content": "Reply with one word: confirmed."}]
}'
{
"id": "chatcmpl-...",
"choices": [{"message": {"role": "assistant", "content": "Confirmed."}}],
"model": "gemma4:12b"
}
A JSON response with your model's name confirms the server is up and any OpenAI-client application code can point at http://localhost:11434/v1 with no other setup. For most single-machine use, this is the whole tutorial: stop here unless you need finer control over quantization, GPU offload, or concurrent-request throughput.
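Because every server in this guide speaks the same chat-completions format, the check can be wrapped once and reused. The sketch below uses two hypothetical helper names, chat_payload and ask. The escaping covers only backslashes and double quotes, so prompts with newlines or control characters would need a real JSON encoder.

```shell
# Build a minimal chat-completions request body.
# Escapes backslashes and double quotes only (an assumption that prompts are single-line text).
chat_payload() {
  model=$1
  prompt=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}\n' "$model" "$prompt"
}

# Send one prompt to an OpenAI-compatible server and print the raw JSON reply.
# Base URLs used in this guide: http://localhost:11434 (Ollama),
# http://localhost:8080 (llama-server), http://localhost:8000 (vLLM).
ask() {
  curl -s "$1/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$2" "$3")"
}

# Usage (needs a running server): ask http://localhost:11434 gemma4:12b 'Reply with one word: confirmed.'
```

Keeping the payload builder separate from the network call means the request body can be inspected without a server running.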
Quick troubleshooting: if ollama run hangs at "pulling manifest," check your network; if the chat loads but is extremely slow, run ollama ps in a second terminal to confirm the model shows 100% GPU under the processor column, not split with CPU.
Path 2: llama.cpp, for exact control over fit
Reach for llama.cpp directly when you want to choose the exact quantization, offload only some layers to a smaller GPU, or run partly on CPU. This path builds from source so you get the latest CUDA or Metal kernels.
Step 1: Build it
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# NVIDIA GPU build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Apple Silicon build (Metal is on by default)
cmake -B build
cmake --build build --config Release -j
Verify the binary built and can see your GPU:
./build/bin/llama-server --version
version: b... (commit ...)
Step 2: Pull a specific quant and serve it
llama.cpp can pull a GGUF straight from Hugging Face and serve an OpenAI-compatible API:
./build/bin/llama-server \
-hf Qwen/Qwen3.6-27B-GGUF:Q4_K_M \
--port 8080 \
-ngl 99 \
-c 8192
llama_model_loader: loaded meta data with ... key-value pairs
...
main: server is listening on http://127.0.0.1:8080
The two flags that matter most are -ngl and -c. -ngl (number of GPU layers) controls how much of the model lives on the GPU versus CPU. On a 16GB card running a model slightly too large, offload as many layers as fit (for example -ngl 30) and the rest run on CPU at a speed penalty; check the startup log for the line reporting how many layers it actually placed on GPU.
-c sets the context length, which sizes the KV cache: as the effective-context story shows, the usable context is far shorter than the advertised window, so setting -c 8192 instead of the model's max costs you nothing real and reclaims gigabytes.
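A starting value for -ngl can be estimated before trial and error. This is a rough sketch under stated assumptions: gpu_layers is a hypothetical helper, it treats all layers as equal in size, and the layer count of 64 is illustrative rather than taken from any model card. The startup log remains the authority on what actually landed on the GPU.

```shell
# Estimate how many layers fit on the GPU.
# Args: VRAM (GB), GB reserved for KV cache and overhead, weight size (GB), layer count.
gpu_layers() {
  awk -v vram="$1" -v reserve="$2" -v weights="$3" -v layers="$4" 'BEGIN {
    fit = int((vram - reserve) / (weights / layers))
    if (fit > layers) fit = layers
    if (fit < 0) fit = 0
    print fit
  }'
}

gpu_layers 16 3 17 64   # 17GB model on a 16GB card: prints 48
gpu_layers 24 3 17 64   # same model on a 24GB card: prints 64 (everything fits)
```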
Step 3: Verify with a request
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Reply with one word: confirmed."}]}'
{"choices": [{"message": {"role": "assistant", "content": "Confirmed."}}]}
If the response comes back slowly (multiple seconds for a one-word answer on a model that should fit your GPU), check the server's startup log for how many layers landed on CPU versus GPU; a model split across both runs its CPU-resident layers at system-RAM bandwidth, an order of magnitude slower than VRAM.
Step 4: Tighten memory if context blew it up
If a long prompt causes an out-of-memory error after the model loaded fine on short prompts, quantize the KV cache instead of lowering -c further:
./build/bin/llama-server \
-hf Qwen/Qwen3.6-27B-GGUF:Q4_K_M \
--port 8080 -ngl 99 -c 32768 \
--cache-type-k q8_0 --cache-type-v q8_0
This roughly halves the KV cache's memory footprint with little quality cost.
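The saving follows from the KV cache formula: two tensors (K and V) per layer, each holding KV heads × head dimension values per token of context. The sketch below is an estimate under stated assumptions. kv_cache_gib is a hypothetical helper. The architecture numbers (64 layers, 8 KV heads, head dimension 128) are illustrative, so read the real ones from your model's config. The 1.0625 bytes per value for q8_0 assumes its block layout of 32 values stored in 34 bytes.

```shell
# KV cache size in GiB.
# Args: layers, KV heads, head dimension, context length, bytes per stored value.
kv_cache_gib() {
  awk -v layers="$1" -v heads="$2" -v dim="$3" -v ctx="$4" -v bytes="$5" \
    'BEGIN { printf "%.2f\n", 2 * layers * heads * dim * ctx * bytes / 1073741824 }'
}

kv_cache_gib 64 8 128 32768 2        # f16 cache at 32K context: prints 8.00
kv_cache_gib 64 8 128 32768 1.0625   # q8_0 cache, same context: prints 4.25
```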
Path 3: vLLM, for serving an application
Use this path when the model backs a service handling concurrent requests, not a single terminal session. vLLM's continuous batching keeps the GPU busy across many simultaneous users, where Ollama and llama.cpp serialize.
Step 1: Install
python -m venv vllm-env
source vllm-env/bin/activate
pip install "vllm>=0.10.1"
Step 2: Serve a model
vllm serve Qwen/Qwen3.6-27B --max-model-len 16384
INFO ... Starting vLLM API server on http://0.0.0.0:8000
INFO ... Application startup complete.
vLLM auto-detects common quantized checkpoints (AWQ, FP8, MXFP4); no extra flag needed for most pre-quantized models on Hugging Face.
Step 3: Verify with a concurrent load test
A single request looks identical to the other two paths:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3.6-27B", "messages": [{"role": "user", "content": "Reply with one word: confirmed."}]}'
The point of vLLM is what happens under concurrency, which is also the fastest way to confirm it is actually serving rather than just running. Fire 10 requests at once and confirm they all return in roughly the time of one:
for i in $(seq 1 10); do
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3.6-27B", "messages": [{"role": "user", "content": "Say ok."}]}' \
-o /dev/null -w "%{time_total}s\n" &
done; wait
If the ten times are all close to a single request's latency rather than ten times longer, continuous batching is working. With Ollama or llama.cpp, the same test would show times stacking up roughly linearly because those servers process one request at a time per model instance.
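Reading ten timings by eye is error-prone, so the comparison can be reduced to one number. A minimal sketch: batching_ratio is a hypothetical helper that reads timings in the "0.52s" format printed by the loop above, and the sample timings below are made up for illustration, not measured.

```shell
# Read curl timings such as "0.52s" (one per line) on stdin and print the
# slowest one as a multiple of a single request's latency (first argument, seconds).
batching_ratio() {
  awk -v single="$1" '
    { sub(/s$/, ""); if ($1 + 0 > max) max = $1 + 0 }
    END { printf "%.1f\n", max / single }
  '
}

printf '0.50s\n0.55s\n0.60s\n' | batching_ratio 0.5   # batched: prints 1.2
printf '0.5s\n2.5s\n5.0s\n' | batching_ratio 0.5      # stacked up: prints 10.0
```

A ratio near 1 means the requests were served together; a ratio near the number of requests means they queued.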
Choosing a quantization
Quantization is what makes local viable, and the choice follows directly from the iso-memory principle in the quantization deep dive: a bigger model at 4-bit beats a smaller model at 8-bit for the same memory.
- Q4_K_M is the default sweet spot: roughly a quarter of the full-precision size with minimal quality loss for most tasks. Used in every command above. Start here.
- Q5_K_M or Q6_K if you have spare VRAM and want to close the small remaining gap, at proportionally more memory. Swap the tag in the -hf flag, for example Qwen/Qwen3.6-27B-GGUF:Q6_K.
- Q8_0 is near-lossless but doubles the memory of Q4, which usually means dropping to a smaller model, the wrong trade.
- MXFP4 is how gpt-oss ships natively, and current llama.cpp and Ollama have hardware-accelerated paths for it on Blackwell-class GPUs.
The mistake to avoid is spending memory on precision instead of parameters. If you can run a 30B at Q4 or a 14B at Q8 in the same VRAM, the 30B at Q4 is almost always the better model.
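The trade can be checked with back-of-envelope arithmetic: weight size is roughly parameters times bits per weight, divided by 8. In the sketch below, quant_gb is a hypothetical helper, and the bits-per-weight figures (about 4.8 for Q4_K_M, 8.5 for Q8_0) are approximations that vary slightly by model; the file size on Hugging Face is the real number.

```shell
# Approximate weight size in GB: parameters (billions) x bits per weight / 8.
quant_gb() {
  awk -v params="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", params * bpw / 8 }'
}

quant_gb 27 4.8   # 27B at roughly Q4_K_M: prints 16.2
quant_gb 70 4.8   # 70B at roughly Q4_K_M: prints 42.0
quant_gb 30 4.8   # 30B at roughly Q4_K_M: prints 18.0
quant_gb 14 8.5   # 14B at Q8_0, for comparison with the line above
```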
Gotchas and troubleshooting
Run through this list before assuming the setup itself failed; the four problems below cover nearly every local-inference complaint.
| Symptom | Cause | Fix |
|---|---|---|
| Loads fine, OOMs only on long prompts | KV cache grew with context and competed with weights for memory | Lower -c / --max-model-len, or quantize the cache (--cache-type-k q8_0 --cache-type-v q8_0 in llama.cpp) |
| Painfully slow generation despite "fitting" | Model spilled partly to CPU or shared memory | Check -ngl output (llama.cpp) or ollama ps (Ollama) for the GPU/CPU split; shrink the model or quant until it fully fits |
| Slow first token, fast after | Prompt processing (prefill) is compute-bound and scales with prompt length; separate from decode speed | Shorten the prompt, or enable flash attention (-fa in llama.cpp, on by default in vLLM) |
| Speed drops over a long session | Sustained inference heats a consumer GPU or laptop, throttling the clock | Improve cooling, or accept the laptop's sustained (not burst) tok/s as the real number |
The real hardware bill
The three realistic local platforms make very different tradeoffs, and the right choice depends on whether you are optimizing for speed, capacity, or cost.
| Platform | Memory | Bandwidth | Speed character | Best for |
|---|---|---|---|---|
| RTX 5090 | 32 GB | ~1.79 TB/s | Fastest tok/s (≈186 on an 8B at Q4; less on bigger models) | Speed on models up to ~30B, tight 70B |
| RTX 4090 | 24 GB | ~1 TB/s | Fast, ~30% behind the 5090 | Cheaper entry; models up to ~30B |
| Apple M4 Max | up to 128 GB | ~546 GB/s | Slower per token (bandwidth ceiling ≈14 tok/s on a 70B Q4) | Big models that no consumer GPU fits |
| Rented cloud GPU | any | any | Whatever you pay for | Bursty or occasional use, no capex |
The instructive contrast is the RTX 5090 against the M4 Max. The 5090 has triple the bandwidth, so it generates tokens far faster, but its 32GB caps the model size: a 70B at Q4 is about 40GB and does not fit on one card. The M4 Max, with up to 128GB of unified memory, runs models a single GPU cannot hold, but its lower bandwidth sets a hard ceiling: 546 GB/s over a 40GB model is roughly 14 tokens per second, against well over 100 for a 5090 on a model that fits.
Cloud rental is the honest baseline to always compare against. If you run inference a few hours a week, renting a GPU by the hour can be cheaper than buying one, and you skip the depreciation. Local hardware pays off when you run it heavily, when your data cannot leave the building, or when you need offline operation.
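The rent-or-buy comparison reduces to a break-even point in hours of use. The sketch below is deliberately crude: break_even_hours is a hypothetical helper, the prices are placeholder figures rather than quotes, and it ignores electricity, resale value, and depreciation.

```shell
# Hours of rented GPU time that cost as much as buying the card outright.
break_even_hours() {
  awk -v price="$1" -v hourly="$2" 'BEGIN { printf "%.0f\n", price / hourly }'
}

break_even_hours 2000 0.5   # placeholder: $2000 card vs $0.50/hour rental: prints 4000
```

At a few hours a week, a break-even in the thousands of hours is many years away, which is the case for renting; heavy daily use reaches it far sooner.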
Local or API? The honest split
Running locally is not automatically the right call, which is the same open-weights pragmatism that runs through this whole site. Choose local when ownership actually matters:
- Privacy and compliance: data that cannot go to a third-party API.
- Cost at steady volume: high, predictable throughput where per-token API pricing adds up past the hardware cost.
- Control and offline: no rate limits, no model deprecation under you, no network dependency.
Choose an API when the model is a component of a product whose value is the experience, not the weights: when you need frontier-tier quality that open models do not yet match, when you cannot justify the ops, or when traffic is bursty enough that you would rather rent capacity than own it.
Key Takeaways
- Three concrete paths, in order of control. Ollama for a five-minute single-user setup, llama.cpp when you need to tune exact GPU offload and quantization, vLLM when the model serves an application under concurrent load.
- Verify each step with a real request, not just a successful install. A curl to the OpenAI-compatible endpoint that returns the expected JSON is the only proof the server actually works.
- Two numbers govern local inference. VRAM capacity decides whether a model fits; memory bandwidth decides how fast it runs. Most "why is it slow" problems are bandwidth, not capacity.
- MoE models save bandwidth, not capacity. Qwen3.6-35B-A3B holds 35B parameters in memory but reads only ~3B per token, so it generates 3 to 5x faster than the dense Qwen3.6-27B while needing more VRAM to hold all the experts.
- Q4_K_M is the default quant across every path in this guide: roughly a quarter the size with minimal quality loss. Spend spare memory on more parameters before more precision.
- vLLM's advantage shows up under concurrency, not in a single request. The 10-parallel-request test is the fastest way to confirm continuous batching is actually working.
- The 5090 is the speed play, the Mac is the capacity play. A 32GB RTX 5090 generates fastest but caps model size; a 128GB M4 Max runs 70B-class models a consumer GPU cannot fit, at lower tokens per second.
- Local is a system decision, not a default. Run local for privacy, steady-volume cost, and control; use an API for frontier quality, low ops, and bursty traffic.