Running LLMs Locally in 2026: A Step-by-Step Setup Guide for Ollama, llama.cpp, and vLLM

A hands-on guide to running LLMs locally in 2026: install Ollama, verify the API, then build llama.cpp and serve with vLLM, with the VRAM and bandwidth math behind each step.

Beginner · 45 min

Prerequisites

A GPU with 16GB+ VRAM or an Apple Silicon Mac
A terminal
Command-line basics
Python 3.10+ (only needed for the vLLM section)

Roei Z · Published Jun 27, 2026

Diagram of VRAM capacity holding a model plus KV cache, and memory bandwidth as data flow, feeding a model that runs locally on your hardware

Running a genuinely capable model on hardware you own is no longer a compromise. In 2026 you can run Google's multimodal Gemma 4 12B on a 16GB laptop, Qwen3.6-27B (a dense coding flagship that outscores last generation's much larger models) on a single consumer card, and a 70B-class model on a Mac with enough unified memory. This guide walks through three concrete setups, in order of increasing control: a five-minute Ollama install, a llama.cpp build for when you need to tune exactly how a model fits, and a vLLM server for when the model needs to back an application. By the end of each section you will have a model loaded, a request answered, and a way to verify it actually worked, not just installed.

If you understand the open-weights tradeoff and have read how quantization shrinks a model, this is the deployment half of the same story.

Before you start: the two numbers that decide everything

Local inference is governed by VRAM capacity and memory bandwidth. Capacity is whether the model fits. Bandwidth is how fast it generates. You need both numbers before picking a model, so check yours now.

Find your VRAM. On a machine with an NVIDIA GPU:

bash

nvidia-smi --query-gpu=name,memory.total --format=csv

name, memory.total [MiB]
NVIDIA GeForce RTX 4090, 24576 MiB

On a Mac, unified memory is your "VRAM": check Apple menu → About This Mac, or run sysctl hw.memsize and divide by 1e9 for GB.

Why bandwidth matters. Generating one token is memory-bandwidth bound: the GPU reads the model's weights out of memory to produce each token, and the arithmetic is trivial next to the data movement. So the speed ceiling is roughly:

tokens/sec  ≈  memory_bandwidth  /  bytes_read_per_token

For a dense model, bytes-read-per-token is close to the model's size in memory, because you stream essentially all the weights for every token. A 16GB model on a 1.79 TB/s RTX 5090 has a ceiling near 110 tok/s before efficiency losses; the same model on an Apple M4 Max at 546 GB/s tops out around a third of that. Neither machine makes the model more or less "intelligent": both read the same weights, but the RTX 5090 moves them about three times faster.
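The ceiling is simple enough to compute for your own hardware. Here is a minimal sketch using the bandwidth and model-size figures quoted above; the function name is illustrative, and the result is an upper bound before efficiency losses, not a benchmark.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, bytes_read_gb: float) -> float:
    """Upper bound on decode speed: memory bandwidth / bytes read per token."""
    return bandwidth_gb_s / bytes_read_gb


# Figures from the text: one 16 GB dense model on two machines.
rtx_5090 = decode_ceiling_tok_s(1790, 16)  # 1.79 TB/s
m4_max = decode_ceiling_tok_s(546, 16)     # 546 GB/s

print(f"RTX 5090 ceiling: {rtx_5090:.0f} tok/s")
print(f"M4 Max ceiling:   {m4_max:.0f} tok/s")
print(f"Ratio:            {rtx_5090 / m4_max:.1f}x")
```

Swap in your own bandwidth and the size of the model file you plan to load; measured speed will land somewhere below the number this prints.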

Mixture-of-experts models break the dense assumption in your favor. Qwen3.6-35B-A3B has 35B total parameters but activates only about 3B per token, so while you must hold all the experts in memory (a capacity cost), each token reads only the active subset (a bandwidth saving). That is why it generates roughly three to five times faster than the dense Qwen3.6-27B on identical hardware despite being larger overall.
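To make the capacity-versus-bandwidth split concrete, the sketch below separates what a model holds in memory from what it reads per token. The weight sizes are midpoints of the Q4 ranges in the table that follows, and scaling bytes read by the active-parameter fraction is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    total_params_b: float   # billions of parameters held in memory
    active_params_b: float  # billions of parameters read per token
    weights_gb: float       # approximate Q4 size (table midpoint)

    @property
    def read_per_token_gb(self) -> float:
        """Assumes bytes read scale with the active-parameter fraction."""
        return self.weights_gb * self.active_params_b / self.total_params_b


dense = Model("Qwen3.6-27B", 27, 27, 16.0)
moe = Model("Qwen3.6-35B-A3B", 35, 3, 21.0)

for m in (dense, moe):
    print(f"{m.name}: holds {m.weights_gb:.0f} GB, "
          f"reads ~{m.read_per_token_gb:.1f} GB per token")
```

The raw ratio of bytes read comes out larger than the three-to-five-times speedup quoted above. Treat it as a ceiling: the sketch does not model KV-cache reads, expert routing, or kernel overheads.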

Will your model fit? At 4-bit, the common local sweet spot:

Model	Total params	~Q4 weight size	Comfortable on
Gemma 4 12B (dense, multimodal)	12B	~7-8 GB	a 16GB laptop
gpt-oss-20b (MoE, MXFP4)	21B / ~3.6B active	~12-13 GB	a 16GB card
Qwen3.6-27B (dense)	27B	~15-17 GB	a 16-24GB card
Qwen3.6-35B-A3B (MoE)	35B / ~3B active	~20-22 GB	a 24GB card
70B-class (dense)	70B	~40 GB	dual 24GB, a 48GB card, or a big Mac
gpt-oss-120b (MoE, MXFP4)	~120B	~60 GB	one 80GB GPU or a 128GB Mac

The GGUF file size is a floor, not the full requirement: add KV cache for context length and runtime overhead, and leave a few GB of headroom. If your card is at the bottom edge of a row, drop to the row above.
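The "floor plus extras" rule can be written as a quick check. This is a rough sketch: the weight sizes are midpoints from the table, while the KV cache, overhead, and headroom figures are placeholder assumptions to replace with your own.

```python
def fits(vram_gb: float, weights_gb: float, kv_cache_gb: float,
         overhead_gb: float = 1.0, headroom_gb: float = 2.0) -> bool:
    """True if weights + KV cache + runtime overhead still leave the headroom."""
    needed = weights_gb + kv_cache_gb + overhead_gb + headroom_gb
    return needed <= vram_gb


# KV cache and overhead values below are illustrative, not measured.
print(fits(16, weights_gb=7.5, kv_cache_gb=1.5))   # Gemma 4 12B, 16GB machine
print(fits(16, weights_gb=16.0, kv_cache_gb=2.0))  # Qwen3.6-27B, 16GB card
print(fits(24, weights_gb=16.0, kv_cache_gb=2.0))  # Qwen3.6-27B, 24GB card
```

The middle case is the "bottom edge of a row" situation: the file alone matches the card, but nothing is left for context.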

The rest of this guide has you set up Gemma 4 12B if you have 16GB or less, or Qwen3.6-27B if you have 24GB+. Substitute any model from the table; the commands are identical.

Path 1: Ollama, the five-minute setup

Ollama wraps llama.cpp behind a one-command interface. Start here if you want a model running in the next five minutes.

Step 1: Install

bash

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS via Homebrew (alternative)
brew install ollama

On Windows, download the installer from ollama.com and run it. Verify the install before moving on:

bash

ollama --version

ollama version is 0.x.x

If this fails, the install script did not put ollama on your PATH; restart your terminal or check the install log for errors before continuing.

Step 2: Pull and run a model

bash

ollama run gemma4:12b

The first run downloads the weights (several GB, so this takes a few minutes depending on your connection), then drops you into an interactive chat:

>>> Send a message (/? for help)

Type a prompt and confirm you get a response:

>>> What is the capital of France?
Paris.

If you see a response, the model is loaded and running entirely on your hardware. Exit with /bye.

Step 3: Verify the API is reachable

Ollama also exposes an OpenAI-compatible API under http://localhost:11434/v1, automatically and with no extra config. Confirm it from a second terminal while a model is loaded:

bash

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:12b",
    "messages": [{"role": "user", "content": "Reply with one word: confirmed."}]
  }'

json

{
  "id": "chatcmpl-...",
  "choices": [{"message": {"role": "assistant", "content": "Confirmed."}}],
  "model": "gemma4:12b"
}

A JSON response with your model's name confirms the server is up and any OpenAI-client application code can point at http://localhost:11434/v1 with no other setup. For most single-machine use, this is the whole tutorial: stop here unless you need finer control over quantization, GPU offload, or concurrent-request throughput.
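For application code, the same request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names `build_request` and `ask` are illustrative, and it assumes the server is running on the default port.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible base URL


def build_request(prompt: str, model: str = "gemma4:12b") -> urllib.request.Request:
    """Build the same POST that the curl command sends."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "gemma4:12b", timeout: float = 120.0) -> str:
    """Send one chat request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With a model loaded, `print(ask("Reply with one word: confirmed."))` should print the reply. A connection error means the server is not listening on that port.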

Quick troubleshooting: if ollama run hangs at "pulling manifest," check your network; if the chat loads but is extremely slow, run ollama ps in a second terminal to confirm the model shows 100% GPU under the processor column, not split with CPU.

Path 2: llama.cpp, for exact control over fit

Reach for llama.cpp directly when you want to choose the exact quantization, offload only some layers to a smaller GPU, or run partly on CPU. This path builds from source so you get the latest CUDA or Metal kernels.

Step 1: Build it

bash

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# NVIDIA GPU build (run either this pair or the Apple Silicon pair, not both)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Apple Silicon build (Metal is on by default)
cmake -B build
cmake --build build --config Release -j

Verify the binary built; whether it can use your GPU shows up in the startup log in the next step:

bash

./build/bin/llama-server --version

version: b...  (commit ...)

Step 2: Pull a specific quant and serve it

llama.cpp can pull a GGUF straight from Hugging Face and serve an OpenAI-compatible API:

bash

./build/bin/llama-server \
  -hf Qwen/Qwen3.6-27B-GGUF:Q4_K_M \
  --port 8080 \
  -ngl 99 \
  -c 8192

llama_model_loader: loaded meta data with ... key-value pairs
...
main: server is listening on http://127.0.0.1:8080

The two flags that matter most are -ngl and -c.

-ngl (number of GPU layers) controls how much of the model lives on the GPU versus CPU. On a 16GB card running a model slightly too large, offload as many layers as fit (for example -ngl 30) and the rest run on CPU at a speed penalty; check the startup log for the line reporting how many layers it actually placed on GPU.

-c sets the context length, which sizes the KV cache. As the effective-context story shows, the usable context is far shorter than the advertised window, so setting -c 8192 instead of the model's maximum costs little in practice and reclaims gigabytes.
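To see why the context length reclaims gigabytes, here is a sketch of the standard KV-cache size formula: keys and values stored for every layer, KV head, and position. The architecture numbers are hypothetical placeholders, not the real Qwen3.6-27B configuration; read the actual values from your model's metadata.

```python
def kv_cache_bytes(context_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: float = 2.0) -> float:
    """K and V tensors for every layer, KV head, and position (f16 = 2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Hypothetical architecture for illustration only.
ARCH = dict(n_layers=64, n_kv_heads=8, head_dim=128)
GIB = 1024 ** 3

for ctx in (8192, 32768, 131072):
    print(f"-c {ctx:>6}: {kv_cache_bytes(ctx, **ARCH) / GIB:.1f} GiB")
```

The cache grows linearly with the value passed to -c, so under these assumptions each doubling of context doubles the memory reserved for it.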

Step 3: Verify with a request

bash

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Reply with one word: confirmed."}]}'

json

{"choices": [{"message": {"role": "assistant", "content": "Confirmed."}}]}

If the response comes back slowly (multiple seconds for a one-word answer on a model that should fit your GPU), check the server's startup log for how many layers landed on CPU versus GPU; a model split across both runs its CPU-resident layers at system-RAM bandwidth, an order of magnitude slower than VRAM.
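The cost of a split is easy to underestimate, so here is a rough model of it: each token reads the GPU-resident weights at VRAM bandwidth and the CPU-resident weights at system-RAM bandwidth, one after the other. The 80 GB/s system-RAM figure is an assumed placeholder, and the model ignores transfer and compute overheads.

```python
def split_tok_s(model_gb: float, gpu_fraction: float,
                gpu_bw_gb_s: float, cpu_bw_gb_s: float) -> float:
    """Decode ceiling when a fraction of the weights sits in VRAM and the rest in RAM."""
    gpu_gb = model_gb * gpu_fraction
    cpu_gb = model_gb - gpu_gb
    seconds_per_token = gpu_gb / gpu_bw_gb_s + cpu_gb / cpu_bw_gb_s
    return 1.0 / seconds_per_token


# 16 GB model, ~1 TB/s GPU (the RTX 4090 figure), assumed 80 GB/s system RAM.
for fraction in (1.0, 0.75, 0.5):
    speed = split_tok_s(16, fraction, gpu_bw_gb_s=1000, cpu_bw_gb_s=80)
    print(f"{fraction:.0%} on GPU: ~{speed:.1f} tok/s ceiling")
```

Under these assumptions, moving a quarter of the weights to CPU cuts the ceiling to roughly a quarter. That is why a model that "almost fits" can feel far slower than one that fully fits.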

Step 4: Tighten memory if context blew it up

If a long prompt causes an out-of-memory error after the model loaded fine on short prompts, quantize the KV cache instead of lowering -c further:

bash

./build/bin/llama-server \
  -hf Qwen/Qwen3.6-27B-GGUF:Q4_K_M \
  --port 8080 -ngl 99 -c 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0

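This roughly halves the KV cache's memory footprint with little quality cost. Quantizing the V cache relies on flash attention in llama.cpp, so enable it (-fa) if your build does not turn it on automatically.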
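The "roughly halves" figure follows from how q8_0 stores values. The sketch below does the arithmetic, assuming the usual q8_0 block layout (32 one-byte values plus one two-byte scale) and an f16 cache as the baseline; the 8 GiB starting size is a hypothetical example.

```python
# q8_0 layout: blocks of 32 values, each block = 32 int8 values + one 2-byte scale.
Q8_0_BLOCK_VALUES = 32
Q8_0_BLOCK_BYTES = 32 * 1 + 2

F16_BYTES_PER_VALUE = 2.0
q8_0_bytes_per_value = Q8_0_BLOCK_BYTES / Q8_0_BLOCK_VALUES

ratio = q8_0_bytes_per_value / F16_BYTES_PER_VALUE
print(f"q8_0: {q8_0_bytes_per_value * 8:.1f} bits per value")
print(f"cache size relative to f16: {ratio:.0%}")

# Hypothetical example: a cache that would take 8 GiB at f16.
print(f"8.0 GiB at f16 -> {8.0 * ratio:.2f} GiB at q8_0")
```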

Path 3: vLLM, for serving an application

Use this path when the model backs a service handling concurrent requests, not a single terminal session. vLLM's continuous batching keeps the GPU busy across many simultaneous users, where Ollama and llama.cpp queue requests one after another unless you configure parallel slots.

Step 1: Install

bash

python -m venv vllm-env
source vllm-env/bin/activate
pip install "vllm>=0.10.1"

Step 2: Serve a model

bash

vllm serve Qwen/Qwen3.6-27B --max-model-len 16384

INFO ... Starting vLLM API server on http://0.0.0.0:8000
INFO ... Application startup complete.

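vLLM auto-detects common quantized checkpoints (AWQ, FP8, MXFP4); no extra flag is needed for most pre-quantized models on Hugging Face. The command above names the base repository: assuming it holds 16-bit weights, a 27B model needs roughly 54 GB for weights alone, so on a 24GB or 32GB card point vllm serve at a pre-quantized checkpoint of the model instead.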

Step 3: Verify with a concurrent load test

A single request looks identical to the other two paths:

bash

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3.6-27B", "messages": [{"role": "user", "content": "Reply with one word: confirmed."}]}'

The point of vLLM is what happens under concurrency, which is also the fastest way to confirm it is actually serving rather than just running. Fire 10 requests at once and confirm they all return in roughly the time of one:

bash

for i in $(seq 1 10); do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen3.6-27B", "messages": [{"role": "user", "content": "Say ok."}]}' \
    -o /dev/null -w "%{time_total}s\n" &
done; wait

If the ten times are all close to a single request's latency rather than ten times longer, continuous batching is working. Against an Ollama or llama.cpp server running a single slot, the same test shows times stacking up roughly linearly, because each request waits for the one before it.
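The same check can be scripted so the comparison is a number rather than an eyeball judgment. This is a sketch using only the standard library: `send_chat` targets the vLLM endpoint and model name from the steps above, while the helper names and the 0.5 slack threshold in `looks_batched` are arbitrary choices for illustration.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen3.6-27B"


def send_chat() -> None:
    """One blocking chat request to the vLLM server."""
    body = {"model": MODEL, "messages": [{"role": "user", "content": "Say ok."}]}
    req = urllib.request.Request(
        URL, data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        resp.read()


def timed(send: Callable[[], None]) -> float:
    """Run one call and return its latency in seconds."""
    start = time.perf_counter()
    send()
    return time.perf_counter() - start


def run_concurrent(send: Callable[[], None], n: int = 10) -> tuple[list[float], float]:
    """Fire n calls at once; return per-request latencies and total wall time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        latencies = list(pool.map(timed, [send] * n))
    return latencies, time.perf_counter() - start


def looks_batched(single_s: float, wall_s: float, n: int, slack: float = 0.5) -> bool:
    """Heuristic: wall time well under n serial requests suggests batching."""
    return wall_s < single_s * n * slack
```

A typical run measures a baseline with `timed(send_chat)`, then calls `run_concurrent(send_chat, 10)` and passes the baseline and wall time to `looks_batched`. Because `send` is a parameter, you can point the same harness at the Ollama or llama.cpp endpoints to compare.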

Choosing a quantization

Quantization is what makes local viable, and the choice follows directly from the iso-memory principle in the quantization deep dive: a bigger model at 4-bit beats a smaller model at 8-bit for the same memory.

Q4_K_M is the default sweet spot: roughly a quarter of the full-precision size with minimal quality loss for most tasks. It is the quant used in the llama.cpp commands above. Start here.
Q5_K_M or Q6_K if you have spare VRAM and want to close the small remaining gap, at proportionally more memory. Swap the tag in the -hf flag, for example Qwen/Qwen3.6-27B-GGUF:Q6_K.
Q8_0 is near-lossless but doubles the memory of Q4, which usually means dropping to a smaller model, the wrong trade.
MXFP4 is how gpt-oss ships natively, and current llama.cpp and Ollama have hardware-accelerated paths for it on Blackwell-class GPUs.

The mistake to avoid is spending memory on precision instead of parameters. If you can run a 30B at Q4 or a 14B at Q8 in the same VRAM, the 30B at Q4 is almost always the better model.
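The trade is easy to check with arithmetic. The sketch below estimates GGUF weight size from parameter count and bits per weight; the bits-per-weight values are approximations assumed for illustration, and real files vary by model.

```python
# Approximate bits per weight; assumed values, real GGUF files vary slightly.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5}


def gguf_size_gb(params_billion: float, quant: str) -> float:
    """Rough weight size in GB: parameters x bits per weight / 8."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


for params, quant in ((30, "Q4_K_M"), (14, "Q8_0"), (14, "Q4_K_M")):
    print(f"{params}B at {quant}: ~{gguf_size_gb(params, quant):.1f} GB")
```

Under these assumptions the 30B at Q4 and the 14B at Q8 land within a few GB of each other, so both target the same class of card, and the first carries twice the parameters.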

Gotchas and troubleshooting

Run through this list before assuming the setup itself failed; the four problems below cover nearly every local-inference complaint.

Symptom	Cause	Fix
Loads fine, OOMs only on long prompts	KV cache grew with context and competed with weights for memory	Lower `-c` / `--max-model-len`, or quantize the cache (`--cache-type-k q8_0 --cache-type-v q8_0` in llama.cpp)
Painfully slow generation despite "fitting"	Model spilled partly to CPU or shared memory	Check `-ngl` output (llama.cpp) or `ollama ps` (Ollama) for the GPU/CPU split; shrink the model or quant until it fully fits
Slow first token, fast after	Prompt processing (prefill) is compute-bound and scales with prompt length; separate from decode speed	Shorten the prompt, or enable flash attention (`-fa` in llama.cpp, on by default in vLLM)
Speed drops over a long session	Sustained inference heats a consumer GPU or laptop, throttling the clock	Improve cooling, or accept the laptop's sustained (not burst) tok/s as the real number

The real hardware bill

The three realistic local platforms make very different tradeoffs, and the right choice depends on whether you are optimizing for speed, capacity, or cost.

Platform	Memory	Bandwidth	Speed character	Best for
RTX 5090	32 GB	~1.79 TB/s	Fastest tok/s (≈186 on an 8B at Q4; less on bigger models)	Speed on models up to ~30B; a 70B at Q4 (~40 GB) does not fit
RTX 4090	24 GB	~1 TB/s	Fast, ~30% behind the 5090	Cheaper entry; models up to ~30B
Apple M4 Max	up to 128 GB	~546 GB/s	Slower per token (ceiling ≈14 tok/s on a 70B at Q4, from 546 GB/s ÷ 40 GB)	Big models that no consumer GPU fits
Rented cloud GPU	any	any	Whatever you pay for	Bursty or occasional use, no capex

The instructive contrast is the RTX 5090 against the M4 Max. The 5090 has more than triple the bandwidth, so it generates tokens far faster, but its 32GB caps the model size: a 70B at Q4 is about 40GB and does not fit on one card. The M4 Max, with up to 128GB of unified memory, runs models a single GPU cannot hold, but its lower bandwidth caps a 40GB model at roughly 546 / 40 ≈ 14 tokens per second, against the 100-plus a 5090 manages on a model that fits.

Cloud rental is the honest baseline to always compare against. If you run inference a few hours a week, renting a GPU by the hour can be cheaper than buying one, and you skip the depreciation. Local hardware pays off when you run it heavily, when your data cannot leave the building, or when you need offline operation.
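A back-of-the-envelope breakeven makes the rent-versus-buy comparison concrete. Every price in this sketch is a hypothetical placeholder, not a quoted rate, and the model ignores resale value and depreciation.

```python
def breakeven_hours(purchase_price: float, rental_per_hour: float,
                    power_cost_per_hour: float = 0.0) -> float:
    """Hours of use at which owning has cost the same as renting."""
    saving_per_hour = rental_per_hour - power_cost_per_hour
    if saving_per_hour <= 0:
        raise ValueError("owning never catches up at these rates")
    return purchase_price / saving_per_hour


# Hypothetical figures: substitute real prices for your card and provider.
hours = breakeven_hours(purchase_price=2000, rental_per_hour=0.50,
                        power_cost_per_hour=0.10)

for hours_per_week in (5, 40):
    print(f"{hours_per_week:>2} h/week: breakeven after ~{hours / hours_per_week:.0f} weeks")
```

With these placeholder numbers, light weekly use takes years to pay back the hardware and heavy use takes far less, which is the pattern the paragraph above describes.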

Local or API? The honest split

Running locally is not automatically the right call, which is the same open-weights pragmatism that runs through this whole site. Choose local when ownership actually matters:

Privacy and compliance: data that cannot go to a third-party API.
Cost at steady volume: high, predictable throughput where per-token API pricing adds up past the hardware cost.
Control and offline: no rate limits, no model deprecation under you, no network dependency.

Choose an API when the model is a component of a product whose value is the experience, not the weights: when you need frontier-tier quality that open models do not yet match, when you cannot justify the ops, or when traffic is bursty enough that you would rather rent capacity than own it.

Key Takeaways

Three concrete paths, in order of control. Ollama for a five-minute single-user setup, llama.cpp when you need to tune exact GPU offload and quantization, vLLM when the model serves an application under concurrent load.
Verify each step with a real request, not just a successful install. A curl to the OpenAI-compatible endpoint that returns the expected JSON is the only proof the server actually works.
Two numbers govern local inference. VRAM capacity decides whether a model fits; memory bandwidth decides how fast it runs. Most "why is it slow" problems are bandwidth, not capacity.
MoE models save bandwidth, not capacity. Qwen3.6-35B-A3B holds 35B parameters in memory but reads only ~3B per token, so it generates 3 to 5x faster than the dense Qwen3.6-27B while needing more VRAM to hold all the experts.
Q4_K_M is the default quant recommendation in this guide: roughly a quarter the size with minimal quality loss. Spend spare memory on more parameters before more precision.
vLLM's advantage shows up under concurrency, not in a single request. The 10-parallel-request test is the fastest way to confirm continuous batching is actually working.
The 5090 is the speed play, the Mac is the capacity play. A 32GB RTX 5090 generates fastest but caps model size; a 128GB M4 Max runs 70B-class models a consumer GPU cannot fit, at lower tokens per second.
Local is a system decision, not a default. Run local for privacy, steady-volume cost, and control; use an API for frontier quality, low ops, and bursty traffic.
