Ollama Complete Guide: Run Large Language Models Locally — From Zero to Production
Audience: Developers who want to self-host LLMs, AI enthusiasts looking to escape API costs, privacy-conscious teams building internal tooling
Difficulty: ⭐⭐ (basic terminal familiarity is enough)
Time to complete: 30–45 minutes
Why This Guide Exists
If you have spent any time building with LLMs in 2025–2026, you have probably hit at least one of these walls:
- Your API bill from OpenAI or Anthropic just crossed three figures and you are still in prototype mode
- Your company's security review flagged sending code snippets to third-party APIs
- You are building something that needs to work offline or in air-gapped environments
- You want to experiment with model behavior, prompt engineering, or fine-tuning — and closed models give you zero visibility
Ollama solves all of these. It packages the entire workflow — downloading, running, managing, and serving LLMs locally — into a single binary with a clean CLI and an OpenAI-compatible API.
This guide goes beyond the official docs. It covers the things you will actually run into when you move from "it works on my machine" to "this is running in my daily workflow."
Part 1: What Ollama Actually Is
Ollama is a local LLM runtime. At its core, it does three things:
- Downloads and manages models from a curated library (or from your own GGUF files)
- Serves those models over a local HTTP API (port 11434 by default)
- Provides a CLI that wraps the API in a convenient interactive shell
What it is not: it is not a model trainer, not a fine-tuning framework, and not a hosted service. It is a runtime — think of it as node or python but for LLMs.
Key facts at a glance
| Feature | Detail |
|---|---|
| License | MIT (open source) |
| Supported OS | macOS, Windows, Linux |
| Supported architectures | x86_64, ARM64 (Apple Silicon native) |
| API compatibility | OpenAI API (drop-in replacement for local use) |
| Model format | GGUF (via bundled llama.cpp) |
| Model library size | 100+ models, actively maintained |
| GPU support | NVIDIA (CUDA), AMD (ROCm on Linux), Apple Silicon (Metal) |
Part 2: Installation
2.1 Before You Install: Hardware Reality Check
The single most common mistake beginners make is downloading a model that does not fit in memory. Here is the practical guidance:
| Model Size | Minimum RAM | GPU VRAM for decent speed | Will it run? |
|---|---|---|---|
| 1B–3B | 4 GB | None (CPU is fine) | ✅ Fast |
| 7B–8B | 8 GB | 6 GB | ✅ Usable |
| 14B | 16 GB | 12 GB | ⚠️ Noticeable lag on CPU |
| 30B+ | 32 GB+ | 24 GB+ | ⚠️ Needs quantization |
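As a rough way to sanity-check the table above, a quantized model's footprint can be approximated from its parameter count and the number of bits stored per weight. The sketch below is a back-of-the-envelope heuristic, not how Ollama itself decides what fits. The function name and the 1.2 overhead multiplier (a guess at KV cache and runtime buffers) are assumptions made for illustration.

```python
def estimate_model_memory_gb(params_billions: float,
                             bits_per_weight: float = 4.0,
                             overhead: float = 1.2) -> float:
    """Rough memory estimate in GB for a quantized model.

    Weights take parameters * bits / 8 bytes. `overhead` is an assumed
    multiplier for KV cache and runtime buffers, not a measured value.
    """
    weight_gb = params_billions * bits_per_weight / 8  # billions of bytes = GB
    return round(weight_gb * overhead, 1)


if __name__ == "__main__":
    for size in (3, 7, 14, 32):
        print(f"{size}B at 4-bit: ~{estimate_model_memory_gb(size)} GB")
```

Treat the result as a lower bound: larger context windows increase memory use beyond what a fixed multiplier captures.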
2.2 Installing on macOS
Recommended method: download the app.
- Go to ollama.com and click Download
- Download the macOS zip file
- Unzip and drag Ollama to Applications
- On first launch, macOS will ask for permission — approve it in System Settings → Privacy & Security
- Ollama runs as a menu bar app (the llama icon). You interact with it via Terminal.
Alternative method (Homebrew):
brew install ollama
Verify:
ollama --version
Apple Silicon note: Ollama on M1/M2/M3 Macs is exceptionally well optimized. The Metal backend means models run on the GPU portion of the SoC, and unified memory means you are not constrained by separate VRAM. An M2 MacBook Air with 16 GB RAM can run qwen2.5:14b at usable speed — try doing that on an Intel laptop.
2.3 Installing on Windows
- Download OllamaSetup.exe from ollama.com
- Run the installer (administrator privileges required)
- After installation, Ollama runs as a system tray application
- Open PowerShell or Command Prompt and verify:
ollama --version
Windows version requirement: Windows 10 22H2 or later is required. If you are on 21H1 or earlier, the terminal will display control character garbage (←[?25h←[?25l). Update Windows before proceeding.
WSL2 users: You can run Ollama inside WSL2, but the native Windows version is better integrated. If you do run it in WSL2 and want GPU access, you need CUDA support in WSL2 set up separately.
2.4 Installing on Linux
The officially recommended one-liner:
curl -fsSL https://ollama.com/install.sh | sh
The script detects your distribution and installs the appropriate package. After installation, Ollama runs as a systemd service:
sudo systemctl status ollama
sudo systemctl start ollama # if not already running
sudo systemctl enable ollama # start on boot
GPU support on Linux: The install script does not install GPU drivers. For NVIDIA GPUs, install the proprietary driver separately before running Ollama. For AMD GPUs, you need ROCm 7+ (see the troubleshooting section below).
Part 3: Running Your First Model
3.1 The Simplest Possible Start
ollama run deepseek-r1:7b
The first run downloads the model (~4–5 GB). Subsequent runs skip the download and only need to load the model into memory.
Once downloaded, you will see a >>> prompt. This is the interactive REPL:
>>> What is the difference between a list and a tuple in Python?
Type /bye to exit.
3.2 Which Model Should You Actually Use?
The Ollama model library has 100+ entries. Here is the practical shortlist for 2026:
| Model | Best version | When to use it |
|---|---|---|
| DeepSeek-R1 | 7b, 14b | Coding, math, reasoning. The current open-weight champion. |
| Qwen2.5 | 7b, 14b, 32b | Chinese-language tasks. Also strong on code. |
| Llama 3.1 / 3.2 | 8b (3.1); 1b, 3b (3.2) | General purpose. Best ecosystem support. |
| Gemma 2 | 9b | Fast inference, small footprint. Google's best open model. |
| Mistral / Mixtral | 7b, 8x7b | Good balance of speed and capability. European provenance. |
| Phi-3 | mini, medium | Microsoft's small model. Surprisingly capable for the size. |
# This opens the library in your browser
start https://ollama.com/library # Windows
open https://ollama.com/library # macOS
xdg-open https://ollama.com/library # Linux
3.3 Essential CLI Commands
ollama run <name> # Pull (if needed) and start interactive session
ollama pull <name> # Download model without starting session
ollama list # Show locally downloaded models and their sizes
ollama show <name> # Show model metadata (size, template, parameters)
ollama rm <name> # Delete model from disk
ollama cp <src> <dst> # Duplicate a model (useful before customizing)
ollama push <name> # Push a custom model to a registry (advanced)
Part 4: Customizing Model Behavior with Modelfile
Ollama uses a Modelfile (analogous to a Dockerfile) to define custom model configurations.
4.1 A Practical Example
Suppose you want a version of DeepSeek-R1 that:
- Always responds in English
- Has a lower temperature (more deterministic output)
- Has a longer context window
Create a file called Modelfile:
FROM deepseek-r1:7b
SYSTEM """
You are a senior software engineer. You explain technical concepts clearly,
use concrete examples, and always include working code snippets.
Respond in English unless the user explicitly asks for another language.
"""
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
Build it:
ollama create my-deepseek -f Modelfile
Run it:
ollama run my-deepseek
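If you maintain several variants of a model, the Modelfile text can be generated rather than edited by hand. The sketch below covers only the three instructions used in the example above (FROM, SYSTEM, PARAMETER). The helper name render_modelfile is hypothetical and not part of Ollama.

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Return Modelfile text with FROM, SYSTEM, and PARAMETER lines.

    Assumes `system` does not itself contain triple quotes, which would
    end the SYSTEM block early.
    """
    lines = [f"FROM {base}", f'SYSTEM """\n{system.strip()}\n"""']
    for name, value in params.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "deepseek-r1:7b",
        "You are a senior software engineer. Respond in English.",
        {"temperature": 0.2, "num_ctx": 8192, "top_p": 0.9},
    )
    print(text)
```

Write the returned text to a file named Modelfile and build it with ollama create as shown above.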
4.2 Parameters That Matter
Not all parameters in the docs are equally useful. Here are the ones that actually affect output quality:
| Parameter | What it does | Good value for |
|---|---|---|
| temperature | Controls randomness (0 = deterministic, 1 = creative) | 0.2–0.4 for code/QA; 0.7–0.9 for creative writing |
| num_ctx | Context window size (tokens) | 4096 (default); 8192+ for long documents |
| num_predict | Max tokens to generate | -1 (default, no limit); set e.g. 2048 to cap long outputs |
| repeat_penalty | Penalizes repetitive output | 1.1 (default); increase to 1.2–1.3 if the model loops |
| top_p | Nucleus sampling threshold | 0.9 (default); lower for more focused output |
| stop | Stop sequences (list) | [ "", "###" ] for structured output |
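To make temperature and top_p less abstract, here is a small illustration of the general techniques: temperature-scaled softmax and nucleus (top-p) filtering over a toy set of logits. It is a teaching sketch of the textbook definitions, not the sampling code that Ollama or llama.cpp actually runs, and both function names are made up for this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens them.

    `temperature` must be greater than 0 in this sketch.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Return indices of the smallest top-ranked set whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]
    for t in (0.2, 1.0):
        probs = softmax_with_temperature(logits, t)
        print(t, [round(p, 3) for p in probs], top_p_filter(probs, 0.9))
```

Running it shows the pattern behind the table: at a low temperature almost all probability sits on the top token, so the nucleus shrinks and output becomes more repeatable.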
Part 5: Using Ollama as an API Server
This is where Ollama becomes useful for real applications. When Ollama is running, it automatically serves an HTTP API on http://localhost:11434.
5.1 The Generate Endpoint
curl http://localhost:11434/api/generate -d '{
"model": "deepseek-r1:7b",
"prompt": "Write a Python function to detect cycle in a linked list",
"stream": false,
"options": {
"temperature": 0.3
}
}'
Setting "stream": false returns the full response as a single JSON object. For streaming (better UX in chat apps), omit it or set it to true.
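With streaming enabled, the server sends one JSON object per line. Each object carries a text fragment in its response field, and the last one has done set to true. The sketch below shows one way to reassemble those fragments. The function name join_stream is an assumption for illustration, and the sample lines are hand-written rather than captured from a real server.

```python
import json


def join_stream(lines):
    """Concatenate the `response` fragments from a streamed /api/generate body.

    `lines` can be any iterable of str or bytes, one JSON object per line.
    """
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the stream
    return "".join(parts)


if __name__ == "__main__":
    sample = [
        '{"response": "Hel", "done": false}',
        '{"response": "lo", "done": false}',
        '{"response": "", "done": true}',
    ]
    print(join_stream(sample))
```

In a real client you would pass the HTTP response object itself as lines, for example the object returned by urllib.request.urlopen, and print each fragment as it arrives instead of collecting them.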
5.2 The Chat Endpoint (native API; see 5.4 for the OpenAI-compatible one)
curl http://localhost:11434/api/chat -d '{
"model": "deepseek-r1:7b",
"messages": [
{ "role": "user", "content": "What is dependency injection?" }
],
"stream": false
}'
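The chat endpoint is stateless: the server does not remember earlier turns, so the client resends the whole messages list on every request. A minimal sketch of that bookkeeping follows. The two helper names are hypothetical, and the response shape assumed here is the non-streaming one, where the reply sits under the message key.

```python
def build_chat_payload(model, history, user_text):
    """Append the user's message to `history` and return the /api/chat body."""
    history.append({"role": "user", "content": user_text})
    return {"model": model, "messages": list(history), "stream": False}


def record_reply(history, response_json):
    """Store the assistant message from a non-streaming /api/chat response."""
    message = response_json["message"]
    history.append({"role": message["role"], "content": message["content"]})
    return message["content"]


if __name__ == "__main__":
    history = [{"role": "system", "content": "You are a helpful coding tutor."}]
    payload = build_chat_payload("deepseek-r1:7b", history, "What is dependency injection?")
    print(payload)
    # After POSTing `payload` to /api/chat, pass the parsed JSON to record_reply
    # so the next request includes the assistant's answer.
```

Because history grows with every turn, long conversations eventually exceed num_ctx. Trimming or summarizing old turns is left out of this sketch.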
5.3 Using the Python Library
Ollama provides an official Python package:
pip install ollama
import ollama
# Simple generation
response = ollama.generate(
model='deepseek-r1:7b',
prompt='Explain the CAP theorem in one paragraph'
)
print(response['response'])
# Chat with conversation history
conversation = [
{'role': 'system', 'content': 'You are a helpful coding tutor.'},
{'role': 'user', 'content': 'What is a decorator in Python?'},
]
response = ollama.chat(model='deepseek-r1:7b', messages=conversation)
print(response['message']['content'])
# Streaming response (for real-time display)
stream = ollama.generate(
model='deepseek-r1:7b',
prompt='Count from 1 to 10',
stream=True
)
for chunk in stream:
    print(chunk['response'], end='', flush=True)
5.4 Drop-in OpenAI Compatibility
If you have existing code that uses the OpenAI Python SDK, you can point it at Ollama with a one-line change:
from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='not-needed' # Ollama ignores this field locally
)
response = client.chat.completions.create(
model='deepseek-r1:7b',
messages=[{'role': 'user', 'content': 'Hello'}],
temperature=0.3
)
print(response.choices[0].message.content)
This means you can develop against OpenAI's API and switch to local models for production — or vice versa — without rewriting your application logic.
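One way to make that switch without touching application code is to derive the client settings from the environment. The sketch below assumes a variable named LLM_BACKEND, which is invented for this example. OPENAI_API_KEY is the variable the OpenAI SDK conventionally reads. The function only builds keyword arguments, so it runs without the openai package installed.

```python
import os


def client_kwargs(env=None):
    """Return keyword arguments for OpenAI(...) based on LLM_BACKEND.

    LLM_BACKEND is a hypothetical variable: "local" (the default) targets
    Ollama, anything else targets the hosted API.
    """
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "local") == "local":
        return {"base_url": "http://localhost:11434/v1", "api_key": "not-needed"}
    return {"api_key": env["OPENAI_API_KEY"]}


if __name__ == "__main__":
    print(client_kwargs({}))
```

Usage would then be client = OpenAI(**client_kwargs()). The model name still has to match the backend, so it is worth reading it from configuration as well.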
Part 6: Troubleshooting (The Stuff Docs Do Not Tell You)
Problem 1: Model download stalls or fails repeatedly
Why it happens: Ollama pulls models from hosted registries. In some regions, connectivity is inconsistent.
Fixes:
- Update Ollama to the latest version (run ollama --version and check against ollama.com)
- Run ollama pull again; it supports resumable downloads
- If a specific model version consistently fails, try a different quantization (e.g., deepseek-r1:7b-q4_0 instead of deepseek-r1:7b; exact tag names vary by model, so check the model's page in the library)
- Check disk space. Models are large. ollama list shows sizes.
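Since interrupted pulls resume where they left off, rerunning the command can be automated. The wrapper below is a sketch under simple assumptions: it treats a nonzero exit code from ollama pull as a failure and waits a little longer after each failed attempt. The function name and the run and sleep parameters (there so it can be exercised without the CLI) are illustrative choices.

```python
import subprocess
import time


def pull_with_retry(model, attempts=3, delay=5.0,
                    run=subprocess.run, sleep=time.sleep):
    """Run `ollama pull <model>` until it succeeds or attempts run out."""
    for attempt in range(1, attempts + 1):
        result = run(["ollama", "pull", model])
        if result.returncode == 0:
            return True
        if attempt < attempts:
            sleep(delay * attempt)  # wait longer after each failure
    return False


if __name__ == "__main__":
    import shutil
    if shutil.which("ollama"):  # only attempt a real pull if the CLI exists
        print(pull_with_retry("deepseek-r1:7b"))
```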
Problem 2: Response is extremely slow
Why it happens: The model is running on CPU because GPU was not detected, OR the model is too large for available memory and the system is swapping.
How to diagnose:
# Check whether a loaded model is running on GPU or CPU
ollama ps # The PROCESSOR column shows the CPU/GPU split
# Check memory usage while model is loading
# macOS: Activity Monitor → Memory
# Windows: Task Manager → Performance → Memory
Fixes:
- Switch to a smaller model
- Use a quantized version (q4_0 or q5_0 in the model tag)
- On Linux, verify GPU drivers are loaded (nvidia-smi for NVIDIA)
- On macOS with Apple Silicon, make sure you are not running under Rosetta (check Activity Monitor → Kind column)
Problem 3: GPU not detected on Linux with NVIDIA
This is the most common Linux issue. Step-by-step diagnosis:
# Step 1: Check if NVIDIA driver is working
nvidia-smi
# If this fails, the driver is not installed correctly
# Step 2: Check if Ollama service can see the GPU
journalctl -u ollama --no-pager | grep -i gpu
# Step 3: Check Ollama logs for GPU discovery
journalctl -u ollama --no-pager --follow
# Then start a model and watch the logs
Common fix: The nvidia-container-toolkit package is missing (if running in Docker), or the driver version is too old. Ollama requires NVIDIA driver 525+ for CUDA 12 support.
Problem 4: GPU not detected on Linux with AMD
Ollama on Linux with AMD GPUs requires ROCm 7+. If your GPU is not detected:
# Check if ROCm is installed
rocminfo
# If rocminfo is not found, install ROCm 7 from AMD's official docs:
# https://rocm.docs.amd.com/projects/install-on-linux/en/latest/
If ROCm is installed but Ollama still uses CPU, check the Ollama logs:
journalctl -u ollama | grep -i "amd\|rocm\|gpu"
Problem 5: Chinese/non-English output quality is poor
Why it happens: Not all models are multilingual. Llama 3 (the base model) is primarily trained on English. For Chinese, Japanese, or Korean, use:
- Qwen2.5 (best for Chinese)
- DeepSeek-R1 (strong multilingual)
- Gemma 2 with instruction tuning
Also make sure your prompt itself is in the target language — models follow the language of the prompt.
Problem 6: connection refused on localhost:11434
Why it happens: Ollama is not running, or it is running on a different port.
Fix:
- macOS/Windows: Check menu bar / system tray for the Ollama icon. If it is not there, relaunch the app.
- Linux: sudo systemctl status ollama
- If the service is running but the port is wrong: lsof -i :11434 (macOS/Linux) or netstat -ano | findstr 11434 (Windows)
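The same check can be scripted, which is handy as a startup guard in an application that depends on Ollama. This sketch only tests whether something accepts TCP connections on the port. It does not verify that the listener is actually Ollama, and the function name is an assumption.

```python
import socket


def is_listening(host="127.0.0.1", port=11434, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers connection refused and timeouts
        return False


if __name__ == "__main__":
    state = "reachable" if is_listening() else "not reachable"
    print(f"localhost:11434 is {state}")
```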
Problem 7: Running out of disk space
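Models are stored in ~/.ollama/models (on Linux, when Ollama runs as the systemd service set up by the install script, in /usr/share/ollama/.ollama/models). A 7B model takes ~4–5 GB, a 14B model ~9 GB, and they add up.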
# See all models and their sizes
ollama list
# Remove models you no longer need
ollama rm <model-name>
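For a scripted view of what is taking up space, the local API exposes the installed models at GET /api/tags, where each entry has a name and a size in bytes. The sketch below ranks models from that JSON. The function name is hypothetical, and the sample data is invented rather than taken from a real installation.

```python
def largest_models(tags_json, top=5):
    """Return (name, size_in_GB) pairs sorted from largest to smallest."""
    models = tags_json.get("models", [])
    ranked = sorted(models, key=lambda m: m["size"], reverse=True)
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in ranked[:top]]


if __name__ == "__main__":
    sample = {
        "models": [
            {"name": "deepseek-r1:7b", "size": 4_700_000_000},
            {"name": "qwen2.5:14b", "size": 9_000_000_000},
        ]
    }
    for name, gigabytes in largest_models(sample):
        print(f"{name}: {gigabytes} GB")
```

The top entries are the natural candidates for ollama rm.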
Part 7: Making It Actually Useful — Integrations
Ollama by itself is a CLI tool. The real power comes from connecting it to tools you already use.
7.1 Open WebUI (the "local ChatGPT" experience)
The single most impactful integration. Open WebUI gives you a ChatGPT-like web interface that connects to Ollama:
docker run -d -p 8080:8080 --name open-webui \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
ghcr.io/open-webui/open-webui:main
Then open http://localhost:8080. You get:
- Chat interface with history
- Support for multiple models
- Document upload (RAG)
- Image understanding (if you have a vision model)
- User management (if you want to share it with a team)
7.2 Continue (VS Code integration)
Install the Continue extension in VS Code, then configure it to use Ollama:
{
"models": [
{
"title": "Ollama DeepSeek",
"provider": "ollama",
"model": "deepseek-r1:7b"
}
]
}
Now you have local AI code completion and chat directly in VS Code.
7.3 LangChain Integration
from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
llm = OllamaLLM(model="deepseek-r1:7b", temperature=0.3)
prompt = ChatPromptTemplate.from_template(
"Write a {language} function that {task}"
)
chain = prompt | llm
result = chain.invoke({
"language": "Python",
"task": "reads a CSV file and prints the first 5 rows"
})
print(result)
Part 8: Serving Beyond localhost
By default, Ollama binds to 127.0.0.1 (localhost only). To expose it to your local network:
# Linux/macOS
export OLLAMA_HOST=0.0.0.0:11434
ollama serve
# Windows PowerShell
$env:OLLAMA_HOST="0.0.0.0:11434"
ollama serve
Security warning: Only do this on trusted networks. There is no authentication on the Ollama API. Anyone on your network who can reach port 11434 can use your models and consume your compute. For production use, put Ollama behind a reverse proxy with authentication (e.g., nginx + OAuth).
Part 9: Where to Go From Here
You now have a working local LLM setup. Here are the highest-leverage next steps:
- Build a local RAG system — combine Ollama with a vector database (Chroma, Qdrant) to chat with your own documents
- Set up a team instance — run Ollama on a server with a good GPU and let your team access it via Open WebUI
- Experiment with model routing — use a small/fast model for simple queries and route complex queries to a larger model
- Explore the Ollama ecosystem — tools like OpenClaw, n8n, and Continue all have Ollama integrations
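The model routing idea from the list above can start out very simple. The sketch below uses a crude heuristic (prompt length plus a few trigger words) to choose between two model tags. The thresholds, keywords, and function name are all assumptions for illustration, and a real router would likely need something more robust.

```python
def pick_model(prompt, small="deepseek-r1:7b", large="deepseek-r1:14b",
               max_simple_words=40):
    """Pick a model tag using a length and keyword heuristic (illustrative only)."""
    keywords = ("prove", "debug", "refactor", "step by step")
    text = prompt.lower()
    is_complex = (len(text.split()) > max_simple_words
                  or any(keyword in text for keyword in keywords))
    return large if is_complex else small


if __name__ == "__main__":
    for question in ("What is a tuple?", "Debug this stack trace for me"):
        print(question, "->", pick_model(question))
```

The returned tag can be passed as the model argument to ollama.chat or to the OpenAI-compatible client.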
Resources:
- Ollama docs: https://ollama.com/docs
- Model library: https://ollama.com/library
- GitHub: https://github.com/ollama/ollama
- Discord: https://discord.gg/ollama
- Open WebUI: https://openwebui.com
*This guide is part of the repohot.com open-source AI tools tutorial series. Discover more tools and tutorials at repohot.com.*