Ollama Complete Guide: Run Large Language Models Locally — From Zero to Production

2026-06-17 · AI Local Deployment

Audience: Developers who want to self-host LLMs, AI enthusiasts looking to escape API costs, privacy-conscious teams building internal tooling

Difficulty: ⭐⭐ (basic terminal familiarity is enough)

Time to complete: 30–45 minutes

Why This Guide Exists

If you have spent any time building with LLMs in 2025–2026, you have probably hit at least one of these walls:

Your API bill from OpenAI or Anthropic just crossed three figures and you are still in prototype mode
Your company's security review flagged sending code snippets to third-party APIs
You are building something that needs to work offline or in air-gapped environments
You want to experiment with model behavior, prompt engineering, or fine-tuning — and closed models give you zero visibility

Ollama solves all of these. It packages the entire workflow — downloading, running, managing, and serving LLMs locally — into a single binary with a clean CLI and an OpenAI-compatible API.

This guide goes beyond the official docs. It covers the things you will actually run into when you move from "it works on my machine" to "this is running in my daily workflow."

Part 1: What Ollama Actually Is

Ollama is a local LLM runtime. At its core, it does three things:

Downloads and manages models from a curated library (or from your own GGUF files)
Serves those models over a local HTTP API (port 11434 by default)
Provides a CLI that wraps the API in a convenient interactive shell

What it is not: it is not a model trainer, not a fine-tuning framework, and not a hosted service. It is a runtime — think of it as node or python but for LLMs.

Key facts at a glance

Feature	Detail
License	MIT (open source)
Supported OS	macOS, Windows, Linux
Supported architectures	x86_64, ARM64 (Apple Silicon native)
API compatibility	OpenAI API (drop-in replacement for local use)
Model format	GGUF (via bundled llama.cpp)
Model library size	100+ models, actively maintained
GPU support	NVIDIA (CUDA), AMD (ROCm on Linux), Apple Silicon (Metal)

Part 2: Installation

2.1 Before You Install: Hardware Reality Check

The single most common mistake beginners make is downloading a model that does not fit in memory. Here is the practical guidance:

Model Size	Minimum RAM	GPU VRAM for decent speed	Will it run?
1B–3B	4 GB	None (CPU is fine)	✅ Fast
7B–8B	8 GB	6 GB	✅ Usable
14B	16 GB	12 GB	⚠️ Noticeable lag on CPU
30B+	32 GB+	24 GB+	⚠️ Needs quantization

Rule of thumb: if you are on a MacBook with 8 GB RAM, stick to 7B models or smaller. If you have 16 GB, you can run 14B comfortably. On an M-series Mac with unified memory the rules are more generous, because the GPU shares system RAM: a 24 GB machine can fit a 4-bit-quantized 30B model in memory, though with little headroom left for other apps or long contexts.
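
To sanity-check a model against your machine before downloading it, a back-of-envelope estimate can help. The sketch below assumes roughly 4.5 bits per weight (a ballpark for 4-bit quantization) plus a fixed allowance for context and runtime overhead. Both numbers, and the function names, are illustrative assumptions rather than Ollama's own accounting.

```python
def estimate_model_gb(params_billion: float,
                      bits_per_weight: float = 4.5,
                      overhead_gb: float = 1.5) -> float:
    """Rough memory footprint in GB for a quantized model.

    bits_per_weight and overhead_gb are assumed ballpark values,
    not figures reported by Ollama.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)


def fits(params_billion: float, available_gb: float) -> bool:
    """True if the rough estimate stays within the available memory."""
    return estimate_model_gb(params_billion) <= available_gb
```

For example, `fits(7, 8)` checks a 7B model against 8 GB of memory. Treat the result as a first filter only; real usage grows with context length.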

2.2 Installing on macOS

Recommended method — download the app:

Go to ollama.com and click Download
Download the macOS zip file
Unzip and drag Ollama to Applications
On first launch, macOS will ask for permission — approve it in System Settings → Privacy & Security
Ollama runs as a menu bar app (the llama icon). You interact with it via Terminal.

Alternative — Homebrew:

brew install ollama

Verify:

ollama --version

Apple Silicon note: Ollama on M1/M2/M3 Macs is exceptionally well optimized. The Metal backend means models run on the GPU portion of the SoC, and unified memory means you are not constrained by separate VRAM. An M2 MacBook Air with 16 GB RAM can run qwen2.5:14b at usable speed — try doing that on an Intel laptop.

2.3 Installing on Windows

Download OllamaSetup.exe from ollama.com
Run the installer (administrator privileges required)
After installation, Ollama runs as a system tray application
Open PowerShell or Command Prompt and verify:

ollama --version

Windows version requirement: Windows 10 22H2 or later is required. If you are on 21H1 or earlier, the terminal will display control character garbage (←[?25h←[?25l). Update Windows before proceeding.

WSL2 users: You can run Ollama inside WSL2, but the native Windows version is better integrated. If you do run it in WSL2 and want GPU access, you need CUDA support in WSL2 set up separately.

2.4 Installing on Linux

The officially recommended one-liner:

curl -fsSL https://ollama.com/install.sh | sh

The script detects your distribution and installs the appropriate package. After installation, Ollama runs as a systemd service:

sudo systemctl status ollama
sudo systemctl start ollama    # if not already running
sudo systemctl enable ollama   # start on boot

GPU support on Linux: The install script does not install GPU drivers. For NVIDIA GPUs, install the proprietary driver separately before running Ollama. For AMD GPUs, you need ROCm 7+ (see the troubleshooting section below).

Part 3: Running Your First Model

3.1 The Simplest Possible Start

ollama run deepseek-r1:7b

First run will download the model (~4–5 GB). Subsequent runs skip the download and only need a few seconds to load the model into memory.

Once downloaded, you will see a >>> prompt. This is the interactive REPL:

>>> What is the difference between a list and a tuple in Python?

Type /bye to exit.

3.2 Which Model Should You Actually Use?

The Ollama model library has 100+ entries. Here is the practical shortlist for 2026:

Model	Best version	When to use it
DeepSeek-R1	7b, 14b	Coding, math, reasoning. The current open-weight champion.
Qwen2.5	7b, 14b, 32b	Chinese-language tasks. Also strong on code.
Llama 3.1 / 3.2	8b (3.1), 3b (3.2)	General purpose. Best ecosystem support.
Gemma 2	9b	Fast inference, small footprint. Google's best open model.
Mistral / Mixtral	7b, 8x7b	Good balance of speed and capability. European provenance.
Phi-3	mini, medium	Microsoft's small model. Surprisingly capable for the size.

How to browse all available models:

# This opens the library in your browser
start https://ollama.com/library    # Windows
open https://ollama.com/library     # macOS
xdg-open https://ollama.com/library # Linux

3.3 Essential CLI Commands

ollama run <name>      # Pull (if needed) and start interactive session
ollama pull <name>     # Download model without starting session
ollama list            # Show locally downloaded models and their sizes
ollama show <name>     # Show model metadata (size, template, parameters)
ollama rm <name>       # Delete model from disk
ollama cp <src> <dst>  # Duplicate a model (useful before customizing)
ollama push <name>     # Push a custom model to a registry (advanced)
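
The same inventory that `ollama list` prints is also available over HTTP, which is handy in scripts. The sketch below reads the `/api/tags` endpoint of the local server and formats the result. The helper names `fetch_local_models` and `summarize` are made up for this example, and the network call only works while Ollama is running.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this guide


def fetch_local_models(base_url: str = OLLAMA_URL) -> list:
    """Return the model entries reported by GET /api/tags."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp).get("models", [])


def summarize(models: list) -> list:
    """Turn raw entries into 'name  size' lines, largest model first."""
    ordered = sorted(models, key=lambda m: m.get("size", 0), reverse=True)
    return [f"{m['name']}  {m.get('size', 0) / 1e9:.1f} GB" for m in ordered]
```

With the server up, `print("\n".join(summarize(fetch_local_models())))` prints one line per downloaded model.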

Part 4: Customizing Model Behavior with Modelfile

Ollama uses a Modelfile (analogous to a Dockerfile) to define custom model configurations.

4.1 A Practical Example

Suppose you want a version of DeepSeek-R1 that:
- Always responds in English
- Has a lower temperature (more deterministic output)
- Has a longer context window

Create a file called Modelfile:

FROM deepseek-r1:7b

SYSTEM """
You are a senior software engineer. You explain technical concepts clearly,
use concrete examples, and always include working code snippets.
Respond in English unless the user explicitly asks for another language.
"""

PARAMETER temperature 0.2
PARAMETER num_ctx 8192
PARAMETER top_p 0.9

Build it:

ollama create my-deepseek -f Modelfile

Run it:

ollama run my-deepseek
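
If you maintain several variants of the same base model, generating the Modelfile text from a small script keeps them consistent. This is a minimal sketch; `render_modelfile` is a hypothetical helper, and it only covers the FROM, SYSTEM, and PARAMETER instructions used in the example above.

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Build Modelfile text from a base model, a system prompt, and parameters."""
    lines = [f"FROM {base}", "", f'SYSTEM """\n{system.strip()}\n"""', ""]
    for name, value in params.items():
        # One PARAMETER line per setting, in the order given
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    "deepseek-r1:7b",
    "You are a senior software engineer. Respond in English.",
    {"temperature": 0.2, "num_ctx": 8192, "top_p": 0.9},
)
```

Write `modelfile_text` to a file named Modelfile and build it with the same `ollama create` command as before.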

4.2 Parameters That Matter

Not all parameters in the docs are equally useful. Here are the ones that actually affect output quality:

Parameter	What it does	Good value for
`temperature`	Controls randomness (0 = deterministic, 1 = creative)	0.2–0.4 for code/QA; 0.7–0.9 for creative writing
`num_ctx`	Context window size (tokens)	4096 (default); 8192+ for long documents
`num_predict`	Max tokens to generate	-1 (default, no limit); set an explicit cap such as 2048 to bound long outputs
`repeat_penalty`	Penalizes repetitive output	1.1 (default); increase to 1.2–1.3 if the model loops
`top_p`	Nucleus sampling threshold	0.9 (default); lower for more focused output
`stop`	Stop sequences (in a Modelfile, one `PARAMETER stop` line per sequence)	e.g. `"###"` for structured output
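
These settings do not have to be baked into a Modelfile. The `options` object of an API request accepts the same parameter names, so you can vary them per call. The sketch below builds request bodies for two presets; the preset names and values are assumptions chosen from the ranges in the table, not recommended defaults.

```python
# Hypothetical presets drawn from the ranges in the table above
CODE_OPTIONS = {"temperature": 0.2, "repeat_penalty": 1.1, "stop": ["###"]}
CREATIVE_OPTIONS = {"temperature": 0.8, "top_p": 0.9}


def build_generate_payload(model: str, prompt: str, **options) -> dict:
    """Build a non-streaming /api/generate request body with per-call options."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if options:
        payload["options"] = options
    return payload


code_request = build_generate_payload(
    "deepseek-r1:7b", "Write a binary search in Python", **CODE_OPTIONS
)
```

The resulting dictionary can be serialized with `json.dumps` and sent as the body of a POST request to the generate endpoint.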

Part 5: Using Ollama as an API Server

This is where Ollama becomes useful for real applications. When Ollama is running, it automatically serves an HTTP API on http://localhost:11434.

5.1 The Generate Endpoint

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "Write a Python function to detect cycle in a linked list",
  "stream": false,
  "options": {
    "temperature": 0.3
  }
}'

Setting "stream": false returns the full response as a single JSON object. For streaming (better UX in chat apps), omit it or set it to true.
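
When streaming is on, the server sends one JSON object per line, each carrying a fragment in `response`, with `done` set to true on the last one. A minimal way to consume that with only the standard library might look like the sketch below. The function names are illustrative, and `stream_generate` needs a running server.

```python
import json
import urllib.request


def iter_fragments(lines):
    """Yield text fragments from a streamed /api/generate reply (one JSON object per line)."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break


def stream_generate(prompt, model="deepseek-r1:7b",
                    base_url="http://localhost:11434"):
    """Print a streamed completion as it arrives."""
    body = json.dumps({"model": model, "prompt": prompt}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The response object yields the body line by line
        for text in iter_fragments(resp):
            print(text, end="", flush=True)
```

Keeping the parsing in `iter_fragments` separate from the network call makes it easy to test against recorded output.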

5.2 The Chat Endpoint (native API)

curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:7b",
  "messages": [
    { "role": "user", "content": "What is dependency injection?" }
  ],
  "stream": false
}'
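
The chat endpoint is stateless: the server does not remember earlier turns, so the client resends the whole message list each time. The sketch below keeps that history in a plain list. `chat_once` and `ask` are hypothetical helper names, and `chat_once` requires a running server.

```python
import json
import urllib.request


def chat_once(messages, model="deepseek-r1:7b",
              base_url="http://localhost:11434"):
    """Send the full message list to /api/chat and return the assistant message."""
    body = json.dumps({"model": model, "messages": messages, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/chat", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]


def ask(history, question, send=chat_once):
    """Append the question, call the model, and record its reply in the history."""
    history.append({"role": "user", "content": question})
    reply = send(history)
    history.append(reply)
    return reply["content"]
```

Calling `ask(history, ...)` repeatedly with the same list gives the model the earlier turns as context. Trimming old turns becomes necessary once the history approaches the context window.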

5.3 Using the Python Library

Ollama provides an official Python package:

pip install ollama

import ollama

# Simple generation
response = ollama.generate(
    model='deepseek-r1:7b',
    prompt='Explain the CAP theorem in one paragraph'
)
print(response['response'])

# Chat with conversation history
conversation = [
    {'role': 'system', 'content': 'You are a helpful coding tutor.'},
    {'role': 'user', 'content': 'What is a decorator in Python?'},
]
response = ollama.chat(model='deepseek-r1:7b', messages=conversation)
print(response['message']['content'])

# Streaming response (for real-time display)
stream = ollama.generate(
    model='deepseek-r1:7b',
    prompt='Count from 1 to 10',
    stream=True
)
for chunk in stream:
    print(chunk['response'], end='', flush=True)

5.4 Drop-in OpenAI Compatibility

If you have existing code that uses the OpenAI Python SDK, you can point it at Ollama by changing only the client setup and the model name:

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='not-needed'  # Ollama ignores this field locally
)

response = client.chat.completions.create(
    model='deepseek-r1:7b',
    messages=[{'role': 'user', 'content': 'Hello'}],
    temperature=0.3
)
print(response.choices[0].message.content)

This means you can develop against OpenAI's API and switch to local models for production — or vice versa — without rewriting your application logic.
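
One way to make that switch without touching application code is to pick the endpoint from an environment variable. The sketch below is an assumption about how you might wire it: the variable name `LLM_BACKEND` is invented for this example, while `OPENAI_API_KEY` is the variable the OpenAI SDK conventionally reads.

```python
import os


def client_settings(env=None) -> dict:
    """Choose base_url and api_key for the OpenAI SDK from the environment."""
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "ollama") == "openai":  # LLM_BACKEND is a made-up name
        return {
            "base_url": "https://api.openai.com/v1",
            "api_key": env.get("OPENAI_API_KEY", ""),
        }
    # Default: the local Ollama server; the key is a placeholder it ignores
    return {"base_url": "http://localhost:11434/v1", "api_key": "not-needed"}
```

The client is then created with `OpenAI(**client_settings())`. Model names still differ between backends, so keep the model name in configuration as well.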

Part 6: Troubleshooting (The Stuff Docs Do Not Tell You)

Problem 1: Model download stalls or fails repeatedly

Why it happens: Ollama pulls models from hosted registries. In some regions, connectivity is inconsistent. Fixes:

Update Ollama to the latest version (ollama --version and check against ollama.com)
Run ollama pull again — it supports resumable downloads
If a specific model version consistently fails, try a different quantization of the same model. Tag names vary per model, so pick one from the Tags list on its page at ollama.com/library instead of guessing.
Check disk space. Models are large. ollama list shows sizes.
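
Because pulls are resumable, simply retrying is often enough on a flaky connection. The sketch below wraps `ollama pull` in a retry loop with exponential backoff. The attempt count and delays are arbitrary assumptions, and the helper names are made up for this example.

```python
import subprocess
import time


def pull(model: str) -> bool:
    """Run `ollama pull`; True when the command exits successfully."""
    return subprocess.run(["ollama", "pull", model]).returncode == 0


def pull_with_retry(model, attempts=5, delay=2.0, run=pull, sleep=time.sleep) -> bool:
    """Retry a pull, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        if run(model):
            return True
        if attempt < attempts:
            # Exponential backoff: delay, 2*delay, 4*delay, ...
            sleep(delay * 2 ** (attempt - 1))
    return False
```

`pull_with_retry("deepseek-r1:7b")` returns False only after every attempt has failed, at which point checking disk space or the network is the next step.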

Problem 2: Response is extremely slow

Why it happens: The model is running on CPU because GPU was not detected, OR the model is too large for available memory and the system is swapping. How to diagnose:

# Check if GPU is being used (run while the model is loaded)
ollama ps  # The PROCESSOR column shows the CPU/GPU split, e.g. "100% GPU"
# Check memory usage while model is loading
# macOS: Activity Monitor → Memory
# Windows: Task Manager → Performance → Memory

Fixes:

Switch to a smaller model
Use a quantized version (q4_0 or q5_0 in the model tag)
On Linux, verify GPU drivers are loaded (nvidia-smi for NVIDIA)
On macOS with Apple Silicon, make sure you are not running under Rosetta (check Activity Monitor → Kind column)
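
To tell whether a fix actually helped, measure generation speed before and after. A non-streamed reply from the generate endpoint includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), which is enough for a tokens-per-second figure. The helper name below is illustrative.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from the timing fields of a non-streamed reply.

    eval_count is the number of generated tokens; eval_duration is
    reported in nanoseconds.
    """
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) / (duration_ns / 1e9)
```

Run the same prompt with the same model before and after a change and compare the two numbers. A single reading on its own says little.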

Problem 3: GPU not detected on Linux with NVIDIA

This is the most common Linux issue. Step-by-step diagnosis:

# Step 1: Check if NVIDIA driver is working
nvidia-smi
# If this fails, the driver is not installed correctly

# Step 2: Check if Ollama service can see the GPU
journalctl -u ollama --no-pager | grep -i gpu

# Step 3: Check Ollama logs for GPU discovery
journalctl -u ollama --no-pager --follow
# Then start a model and watch the logs

Common fix: The nvidia-container-toolkit package is missing (if running in Docker), or the driver version is too old. Ollama requires NVIDIA driver 525+ for CUDA 12 support.

Problem 4: GPU not detected on Linux with AMD

Ollama on Linux with AMD GPUs requires ROCm 7+. If your GPU is not detected:

# Check if ROCm is installed
rocminfo

# If rocminfo is not found, install ROCm 7 from AMD's official docs:
# https://rocm.docs.amd.com/projects/install-on-linux/en/latest/

If ROCm is installed but Ollama still uses CPU, check the Ollama logs:

journalctl -u ollama | grep -i "amd\|rocm\|gpu"

Problem 5: Chinese/non-English output quality is poor

Why it happens: Not all models are multilingual. Llama 3 (the base model) is primarily trained on English. For Chinese, Japanese, or Korean, use:

Qwen2.5 (best for Chinese)
DeepSeek-R1 (strong multilingual)
Gemma 2 with instruction tuning

Also make sure your prompt itself is in the target language; models tend to answer in the language of the prompt.

Problem 6: `connection refused` on `localhost:11434`

Why it happens: Ollama is not running, or it is running on a different port. Fix:

macOS/Windows: Check menu bar / system tray for the Ollama icon. If it is not there, relaunch the app.
Linux: sudo systemctl status ollama
If the service is running but the port is wrong: lsof -i :11434 (macOS/Linux) or netstat -ano | findstr 11434 (Windows)
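
In scripts it is often cleaner to check for the server up front than to handle a refused connection deep inside application code. The root URL of a running Ollama server answers with HTTP 200, so a probe can be as small as the sketch below. The function name and the injectable `opener` parameter are choices made for this example.

```python
import urllib.request


def is_ollama_up(base_url="http://localhost:11434", timeout=2.0,
                 opener=urllib.request.urlopen) -> bool:
    """True if the server answers on its root URL, False on any connection error."""
    try:
        with opener(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and urllib's URLError
        return False
```

A startup script can call `is_ollama_up()` and print a clear "start Ollama first" message instead of a stack trace.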

Problem 7: Running out of disk space

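Models are stored in ~/.ollama/models on macOS, C:\Users\<you>\.ollama\models on Windows, and /usr/share/ollama/.ollama/models on Linux when Ollama runs as the systemd service (set OLLAMA_MODELS to use a different location). A 7B model takes ~4–5 GB, a 14B model ~9 GB, and they add up.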

# See all models and their sizes
ollama list

# Remove models you no longer need
ollama rm <model-name>
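
To see how much disk the model store occupies in total, you can sum the file sizes under the models directory. This is a minimal sketch; `default_models_dir` assumes the per-user `~/.ollama/models` location, so pass a different path if your installation stores models elsewhere.

```python
from pathlib import Path


def default_models_dir() -> Path:
    """Per-user model store; an assumption that may not match every install."""
    return Path.home() / ".ollama" / "models"


def dir_size_bytes(root: Path) -> int:
    """Total size of all regular files below root (0 if root does not exist)."""
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def dir_size_gb(root: Path) -> float:
    """Same total, expressed in gigabytes (10^9 bytes)."""
    return dir_size_bytes(root) / 1e9
```

`dir_size_gb(default_models_dir())` gives one number to watch. Use `ollama list` to see which models account for it.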

Part 7: Making It Actually Useful — Integrations

Ollama by itself is a CLI tool. The real power comes from connecting it to tools you already use.

7.1 Open WebUI (the "local ChatGPT" experience)

The single most impactful integration. Open WebUI gives you a ChatGPT-like web interface that connects to Ollama:

docker run -d -p 8080:8080 --name open-webui \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:8080. You get:
- Chat interface with history
- Support for multiple models
- Document upload (RAG)
- Image understanding (if you have a vision model)
- User management (if you want to share it with a team)

7.2 Continue (VS Code integration)

Install the Continue extension in VS Code, then configure it to use Ollama:

{
  "models": [
    {
      "title": "Ollama DeepSeek",
      "provider": "ollama",
      "model": "deepseek-r1:7b"
    }
  ]
}

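Now you have local AI chat directly in VS Code. For inline code completion, Continue also needs an autocomplete model configured; see its documentation for the current setting name.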

7.3 LangChain Integration

from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate

llm = OllamaLLM(model="deepseek-r1:7b", temperature=0.3)

prompt = ChatPromptTemplate.from_template(
    "Write a {language} function that {task}"
)

chain = prompt | llm

result = chain.invoke({
    "language": "Python",
    "task": "reads a CSV file and prints the first 5 rows"
})
print(result)

Part 8: Serving Beyond localhost

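By default, Ollama binds to 127.0.0.1 (localhost only). To expose it to your local network, first stop any running Ollama instance (menu bar app, tray app, or systemd service) so the port is free, then start the server with OLLAMA_HOST set: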

# Linux/macOS
export OLLAMA_HOST=0.0.0.0:11434
ollama serve

# Windows PowerShell
$env:OLLAMA_HOST="0.0.0.0:11434"
ollama serve

Security warning: Only do this on trusted networks. There is no authentication on the Ollama API. Anyone on your network who can reach port 11434 can use your models and consume your compute. For production use, put Ollama behind a reverse proxy with authentication (e.g., nginx + OAuth).
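
A small guard in a startup script can make this exposure explicit instead of accidental. The sketch below inspects an OLLAMA_HOST value and reports whether it listens on every interface. The function name and parsing rules are assumptions for illustration and only cover the common `host`, `host:port`, and bracketed IPv6 forms.

```python
def binds_all_interfaces(ollama_host: str) -> bool:
    """True if an OLLAMA_HOST value listens on every network interface."""
    value = ollama_host.strip()
    if "://" in value:                 # drop an optional scheme such as http://
        value = value.split("://", 1)[1]
    if value.startswith("["):          # bracketed IPv6, e.g. [::]:11434
        host = value[1:].split("]", 1)[0]
    elif ":" in value:                 # host:port
        host = value.rsplit(":", 1)[0]
    else:                              # bare host
        host = value
    return host in ("0.0.0.0", "::")
```

A launcher could refuse to start, or print a warning, when `binds_all_interfaces(os.environ.get("OLLAMA_HOST", ""))` is true and no reverse proxy is configured.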

Part 9: Where to Go From Here

You now have a working local LLM setup. Here are the highest-leverage next steps:

Build a local RAG system — combine Ollama with a vector database (Chroma, Qdrant) to chat with your own documents
Set up a team instance — run Ollama on a server with a good GPU and let your team access it via Open WebUI
Experiment with model routing — use a small/fast model for simple queries and route complex queries to a larger model
Explore the Ollama ecosystem — tools like OpenClaw, n8n, and Continue all have Ollama integrations
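
Model routing can start very simply. The sketch below sends short, plain questions to a small model and anything long or containing a "hard task" keyword to a larger one. The threshold, the keyword list, and the choice of models are all assumptions for illustration; a real router would be tuned on your own traffic.

```python
SMALL_MODEL = "deepseek-r1:7b"
LARGE_MODEL = "deepseek-r1:14b"
# Hypothetical hints that a prompt needs the larger model
HARD_HINTS = ("prove", "refactor", "step by step", "debug")


def pick_model(prompt: str, long_prompt_chars: int = 600) -> str:
    """Route long or hard-looking prompts to the large model, the rest to the small one."""
    text = prompt.lower()
    if len(text) > long_prompt_chars or any(hint in text for hint in HARD_HINTS):
        return LARGE_MODEL
    return SMALL_MODEL
```

The returned name can be passed straight to the `model` field of any of the API calls shown earlier in this guide.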

Useful links:

Ollama docs: https://ollama.com/docs
Model library: https://ollama.com/library
GitHub: https://github.com/ollama/ollama
Discord: https://discord.gg/ollama
Open WebUI: https://openwebui.com

*This guide is part of the repohot.com open-source AI tools tutorial series. Discover more tools and tutorials at repohot.com.*
