# Running Local AI Models with Ollama and OpenClaw on Windows
Cloud AI APIs are powerful, but they come with costs, latency, and privacy considerations. Running models locally with Ollama gives you complete control — your data never leaves your machine, there are no per-token charges, and you can run inference 24/7 without rate limits.
This guide walks you through setting up Ollama on Windows and integrating it with OpenClaw for a hybrid local/cloud AI workflow.
## Why Run Models Locally?
| Benefit | Description |
|---|---|
| Privacy | Data never leaves your network |
| Cost | Zero per-token charges after hardware investment |
| Availability | No rate limits or API outages |
| Speed | Local inference can be faster for small models |
| Offline | Works without internet connection |
The trade-off: local models require capable hardware and are generally less powerful than frontier cloud models. The smart approach is hybrid — use local models for routine tasks and cloud models for complex reasoning.
## Hardware Requirements
Before choosing a model, understand your system's capabilities:
### RAM is King
For CPU inference, your available RAM determines max model size:
| System RAM | Max Model Size | Recommended Models |
|---|---|---|
| 8 GB | 3B parameters | llama3.2:3b, phi3:mini |
| 16 GB | 7-8B parameters | llama3.1:8b, mistral:7b |
| 32 GB | 13-14B parameters | qwen2.5:14b, codellama:13b |
| 64 GB+ | 30B+ parameters | llama3.1:70b (slow), mixtral:8x7b |
Rule of thumb: a quantized model file is roughly 0.5-1 GB per billion parameters (depending on quantization level). You need at least that much free RAM, plus overhead for the context window.
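As a sanity check, the rule of thumb can be sketched in Python. The 4-bit default reflects the common Q4 quantization Ollama ships; the fixed context overhead is an illustrative assumption, not a published figure:

```python
def estimated_ram_gb(params_billion: float, bits_per_weight: int = 4,
                     overhead_gb: float = 1.5) -> float:
    """Rough RAM needed to run a model: quantized weights plus an
    assumed fixed allowance for context and runtime overhead."""
    weights_gb = params_billion * bits_per_weight / 8  # 4-bit => ~0.5 GB per B params
    return weights_gb + overhead_gb

# An 8B model at 4-bit: ~4 GB of weights, ~5.5 GB total
print(round(estimated_ram_gb(8), 1))  # 5.5
```

This matches the table above: an 8B model is comfortable on 16 GB, tight on 8 GB.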
### GPU Acceleration
NVIDIA GPUs dramatically speed up inference:
| VRAM | Capability |
|---|---|
| 4 GB | 3B models fully in VRAM |
| 8 GB | 7B models, partial 13B |
| 12 GB | 13B models comfortably |
| 24 GB+ | 30B+ models, large context |
AMD GPUs work via ROCm on Linux but have limited Windows support. For Windows, NVIDIA is the practical choice.
### Check Your Specs

Open PowerShell and run:

```powershell
# Check RAM
(Get-CimInstance Win32_ComputerSystem).TotalPhysicalMemory / 1GB

# Check GPU (note: AdapterRAM is a 32-bit value, so it caps at 4 GB)
Get-CimInstance Win32_VideoController | Select-Object Name, AdapterRAM
```
## Step 1: Install Ollama on Windows
Download Ollama from ollama.com/download and run the installer.
After installation, Ollama runs as a background service. Verify it's working:
```
ollama --version
ollama list
```
## Step 2: Pull Your First Model
Choose a model based on your hardware. For a 16GB RAM system:
```
ollama pull llama3.2
```
Popular models for different use cases:
| Model | Size | Best For |
|---|---|---|
| llama3.2:3b | 2 GB | Fast responses, simple tasks |
| llama3.1:8b | 4.7 GB | General purpose, good balance |
| mistral:7b | 4.1 GB | Instruction following |
| codellama:7b | 3.8 GB | Code generation |
| qwen2.5:14b | 9 GB | Quality reasoning (needs 32GB RAM) |
| phi3:mini | 2.2 GB | Lightweight, surprisingly capable |
Test your model:
```
ollama run llama3.2 "Explain Docker in one sentence"
```
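The same model is also reachable over Ollama's HTTP API on port 11434, which is how OpenClaw will talk to it later. A minimal standard-library sketch against the `/api/generate` endpoint (`model`, `prompt`, and `stream` are the documented request fields):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "llama3.2") -> dict:
    # "stream": False asks Ollama for one complete JSON object
    # instead of a stream of partial responses
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "llama3.2",
             host: str = "http://localhost:11434") -> str:
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama instance:
# print(generate("Explain Docker in one sentence"))
```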
## Step 3: Enable Network Access
By default, Ollama only listens on localhost. To access it from other machines (like your OpenClaw server), you need to enable network binding. Note that Ollama's API has no built-in authentication, so only expose it on a network you trust.
### Option A: Environment Variable
Set the OLLAMA_HOST environment variable:
```powershell
# PowerShell (temporary)
$env:OLLAMA_HOST = "0.0.0.0:11434"

# Or set permanently via System Properties > Environment Variables
# Variable: OLLAMA_HOST
# Value: 0.0.0.0:11434
```
Restart Ollama after setting the variable.
### Option B: Windows Firewall

In addition to setting the bind address, allow inbound connections to Ollama through the firewall:

```powershell
New-NetFirewallRule -DisplayName "Ollama" -Direction Inbound -LocalPort 11434 -Protocol TCP -Action Allow
```
### Verify Network Access
From another machine on your network:
```
curl http://YOUR_WINDOWS_IP:11434/api/tags
```
You should see a JSON response listing your installed models.
## Step 4: Secure Remote Access with Tailscale
For accessing your Windows Ollama from anywhere (like a cloud VPS running OpenClaw), use Tailscale:
- Install Tailscale on your Windows machine: tailscale.com/download
- Sign in and connect to your tailnet
- Note your Tailscale IP (e.g., 100.x.x.x)
Now you can access Ollama securely from any device on your tailnet:
```
curl http://100.x.x.x:11434/api/tags
```
No port forwarding, no firewall holes, encrypted by default.
## Step 5: Configure OpenClaw
Add your Ollama instance as a provider in OpenClaw. Edit `~/.openclaw/openclaw.json`:
```json
{
  "auth": {
    "profiles": {
      "ollama:local": {
        "provider": "ollama",
        "baseUrl": "http://100.x.x.x:11434"
      }
    }
  }
}
```
Replace 100.x.x.x with your Tailscale IP (or local IP if on the same network).
## Step 6: Use Local Models for Specific Tasks
Here's where it gets powerful: assign local models to specific agents or tasks while keeping cloud models for complex work.
### Strategy: Task-Based Model Routing
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "github-copilot/claude-opus-4.5"
      }
    },
    "list": [
      {
        "id": "main",
        "comment": "Main orchestrator uses cloud model"
      },
      {
        "id": "researcher",
        "model": "ollama/llama3.2",
        "comment": "Research tasks use local model - free!"
      },
      {
        "id": "summarizer",
        "model": "ollama/llama3.2",
        "comment": "Summarization on local model"
      }
    ]
  }
}
```
### What to Run Locally vs Cloud
**Good for local models:**
- Summarization
- Simple Q&A
- Data extraction
- Code formatting
- Translation
- Research scanning
**Better on cloud models:**
- Complex reasoning
- Long-form writing
- Code architecture decisions
- Multi-step planning
- Tasks requiring latest knowledge
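A split like the lists above can be expressed as a simple lookup. This is an illustrative sketch, not OpenClaw configuration; the task names and the set membership are assumptions you would tune for your own workload:

```python
# Routine, high-volume tasks go to the free local model (illustrative set)
LOCAL_TASKS = {"summarization", "qa", "extraction",
               "formatting", "translation", "research"}

def route_model(task: str) -> str:
    """Return a provider/model string for a given task type."""
    if task in LOCAL_TASKS:
        return "ollama/llama3.2"  # free, private, runs on your own box
    return "github-copilot/claude-opus-4.5"  # complex work stays on the cloud

print(route_model("summarization"))  # ollama/llama3.2
print(route_model("planning"))      # github-copilot/claude-opus-4.5
```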
### Cost Savings Example
Imagine a content pipeline that:
- Scans RSS feeds for article ideas (research)
- Writes full articles (creative)
- Reviews and edits (analysis)
Route it like this:
- Step 1: Local llama3.2 — $0
- Step 2: Cloud claude-opus — ~$0.02-0.10
- Step 3: Local llama3.2 — $0
You've moved two of the three steps off the paid API, cutting API calls by roughly two-thirds while maintaining quality where it matters.
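The arithmetic is easy to check with a quick sketch. The 100 runs per month and the $0.06 midpoint per cloud call are illustrative assumptions; working in cents keeps the math exact:

```python
def monthly_api_cost_cents(runs: int, cloud_steps: int,
                           cents_per_call: int = 6) -> int:
    """API spend in cents: only cloud steps cost money, local steps are free."""
    return runs * cloud_steps * cents_per_call

all_cloud = monthly_api_cost_cents(100, cloud_steps=3)  # every step on the API
hybrid = monthly_api_cost_cents(100, cloud_steps=1)     # only the writing step
print(all_cloud, hybrid)  # 1800 600 (cents), i.e. $18 vs $6 per month
```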
## Choosing the Right Model
### By Task Type
| Task | Recommended | Why |
|---|---|---|
| Chat/QA | llama3.2:3b | Fast, good enough |
| Summarization | llama3.1:8b | Needs comprehension |
| Code | codellama:7b | Specialized training |
| Analysis | qwen2.5:14b | Better reasoning |
| Creative writing | Cloud model | Local models struggle |
### By Hardware

**Budget setup (16GB RAM, no GPU):**

- llama3.2:3b — snappy responses
- phi3:mini — alternative, good reasoning for size

**Mid-range (32GB RAM, RTX 3060):**

- llama3.1:8b — great all-rounder
- mistral:7b — excellent instruction following

**High-end (64GB RAM, RTX 4090):**

- qwen2.5:14b — approaches cloud quality
- mixtral:8x7b — MoE architecture, fast
## Performance Tuning

### Increase Context Length

For longer conversations, raise the `num_ctx` parameter from inside an `ollama run` session:

```
ollama run llama3.2
>>> /set parameter num_ctx 8192
```
### GPU Layers

Control how many layers run on the GPU vs the CPU with the `num_gpu` parameter (999 offloads every layer to the GPU; 0 forces CPU-only inference):

```
ollama run llama3.2
>>> /set parameter num_gpu 999
>>> /set parameter num_gpu 0
```
### Keep Model Loaded

By default, Ollama unloads models after 5 minutes. Keep frequently-used models loaded:

```powershell
# Set via environment variable
$env:OLLAMA_KEEP_ALIVE = "24h"
```
## Monitoring Your Setup
Check what's running:
```
ollama ps
```
Check model details:
```
ollama show llama3.2
```
Monitor from OpenClaw by querying the API:
```
curl http://YOUR_IP:11434/api/tags | jq '.models[] | {name, size}'
```
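If you'd rather monitor in code than with jq, the `/api/tags` response is easy to summarize. The `size` field is in bytes; the sample payload below is shaped like a real response but its numbers are illustrative:

```python
def summarize_models(tags_response: dict) -> list[str]:
    """Turn an Ollama /api/tags payload into 'name: N.N GB' lines."""
    return [f"{m['name']}: {m['size'] / 1e9:.1f} GB"
            for m in tags_response.get("models", [])]

# Sample shaped like a real /api/tags response (sizes are illustrative)
sample = {"models": [{"name": "llama3.2:latest", "size": 2019393189},
                     {"name": "llama3.1:8b", "size": 4700000000}]}
print(summarize_models(sample))  # ['llama3.2:latest: 2.0 GB', 'llama3.1:8b: 4.7 GB']
```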
## Troubleshooting

### "Connection refused"

- Verify Ollama is running: `ollama ps`
- Check that `OLLAMA_HOST` is set to `0.0.0.0:11434`
- Verify the firewall allows port 11434
### Slow inference
- Check if model fits in RAM (avoid swap)
- Ensure GPU is being utilized: watch Task Manager during inference
- Try a smaller model or increase GPU layers
### Out of memory

- Close other applications
- Use a smaller model
- Reduce the context length (e.g., set `num_ctx` to 2048)
## Conclusion
Running Ollama on Windows gives you a private AI inference engine with no per-token charges. Combined with OpenClaw's agent system, you can build sophisticated workflows that:
- Use local models for routine, high-volume tasks
- Reserve cloud models for complex reasoning
- Keep sensitive data on your own hardware
- Eliminate API costs for appropriate workloads
The hybrid approach — local + cloud — gives you the best of both worlds. Start with a small model like llama3.2, measure what works, then scale up your local capabilities as needed.
Your AI infrastructure, your rules.
Explore more models at ollama.com/library. For OpenClaw configuration help, visit docs.openclaw.ai.