# Running Local AI Models with Ollama and OpenClaw on Windows
Cloud AI APIs are powerful, but they come with costs, latency, and privacy considerations. Running models locally with Ollama gives you complete control — your data never leaves your machine, there are no per-token charges, and you can run inference 24/7 without rate limits.
This guide walks you through setting up Ollama on Windows and integrating it with OpenClaw for a hybrid local/cloud AI workflow.
## Why Run Models Locally?
| Benefit | Description |
|---|---|
| Privacy | Data never leaves your network |
| Cost | Zero per-token charges after hardware investment |
| Availability | No rate limits or API outages |
| Speed | Local inference can be faster for small models |
| Offline | Works without internet connection |
The trade-off: local models require capable hardware and are generally less powerful than frontier cloud models. The smart approach is hybrid — use local models for routine tasks and cloud models for complex reasoning.
## Hardware Requirements
Before choosing a model, understand your system's capabilities:
### RAM is King
For CPU inference, your available RAM determines max model size:
| System RAM | Max Model Size | Recommended Models |
|---|---|---|
| 8 GB | 3B parameters | llama3.2:3b, phi3:mini |
| 16 GB | 7-8B parameters | llama3.1:8b, mistral:7b |
| 32 GB | 13-14B parameters | qwen2.5:14b, codellama:13b |
| 64 GB+ | 30B+ parameters | llama3.1:70b (slow), mixtral:8x7b |
Rule of thumb: a quantized model file is roughly 0.5-1 GB per billion parameters (depending on quantization level). You need at least that much free RAM, plus overhead for the context window.
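As a sanity check, the rule of thumb can be sketched in Python. The 4-bit default reflects the common Q4 quantization Ollama ships; the fixed context overhead is an illustrative assumption, not a published figure:

```python
def estimated_ram_gb(params_billion: float, bits_per_weight: int = 4,
                     overhead_gb: float = 1.5) -> float:
    """Rough RAM needed to run a model: quantized weights plus an
    assumed fixed allowance for context and runtime overhead."""
    weights_gb = params_billion * bits_per_weight / 8  # 4-bit => ~0.5 GB per B params
    return weights_gb + overhead_gb

# An 8B model at 4-bit: ~4 GB of weights, ~5.5 GB total
print(round(estimated_ram_gb(8), 1))  # 5.5
```

This matches the table above: an 8B model is comfortable on 16 GB, tight on 8 GB.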
### GPU Acceleration
NVIDIA GPUs dramatically speed up inference:
| VRAM | Capability |
|---|---|
| 4 GB | 3B models fully in VRAM |
| 8 GB | 7B models, partial 13B |
| 12 GB | 13B models comfortably |
| 24 GB+ | 30B+ models, large context |
AMD GPUs work via ROCm on Linux but have limited Windows support. For Windows, NVIDIA is the practical choice.
### Check Your Specs

Open PowerShell and run:

```powershell
# Check RAM
(Get-CimInstance Win32_ComputerSystem).TotalPhysicalMemory / 1GB

# Check GPU (note: AdapterRAM is a 32-bit value, so it caps at 4 GB)
Get-CimInstance Win32_VideoController | Select-Object Name, AdapterRAM
```
## Step 1: Install Ollama on Windows
Download Ollama from ollama.com/download and run the installer.
After installation, Ollama runs as a background service. Verify it's working:
```
ollama --version
ollama list
```
## Step 2: Pull Your First Model
Choose a model based on your hardware. For a 16GB RAM system:
```
ollama pull llama3.2
```
Popular models for different use cases:
| Model | Size | Best For |
|---|---|---|
| llama3.2:3b | 2 GB | Fast responses, simple tasks |
| llama3.1:8b | 4.7 GB | General purpose, good balance |
| mistral:7b | 4.1 GB | Instruction following |
| codellama:7b | 3.8 GB | Code generation |
| qwen2.5:14b | 9 GB | Quality reasoning (needs 32GB RAM) |
| phi3:mini | 2.2 GB | Lightweight, surprisingly capable |
Test your model:
```
ollama run llama3.2 "Explain Docker in one sentence"
```
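The same model is also reachable over Ollama's HTTP API on port 11434, which is how OpenClaw will talk to it later. A minimal standard-library sketch against the `/api/generate` endpoint (`model`, `prompt`, and `stream` are the documented request fields):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "llama3.2") -> dict:
    # "stream": False asks Ollama for one complete JSON object
    # instead of a stream of partial responses
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "llama3.2",
             host: str = "http://localhost:11434") -> str:
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama instance:
# print(generate("Explain Docker in one sentence"))
```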
## Step 3: Enable Network Access
By default, Ollama only listens on localhost. To access it from other machines (like your OpenClaw server), you need to enable network binding. Note that Ollama's API has no built-in authentication, so only expose it on a network you trust.
### Option A: Environment Variable
Set the OLLAMA_HOST environment variable:
```powershell
# PowerShell (temporary)
$env:OLLAMA_HOST = "0.0.0.0:11434"

# Or set permanently via System Properties > Environment Variables
# Variable: OLLAMA_HOST
# Value: 0.0.0.0:11434
```
Restart Ollama after setting the variable.
### Option B: Windows Firewall

In addition to setting the bind address, allow inbound connections to Ollama through the firewall:

```powershell
New-NetFirewallRule -DisplayName "Ollama" -Direction Inbound -LocalPort 11434 -Protocol TCP -Action Allow
```
### Verify Network Access
From another machine on your network:
```
curl http://YOUR_WINDOWS_IP:11434/api/tags
```
You should see a JSON response listing your installed models.
## Step 4: Secure Remote Access with Tailscale
For accessing your Windows Ollama from anywhere (like a cloud VPS running OpenClaw), use Tailscale:
- Install Tailscale on your Windows machine: tailscale.com/download
- Sign in and connect to your tailnet
- Note your Tailscale IP (e.g., 100.x.x.x)
Now you can access Ollama securely from any device on your tailnet:
```
curl http://100.x.x.x:11434/api/tags
```
No port forwarding, no firewall holes, encrypted by default.
## Step 5: Configure OpenClaw
Add your Ollama instance as a provider in OpenClaw. Edit `~/.openclaw/openclaw.json`:
```json
{
  "auth": {
    "profiles": {
      "ollama:local": {
        "provider": "ollama",
        "baseUrl": "http://100.x.x.x:11434"
      }
    }
  }
}
```
Replace 100.x.x.x with your Tailscale IP (or local IP if on the same network).
## Step 6: Use Local Models for Specific Tasks
Here's where it gets powerful: assign local models to specific agents or tasks while keeping cloud models for complex work.
### Strategy: Task-Based Model Routing
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "github-copilot/claude-opus-4.5"
      }
    },
    "list": [
      {
        "id": "main",
        "comment": "Main orchestrator uses cloud model"
      },
      {
        "id": "researcher",
        "model": "ollama/llama3.2",
        "comment": "Research tasks use local model - free!"
      },
      {
        "id": "summarizer",
        "model": "ollama/llama3.2",
        "comment": "Summarization on local model"
      }
    ]
  }
}
```
### What to Run Locally vs Cloud
**Good for local models:**
- Summarization
- Simple Q&A
- Data extraction
- Code formatting
- Translation
- Research scanning
**Better on cloud models:**
- Complex reasoning
- Long-form writing
- Code architecture decisions
- Multi-step planning
- Tasks requiring latest knowledge
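A split like the lists above can be expressed as a simple lookup. This is an illustrative sketch, not OpenClaw configuration; the task names and the set membership are assumptions you would tune for your own workload:

```python
# Routine, high-volume tasks go to the free local model (illustrative set)
LOCAL_TASKS = {"summarization", "qa", "extraction",
               "formatting", "translation", "research"}

def route_model(task: str) -> str:
    """Return a provider/model string for a given task type."""
    if task in LOCAL_TASKS:
        return "ollama/llama3.2"  # free, private, runs on your own box
    return "github-copilot/claude-opus-4.5"  # complex work stays on the cloud

print(route_model("summarization"))  # ollama/llama3.2
print(route_model("planning"))      # github-copilot/claude-opus-4.5
```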
### Cost Savings Example
Imagine a content pipeline that:
- Scans RSS feeds for article ideas (research)
- Writes full articles (creative)
- Reviews and edits (analysis)
Route it like this:
- Step 1: Local llama3.2 — $0
- Step 2: Cloud claude-opus — ~$0.02-0.10
- Step 3: Local llama3.2 — $0
You've moved two of the three steps off the paid API, cutting API calls by roughly two-thirds while maintaining quality where it matters.
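The arithmetic is easy to check with a quick sketch. The 100 runs per month and the $0.06 midpoint per cloud call are illustrative assumptions; working in cents keeps the math exact:

```python
def monthly_api_cost_cents(runs: int, cloud_steps: int,
                           cents_per_call: int = 6) -> int:
    """API spend in cents: only cloud steps cost money, local steps are free."""
    return runs * cloud_steps * cents_per_call

all_cloud = monthly_api_cost_cents(100, cloud_steps=3)  # every step on the API
hybrid = monthly_api_cost_cents(100, cloud_steps=1)     # only the writing step
print(all_cloud, hybrid)  # 1800 600 (cents), i.e. $18 vs $6 per month
```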
## Choosing the Right Model
### By Task Type
| Task | Recommended | Why |
|---|---|---|
| Chat/QA | llama3.2:3b | Fast, good enough |
| Summarization | llama3.1:8b | Needs comprehension |
| Code | codellama:7b | Specialized training |
| Analysis | qwen2.5:14b | Better reasoning |
| Creative writing | Cloud model | Local models struggle |
### By Hardware

**Budget setup (16GB RAM, no GPU):**

- llama3.2:3b — snappy responses
- phi3:mini — alternative, good reasoning for size

**Mid-range (32GB RAM, RTX 3060):**

- llama3.1:8b — great all-rounder
- mistral:7b — excellent instruction following

**High-end (64GB RAM, RTX 4090):**

- qwen2.5:14b — approaches cloud quality
- mixtral:8x7b — MoE architecture, fast
## Performance Tuning

### Increase Context Length

For longer conversations, raise the `num_ctx` parameter from inside an `ollama run` session:

```
ollama run llama3.2
>>> /set parameter num_ctx 8192
```
### GPU Layers

Control how many layers run on the GPU vs the CPU with the `num_gpu` parameter (999 offloads every layer to the GPU; 0 forces CPU-only inference):

```
ollama run llama3.2
>>> /set parameter num_gpu 999
>>> /set parameter num_gpu 0
```
### Keep Model Loaded

By default, Ollama unloads models after 5 minutes. Keep frequently-used models loaded:

```powershell
# Set via environment variable
$env:OLLAMA_KEEP_ALIVE = "24h"
```
## Monitoring Your Setup
Check what's running:
```
ollama ps
```
Check model details:
```
ollama show llama3.2
```
Monitor from OpenClaw by querying the API:
```
curl http://YOUR_IP:11434/api/tags | jq '.models[] | {name, size}'
```
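If you'd rather monitor in code than with jq, the `/api/tags` response is easy to summarize. The `size` field is in bytes; the sample payload below is shaped like a real response but its numbers are illustrative:

```python
def summarize_models(tags_response: dict) -> list[str]:
    """Turn an Ollama /api/tags payload into 'name: N.N GB' lines."""
    return [f"{m['name']}: {m['size'] / 1e9:.1f} GB"
            for m in tags_response.get("models", [])]

# Sample shaped like a real /api/tags response (sizes are illustrative)
sample = {"models": [{"name": "llama3.2:latest", "size": 2019393189},
                     {"name": "llama3.1:8b", "size": 4700000000}]}
print(summarize_models(sample))  # ['llama3.2:latest: 2.0 GB', 'llama3.1:8b: 4.7 GB']
```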
## Troubleshooting

### "Connection refused"

- Verify Ollama is running: `ollama ps`
- Check that `OLLAMA_HOST` is set to `0.0.0.0:11434`
- Verify the firewall allows port 11434
### Slow inference
- Check if model fits in RAM (avoid swap)
- Ensure GPU is being utilized: watch Task Manager during inference
- Try a smaller model or increase GPU layers
### Out of memory

- Close other applications
- Use a smaller model
- Reduce the context length (e.g., set `num_ctx` to 2048)
## Conclusion
Running Ollama on Windows gives you a private AI inference engine with no per-token charges. Combined with OpenClaw's agent system, you can build sophisticated workflows that:
- Use local models for routine, high-volume tasks
- Reserve cloud models for complex reasoning
- Keep sensitive data on your own hardware
- Eliminate API costs for appropriate workloads
The hybrid approach — local + cloud — gives you the best of both worlds. Start with a small model like llama3.2, measure what works, then scale up your local capabilities as needed.
Your AI infrastructure, your rules.
Explore more models at ollama.com/library. For OpenClaw configuration help, visit docs.openclaw.ai.