Part 01 of 15 · Hands-On Setup · Windows & Linux · Step by Step · API Code Included · FreeLearning365
From Zero to Private AI System — Complete Series
Install Ollama & Run Your First Local AI Model: Complete Hands-On Guide
From downloading Ollama to making your first live API call — every command, every config file, every error fix, every model explained. Windows and Linux. GPU and CPU. This is your complete Day 1 local AI setup guide.
By @FreeLearning365 · Part 01 — Local AI Setup · Read time: ~35 min · Skill level: Beginner to Intermediate · Windows 10/11 & Ubuntu 20.04+
"In the next 35 minutes, you will go from zero local AI to a fully running language model on your own hardware — responding to your questions, processing your company data, and doing it all without a single byte leaving your network. This is the moment your company's private AI story begins."
What this post covers
- What Ollama is and how it works
- System requirements — CPU, RAM, GPU
- Install on Windows 10/11 (full walkthrough)
- Install on Ubuntu/Linux (full walkthrough)
- Every major model — detailed comparison
- Quantization explained simply
- Running your first model interactively
- Ollama CLI — all commands explained
- REST API deep dive with real examples
- Multi-turn conversation via API
- Streaming responses explained
- Environment variables — full config reference
- Network access — exposing to your LAN
- Troubleshooting — 10 common errors fixed
- Do's, Don'ts, Limitations
Section 1 — What is Ollama and how does it actually work?
Before installing anything, you need a mental model of what Ollama does — because this understanding will save you hours of confusion later.
Ollama is an open-source application runtime specifically designed to make running large language models on local hardware as simple as possible. Think of it as a service manager, model downloader, memory manager, and API server — all packaged into a single tool.
When you install Ollama, it runs as a background service on your machine. It listens on port 11434 by default. When you run a model, Ollama loads the model weights into RAM (or VRAM if you have a GPU), keeps it resident in memory for fast subsequent queries, and exposes a clean REST API that any application — your browser, your ERP system, your Python script, your C# service — can call.
The Ollama architecture in plain language
Your application sends an HTTP POST request to http://localhost:11434/api/chat with your message. Ollama receives the request, passes it to the loaded model, streams the response back token by token, and your application displays it. The model runs entirely in your computer's memory. No internet. No cloud. No API key. No cost per call.
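That round trip can be sketched in a few lines of Python. The helper below is purely illustrative (the endpoint and field names match the API we cover in detail in Section 9); it only builds the JSON body that /api/chat expects:

```python
import json

# Default Ollama endpoint (assumes a local install on the default port)
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

def build_chat_payload(model: str, user_message: str) -> str:
    """Build the JSON body for a POST to /api/chat."""
    payload = {
        "model": model,
        "stream": False,  # one complete response instead of a token stream
        "messages": [{"role": "user", "content": user_message}],
    }
    return json.dumps(payload)

print(build_chat_payload("llama3.1:8b", "Hello"))
```

Any HTTP client — curl, Python requests, C# HttpClient — can POST this body to that URL; Section 9 walks through the real calls.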
Ollama's internal component map
// What Ollama manages for you automatically:
Model Registry → Downloads models from ollama.com/library
Model Storage → Stores .gguf model files on your disk
Runtime Engine → Uses llama.cpp under the hood for inference
Memory Manager → Loads model into RAM or VRAM automatically
GPU Acceleration → Detects NVIDIA/AMD/Apple Silicon automatically
REST API Server → Exposes HTTP endpoints on port 11434
Model Keep-Alive → Keeps model warm in memory between requests
Concurrent Requests → Queues multiple requests to the loaded model
The underlying engine Ollama uses is llama.cpp — a highly optimized C++ inference runtime that can run quantized LLM models efficiently on both CPU and GPU. Ollama wraps llama.cpp with a user-friendly interface, model management, and a standardized API. You never need to touch llama.cpp directly.
Section 2 — System requirements: What hardware do you actually need?
One of the most common questions is "can my machine run this?" The answer depends on which model you want to run. Here is the honest breakdown.
Minimum (CPU only)
Basic workstation / old laptop
RAM: 8GB (runs 2B–3B models only)
CPU: Any Intel i5/i7 or AMD Ryzen 5+
Storage: 20GB free (model files are large)
GPU: Not required
OS: Windows 10, Ubuntu 20.04+
Speed: Slow (2–8 tokens/sec)
Best model: Gemma 2 2B, Phi-3 3.8B
Recommended (with GPU)
Developer workstation / company server
RAM: 16–32GB system RAM
CPU: Intel i7/i9 or Ryzen 7/9
Storage: 100GB free SSD
GPU: NVIDIA RTX 3060 (12GB VRAM)+
OS: Windows 10/11, Ubuntu 22.04
Speed: Fast (30–60 tokens/sec)
Best model: LLaMA 3.1 8B, Mistral 7B
Production server
Company AI server (10–50 users)
RAM: 32–64GB system RAM
CPU: Xeon or Threadripper
Storage: 500GB+ NVMe SSD
GPU: NVIDIA RTX 4090 (24GB VRAM)
OS: Ubuntu 22.04 LTS (recommended)
Speed: Very fast (80–120 tokens/sec)
Best model: LLaMA 3.1 8B Q8, Mistral 7B
Important — RAM rule for models
A model needs roughly 1GB of RAM per 1 billion parameters at 8-bit quantization. So a 7B model needs ~7GB VRAM or RAM. A 13B model needs ~13GB. Always leave 20% overhead for the OS. If your VRAM cannot fit the model, it spills into system RAM and becomes very slow. Match your model size to your hardware — we cover this in detail in the model selection section below.
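The rule above translates directly into arithmetic. A rough sanity-check sketch (illustrative figures only; a real deployment also needs room for the KV cache):

```python
def estimate_model_gb(params_billions: float, bits: int = 8) -> float:
    """Approximate weight memory: parameters x (bits / 8) bytes, in GB."""
    return params_billions * bits / 8

def fits_in_vram(params_billions: float, vram_gb: float, bits: int = 8) -> bool:
    """Apply the ~20% overhead rule from the text above."""
    return estimate_model_gb(params_billions, bits) * 1.2 <= vram_gb

print(estimate_model_gb(7, 8))   # 7.0 GB for a 7B model at 8-bit
print(estimate_model_gb(7, 4))   # 3.5 GB at 4-bit
print(fits_in_vram(7, 12, 8))    # True on a 12GB RTX 3060
```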
Checking your GPU before installation
Windows — check GPU
PowerShell or Command Prompt
nvidia-smi
# If you see your GPU listed with a VRAM amount,
# Ollama will detect and use it automatically.
# Example output:
#   | NVIDIA GeForce RTX 3090
#   | Memory-Usage: 0MiB / 24576MiB
# No NVIDIA? Check for AMD:
wmic path win32_VideoController get name
# Ollama's AMD (ROCm) support is most mature on Linux;
# on Windows, many AMD GPUs fall back to CPU mode
Linux — check GPU
Terminal
nvidia-smi
-- Check NVIDIA GPU and VRAM
lspci | grep -i vga
-- List all display adapters
free -h
-- Check total system RAM
-- Example:
-- total used free
-- Mem: 31Gi 4Gi 26Gi
df -h /
-- Check available disk space
Section 3 — Install Ollama on Windows 10/11: Full walkthrough
Windows installation is the most straightforward — a single installer handles everything including the background service, PATH configuration, and API server startup.
Step 1 — Download the Windows installer
Open your browser and navigate to ollama.com. Click the "Download for Windows" button. This downloads OllamaSetup.exe — typically 50–80MB. Do not run it yet.
Before running the installer — check this
If you have an NVIDIA GPU, install the latest NVIDIA drivers FIRST (from nvidia.com/drivers). Ollama auto-detects CUDA — but only if proper drivers are present at install time. Skipping this means CPU-only mode until you reinstall.
Step 2 — Run the installer
Installation process — what happens automatically
// OllamaSetup.exe does ALL of the following:
1. Installs Ollama binary to:
C:\Users\[YourName]\AppData\Local\Programs\Ollama\
2. Adds Ollama to your system PATH environment variable
3. Creates and starts a Windows Service:
"Ollama" — runs on system startup automatically
4. Creates model storage directory:
C:\Users\[YourName]\.ollama\models\
5. Opens firewall rule for port 11434 (localhost only)
6. Starts the Ollama background service immediately
// After installer completes, verify installation:
ollama --version
// Expected output: ollama version 0.x.x
ollama list
// Expected output: empty list (no models yet)
// NAME ID SIZE MODIFIED
Step 3 — Verify the Ollama service is running
PowerShell — verify service status
// Method 1 — Check via PowerShell
Get-Service -Name "Ollama"
// Expected: Status = Running
// Method 2 — Check via browser
// Open: http://localhost:11434
// Expected page text: "Ollama is running"
// Method 3 — Check via curl (PowerShell)
curl http://localhost:11434
// Expected: Ollama is running
// If service is not running, start it manually:
Start-Service Ollama
// Or from Start Menu → search "Ollama" → click the app
Step 4 — Configure model storage location (optional but recommended)
By default, Ollama stores model files in your user profile directory. On a company server with a dedicated data drive, you should change this to avoid filling your system drive with large model files (a 7B model is ~4–8GB).
Windows — change model storage path
// Method: Set environment variable OLLAMA_MODELS
// Go to: System Properties → Advanced → Environment Variables
// Add NEW System Variable:
Variable name: OLLAMA_MODELS
Variable value: D:\AI\OllamaModels
// Then restart the Ollama service:
Restart-Service Ollama
// Verify the new path is active:
ollama list
// Models will now download to D:\AI\OllamaModels\
Section 4 — Install Ollama on Ubuntu/Linux: Full walkthrough
Linux installation is even simpler — a single curl command handles everything. Ubuntu 20.04, 22.04, and 24.04 are all fully supported. The installer also handles NVIDIA CUDA detection automatically on Linux.
Step 1 — Prerequisites check
Terminal — verify prerequisites
# Check Ubuntu version
lsb_release -a
# Expected: Ubuntu 20.04 / 22.04 / 24.04
# Check available disk space (need at least 20GB)
df -h /
# Example: /dev/sda1 500G 120G 380G 24% /
# Check if curl is installed
curl --version
# If not installed: sudo apt install curl -y
# For NVIDIA GPU users — check driver status:
nvidia-smi
# If command not found, install NVIDIA drivers:
# sudo apt install nvidia-driver-535 -y
# sudo reboot
# Then run nvidia-smi again to verify
Step 2 — Install Ollama (one command)
Terminal — install Ollama
# The official one-line installer:
curl -fsSL https://ollama.com/install.sh | sh
# What this script does:
# 1. Detects your OS and CPU architecture
# 2. Downloads the correct Ollama binary
# 3. Installs it to /usr/local/bin/ollama
# 4. Creates a systemd service: ollama.service
# 5. Starts the service automatically
# 6. Creates user: ollama (runs the service)
# 7. Creates model directory: /usr/share/ollama/.ollama/models/
# 8. Detects NVIDIA GPU and configures CUDA automatically
# Expected output (with NVIDIA GPU):
>>> Downloading ollama...
>>> Installing ollama to /usr/local/bin...
>>> NVIDIA GPU driver detected. Using GPU mode.
>>> Creating ollama user...
>>> Adding ollama user to 'ollama' group...
>>> Adding current user to 'ollama' group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
Step 3 — Verify installation and service
Terminal — verify everything is working
# Check version
ollama --version
# ollama version 0.x.x
# Check systemd service status
systemctl status ollama
# Expected:
# ● ollama.service - Ollama Service
# Loaded: loaded (/etc/systemd/system/ollama.service)
# Active: active (running) since ...
# Main PID: 12345 (ollama)
# Test the API endpoint
curl http://localhost:11434
# Expected: Ollama is running
# Check if GPU is being used by Ollama
ollama ps
# (No models loaded yet — will show empty table)
Step 4 — Configure model storage on Linux
Linux — customize Ollama configuration
# Edit the systemd service to add environment variables
sudo systemctl edit ollama
# This opens a drop-in config file. Add these lines:
[Service]
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=5m"
Environment="OLLAMA_NUM_PARALLEL=2"
# Save and close, then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Verify config was applied:
systemctl show ollama | grep Environment
# Should list your environment variables
# Create the models directory with correct permissions:
sudo mkdir -p /data/ollama/models
sudo chown -R ollama:ollama /data/ollama/
Section 5 — Understanding every major available model
This is the section that saves you hours of trial and error. Choosing the wrong model for your hardware or use case is the most common beginner mistake. Read this carefully before pulling any model.
What are model parameters?
Parameters are the numerical weights inside the model — the values learned during training. More parameters generally means more capability, but also more memory needed and slower inference. A "7B model" has 7 billion parameters. The relationship is not purely linear — a well-trained 7B model (like Mistral 7B) can outperform a poorly trained 13B model on many tasks.
Quantization — how to run large models on small hardware
Full precision (FP32) stores each parameter as a 32-bit float: 4 bytes per parameter, so a 7B model would need roughly 28GB for weights alone. Quantization reduces precision to 4-bit or 8-bit, drastically shrinking memory requirements with minimal quality loss. This is what makes running a 7B model on 8GB VRAM possible.
Q2_K
~2.7GB for 7B
Lowest quality. Fastest. Avoid for production.
Q4_K_M
~4.1GB for 7B
Best balance. Default choice. Recommended.
Q5_K_M
~5.0GB for 7B
Very good quality. Use if VRAM allows.
Q8_0
~7.7GB for 7B
Near-full quality. Use for production servers.
Which quantization to choose?
For most company deployments: use Q4_K_M by default — it is what Ollama downloads unless you specify otherwise. If you have 24GB VRAM (RTX 4090 or A6000), use Q8_0 for noticeably better output quality. Never use Q2 or Q3 in production — the quality degradation is significant.
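One way to make that advice concrete is to pick the highest-quality level whose weights fit in VRAM with headroom. The sizes come from the comparison above; the helper itself is a hypothetical sketch, not an Ollama feature:

```python
# Approximate file sizes for a 7B model at each quantization level
# (figures from the table above; exact sizes vary per model)
QUANT_SIZES_7B_GB = {"Q2_K": 2.7, "Q4_K_M": 4.1, "Q5_K_M": 5.0, "Q8_0": 7.7}

def best_quant(vram_gb: float, headroom: float = 1.2) -> str:
    """Highest-quality level that fits with ~20% headroom.
    Q2/Q3 are excluded per the advice above (too much quality loss)."""
    for level in ("Q8_0", "Q5_K_M", "Q4_K_M"):
        if QUANT_SIZES_7B_GB[level] * headroom <= vram_gb:
            return level
    return "Q4_K_M"  # fall back to the default and let it spill to system RAM

print(best_quant(24))  # Q8_0 on an RTX 4090
print(best_quant(8))   # Q5_K_M on an 8GB card
```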
Complete model reference — every major model explained
LLaMA 3.1 8B — Best overall
Creator: Meta
Pull cmd: ollama pull llama3.1:8b
Size (Q4): ~4.9GB
RAM needed: 8GB VRAM or 16GB RAM
Context: 128K tokens
The sweet spot for company deployments. Excellent reasoning, multilingual support, instruction following, and long document handling. 128K context means it can process very large documents in one pass. Strong Bengali capability. The default recommendation for any company with 16GB+ RAM.
Mistral 7B Instruct — Top alternative
Creator: Mistral AI
Pull cmd: ollama pull mistral:7b-instruct
Size (Q4): ~4.1GB
RAM needed: 8GB VRAM or 12GB RAM
Context: 32K tokens
Extremely capable for its size. Best-in-class instruction following. Excellent for structured output, document summarization, and Q&A. Slightly faster than LLaMA 3.1 on equivalent hardware. Great for companies with 12–16GB RAM systems. Slightly weaker Bengali support than LLaMA 3.1.
Gemma 2 2B / 9B — Low resource
Creator: Google
Pull cmd: ollama pull gemma2:2b
Size (Q4): ~1.6GB (2B) / ~5.5GB (9B)
RAM needed: 4GB (2B) / 8GB (9B)
Context: 8K tokens
The 2B version is ideal for very low-spec hardware or edge devices. Surprisingly capable for basic Q&A and summarization. The 9B version is excellent and comparable to LLaMA 3.1 8B in many tasks. Best starting model for testing on minimal hardware.
Phi-3.5 / Phi-4 — Small but powerful
Creator: Microsoft
Pull cmd: ollama pull phi3.5 / ollama pull phi4
Size (Q4): ~2.2GB (3.8B) / ~9GB (14B)
RAM needed: 6GB (3.8B) / 16GB (14B)
Context: 128K tokens (Phi-3.5)
Microsoft's research triumph. Phi-3.5 Mini (3.8B) outperforms many 7B models on reasoning and coding tasks. Excellent SQL generation capability — ideal for ERP integrations. Phi-4 (14B) is one of the best models available at any size for analytical tasks.
Qwen2.5 7B / 14B — Best multilingual
Creator: Alibaba
Pull cmd: ollama pull qwen2.5:7b
Size (Q4): ~4.4GB (7B) / ~9GB (14B)
RAM needed: 8GB (7B) / 16GB (14B)
Context: 128K tokens
Best Bengali and multilingual support among all local models. Trained on significantly more multilingual data than Meta or Google models. If your team primarily communicates in Bengali, Qwen2.5 7B is the model to start with. Also excellent at coding and structured data tasks.
DeepSeek-R1 7B / 14B — Best reasoning
Creator: DeepSeek AI
Pull cmd: ollama pull deepseek-r1:7b
Size (Q4): ~4.7GB (7B) / ~9.3GB (14B)
RAM needed: 8GB (7B) / 16GB (14B)
Context: 128K tokens
Specifically trained for chain-of-thought reasoning. Shows its "thinking process" before giving an answer — excellent for complex analysis, financial reasoning, and multi-step problem solving. Responses are longer (due to reasoning steps) but noticeably more accurate on complex tasks.
CodeLlama 7B / 13B — Best for code
Creator: Meta
Pull cmd: ollama pull codellama:7b
Size (Q4): ~3.8GB (7B)
RAM needed: 8GB
Context: 16K tokens
Purpose-built for code generation. Excellent for generating SQL stored procedures, C# classes, ASP.NET controllers, and debugging code. Fine-tuned specifically on code from many languages. Pairs perfectly with your ERP development workflow for generating boilerplate and reviewing logic.
Nomic Embed Text — Embeddings only
Creator: Nomic AI
Pull cmd: ollama pull nomic-embed-text
Size: ~274MB
RAM needed: <1GB
Use case: RAG pipeline only
Not a chat model — purely for generating vector embeddings used in RAG pipelines (Part 05). Converts text into numerical vectors for semantic search. Required in Part 05 when we build document intelligence. Pull this now — you will need it later. Tiny and fast.
Section 6 — Pulling and running your first model
Pulling a model (downloading)
Terminal / PowerShell — pull your first model
# For low-spec machines (8GB RAM, no GPU):
ollama pull gemma2:2b
# pulling manifest
# pulling 879c7f77f9d6... 100% ████████████ 1.6 GB
# pulling 43070e2d4e53... 100% ████████████ 11 KB
# success
# For standard company workstation (16GB RAM, GPU):
ollama pull llama3.1:8b
# pulling manifest
# pulling 62fbfd9ed093... 100% ████████████ 4.9 GB
# success
# For Bengali-focused deployments:
ollama pull qwen2.5:7b
# Pull the embedding model (needed for Part 05):
ollama pull nomic-embed-text
# Check what you have downloaded:
ollama list
# NAME ID SIZE MODIFIED
# llama3.1:8b 42182419e950 4.9 GB 2 minutes ago
# gemma2:2b ff02c3702f32 1.6 GB 5 minutes ago
# nomic-embed-text:latest 0a109f422b47 274 MB 1 minute ago
Running a model interactively
Interactive terminal session — your first AI conversation
ollama run llama3.1:8b
>>> Send a message (/? for help)
>>> Summarize what Ollama is in 3 bullet points
• Ollama is a local AI runtime that lets you run large language
models directly on your own hardware without cloud dependency.
• It manages model downloads, GPU acceleration, and exposes a
REST API at localhost:11434 for application integration.
• It supports models like LLaMA, Mistral, Gemma, and Phi,
making private AI accessible without technical complexity.
>>> আমাদের কোম্পানির ডেটা নিরাপদ রাখতে আমরা কী করতে পারি?
আপনার কোম্পানির ডেটা নিরাপদ রাখার জন্য কিছু গুরুত্বপূর্ণ পদক্ষেপ:
১. স্থানীয় AI ব্যবহার করুন — ডেটা কখনো বাইরে যাবে না
২. কর্মচারীদের সচেতন করুন কোন তথ্য শেয়ার করা যাবে না
৩. এক্সেস কন্ট্রোল সিস্টেম তৈরি করুন...
# (The Bengali exchange above asks "What can we do to keep our company's
# data safe?"; the model replies: use local AI so data never leaves,
# train staff on what not to share, and build an access control system.)
# Exit the interactive session:
/bye
Useful interactive commands
Ollama interactive mode — all slash commands
/set system "You are an ERP assistant..." # Set system prompt
/set parameter temperature 0.1 # Lower = more focused
/set parameter num_ctx 4096 # Set context window size
/show info # Show model details
/show license # Show model license
/clear # Clear conversation history
/save my_session # Save conversation
/load my_session # Load saved conversation
/? # Show all available commands
/bye # Exit interactive mode
Section 7 — Ollama CLI: Every command you need to know
Complete Ollama CLI reference
# ── MODEL MANAGEMENT ──────────────────────────────────
ollama pull llama3.1:8b # Download a model
ollama pull llama3.1:8b-instruct-q8_0 # Pull specific quantization
ollama run llama3.1:8b # Run model interactively
ollama list # List all downloaded models
ollama show llama3.1:8b # Show model details + parameters
ollama rm llama3.1:8b # Delete a model (free disk space)
ollama cp llama3.1:8b mycompany-ai # Copy/rename a model
# ── RUNTIME MONITORING ────────────────────────────────
ollama ps # Show running models + memory usage
ollama --version # Show Ollama version
# Example 'ollama ps' output:
# NAME ID SIZE PROCESSOR UNTIL
# llama3.1:8b 42182419 6.0 GB 100% GPU 4 minutes from now
# ── CUSTOM MODELS (Modelfile) ─────────────────────────
ollama create mymodel -f ./Modelfile # Create from Modelfile
ollama push mymodel # Push to Ollama registry (if signed in)
# ── SERVE (for custom port/host) ──────────────────────
ollama serve # Start server manually (if not as service)
OLLAMA_HOST=0.0.0.0:11434 ollama serve # Bind to all interfaces
Section 8 — Creating a custom Modelfile: Give AI your company's personality
A Modelfile is Ollama's equivalent of a Dockerfile — it defines how a model behaves, what system prompt it uses, what parameters it runs with, and what name it gets. This is how you create a company-branded AI assistant with consistent behavior across all users.
Your first company Modelfile
File: /home/ubuntu/ai-setup/Modelfile.company
FROM llama3.1:8b
# System prompt — defines the AI's identity and behavior
SYSTEM """
You are an intelligent business assistant for Dhaka Traders Ltd.,
a wholesale trading company based in Bangladesh.
Your responsibilities:
- Answer questions about company policies, procedures, and ERP data
- Help employees draft professional emails and documents
- Summarize reports and explain data trends
- Respond in the same language the user writes in
(Bengali or English — never mix unless asked)
- For Bengali: always use formal business Bengali (চলিত ভাষা)
- For English: be concise, professional, and structured
Strict rules:
- Never reveal internal salary data or confidential pricing
- Never make up statistics — say "I don't have that data"
- Always recommend verifying critical decisions with management
- If unsure, say so honestly rather than guessing
"""
# Model parameters
PARAMETER temperature 0.3 # Lower = more consistent, factual
PARAMETER top_p 0.9 # Nucleus sampling — keeps responses focused
PARAMETER top_k 40 # Vocabulary breadth
PARAMETER num_ctx 8192 # Context window — 8K tokens
PARAMETER num_predict 2048 # Max response length
PARAMETER repeat_penalty 1.1 # Reduce repetition
# Optional: Add example conversation to guide behavior
MESSAGE user "আমাদের রিটার্ন পলিসি কী?"
MESSAGE assistant "আমাদের রিটার্ন পলিসি অনুযায়ী, পণ্য ক্রয়ের ৭ দিনের মধ্যে ফেরত দেওয়া যাবে, তবে পণ্যটি অক্ষত থাকতে হবে এবং মূল রসিদ দেখাতে হবে।"
# (Example exchange in Bengali: the user asks "What is our return policy?"
# and the assistant answers: returns accepted within 7 days if the product
# is intact and the original receipt is shown.)
Terminal — create and test your company model
# Create the custom model from your Modelfile
ollama create dhaka-traders-ai -f ./Modelfile.company
# transferring model data
# creating model layer
# creating template layer
# creating system layer
# creating parameters layer
# writing manifest
# success
# Verify it appears in your model list
ollama list
# NAME ID SIZE
# dhaka-traders-ai a1b2c3d4e5f6 4.9 GB
# llama3.1:8b 42182419e950 4.9 GB
# Run and test your company model
ollama run dhaka-traders-ai
>>> What is our return policy?
According to our policy, products can be returned within 7 days
of purchase, provided the item is undamaged and the original
receipt is presented. Refunds are processed within 3 business days.
Section 9 — The REST API: Making your first API call
The REST API is the most important part of this entire setup — it is how your ERP system, web applications, and custom tools will communicate with your local AI. Understand this section deeply before moving to Part 06 (ERP integration).
API endpoint reference
1
POST /api/generate — single turn completion
Send a prompt, get a response. No conversation history. Best for single-shot tasks like summarization, classification, or code generation.
2
POST /api/chat — multi-turn conversation
Send full message history array. Maintains context across multiple turns. Required for chatbot interfaces and conversational workflows.
3
POST /api/embeddings — generate vector embeddings
Convert text to numerical vectors for RAG pipelines. Uses embedding models like nomic-embed-text. Returns a float array. Used in Part 05.
4
GET /api/tags — list available models
Returns all models currently downloaded and available on your Ollama server. Use this in your app's model selection dropdown.
5
POST /api/pull — pull a model via API
Programmatically trigger a model download. Useful for admin dashboards that manage the AI server remotely.
6
DELETE /api/delete — remove a model
Delete a model from disk via API. Use in admin tools for model lifecycle management.
API call 1 — Simple generate (curl)
Terminal / PowerShell — your first API call
# Linux/Mac terminal:
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"prompt": "Summarize the top 3 benefits of using private AI for a company",
"stream": false
}'
# Response (formatted for readability):
{
"model": "llama3.1:8b",
"created_at": "2024-01-15T08:23:11Z",
"response": "1. Data Privacy: All processing happens locally...\n
2. Cost Efficiency: No per-query API fees...\n
3. Full Control: Customize system prompts...",
"done": true,
"total_duration": 8234567890,
"load_duration": 1234567,
"prompt_eval_count": 18,
"eval_count": 127,
"eval_duration": 7890123456
}
# Windows PowerShell equivalent:
Invoke-RestMethod -Uri "http://localhost:11434/api/generate" `
-Method POST `
-ContentType "application/json" `
-Body '{"model":"llama3.1:8b","prompt":"Hello","stream":false}'
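The eval_count and eval_duration fields let you compute real throughput: durations are reported in nanoseconds. A small helper, using the numbers from the sample response above:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports durations in nanoseconds; convert to tokens/sec."""
    return eval_count / (eval_duration_ns / 1_000_000_000)

# Values taken from the sample /api/generate response above:
print(round(tokens_per_second(127, 7_890_123_456), 1))  # 16.1 tokens/sec
```

This is the number to watch when comparing CPU vs GPU performance on your own hardware.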
API call 2 — Chat with message history
Terminal — multi-turn chat API call
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"stream": false,
"messages": [
{
"role": "system",
"content": "You are an ERP assistant for Dhaka Traders Ltd.
Always respond professionally.
Answer only work-related questions."
},
{
"role": "user",
"content": "What should I check first when a customer
complains about a delayed delivery?"
}
]
}'
# NOTE: the "content" strings above are wrapped for readability only.
# Real JSON strings cannot contain raw line breaks; in an actual command,
# keep each "content" value on one line (or escape breaks as \n).
# Response (wrapped for readability):
{
"model": "llama3.1:8b",
"message": {
"role": "assistant",
"content": "When investigating a delayed delivery complaint,
check these items in order:\n
1. Verify the order status in the ERP system...\n
2. Check the dispatch date vs promised delivery date...\n
3. Contact the logistics partner for tracking update...\n
4. Document everything in the customer interaction log."
},
"done": true
}
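Note that the API itself is stateless: your client owns the conversation and must resend the full messages array on every call, appending each assistant reply before the next user turn. A minimal sketch of that loop (send_chat is a stub standing in for the HTTP call, so the pattern runs without a live server):

```python
def send_chat(messages):
    """Stub for POST /api/chat. A real client would POST
    {"model": ..., "messages": messages} and read message.content."""
    return {"role": "assistant", "content": f"(reply to: {messages[-1]['content']})"}

history = [{"role": "system", "content": "You are an ERP assistant."}]

for user_text in ["What is our return policy?", "And for damaged goods?"]:
    history.append({"role": "user", "content": user_text})
    reply = send_chat(history)  # the FULL history goes with every request
    history.append(reply)       # append the reply so the next turn has context

print(len(history))  # 5: system prompt + 2 user turns + 2 assistant replies
```

Forgetting to append the assistant reply is the most common cause of a chatbot that "loses its memory" between turns.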
API call 3 — Streaming response (real-time token display)
Streaming API — tokens arrive one by one as generated
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"stream": true,
"messages": [
{"role":"user","content":"Explain what RAG means in 2 sentences"}
]
}'
# With stream:true, you receive multiple JSON objects:
# Each line carries the next fragment of the response as it is generated:
{"model":"llama3.1:8b","message":{"role":"assistant","content":"RAG"},"done":false}
{"model":"llama3.1:8b","message":{"role":"assistant","content":" stands"},"done":false}
{"model":"llama3.1:8b","message":{"role":"assistant","content":" for"},"done":false}
{"model":"llama3.1:8b","message":{"role":"assistant","content":" Retrieval"},"done":false}
...
{"model":"llama3.1:8b","message":{"role":"assistant","content":""},"done":true,
"total_duration":4567890123,"eval_count":89}
# This streaming pattern is what makes responses feel "live"
# in browser-based chat interfaces like Open WebUI
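A client consumes this stream by reading one JSON object per line and concatenating the message.content fragments until done is true. A sketch, fed abbreviated sample lines instead of a live connection:

```python
import json

def collect_stream(lines):
    """Accumulate message.content from newline-delimited JSON chunks."""
    text = ""
    for line in lines:
        chunk = json.loads(line)
        text += chunk["message"]["content"]
        if chunk["done"]:  # the final chunk also carries timing stats
            break
    return text

# Abbreviated versions of the streamed lines shown above:
sample = [
    '{"message":{"role":"assistant","content":"RAG"},"done":false}',
    '{"message":{"role":"assistant","content":" stands"},"done":false}',
    '{"message":{"role":"assistant","content":" for"},"done":false}',
    '{"message":{"role":"assistant","content":""},"done":true}',
]
print(collect_stream(sample))  # RAG stands for
```

In a real client you would print each fragment as it arrives instead of waiting for the full string; that is what produces the "live typing" effect.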
API call 4 — Generate embeddings (for RAG)
Embeddings API — used in Part 05 RAG pipeline
curl http://localhost:11434/api/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text",
"prompt": "Customer Karim placed an order for 500 units of Product A"
}'
# Response — a vector of 768 floating point numbers:
{
"embedding": [
0.0023064255, -0.014368402, -0.00983825,
0.035714887, -0.024517306, 0.05992762,
... (768 total values)
]
}
# This vector mathematically represents the meaning of the sentence
# Similar sentences produce mathematically similar vectors
# This is the foundation of the RAG system in Part 05
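What makes these vectors useful is that "similar meaning" becomes a measurable number. The standard comparison is cosine similarity, shown here in plain Python with tiny 3-dimensional vectors standing in for real 768-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """1.0 = same direction (same meaning), ~0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.1, 0.8, 0.3]   # embedding of sentence A
v2 = [0.1, 0.8, 0.3]   # identical meaning
v3 = [0.9, -0.2, 0.1]  # unrelated meaning
print(round(cosine_similarity(v1, v2), 3))                    # 1.0
print(cosine_similarity(v1, v3) < cosine_similarity(v1, v2))  # True
```

A RAG pipeline runs exactly this comparison between your query's embedding and every stored document chunk, then feeds the closest chunks to the chat model.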
API call 5 — List available models
Tags API — build a model selector in your application
curl http://localhost:11434/api/tags
# Response:
{
"models": [
{
"name": "llama3.1:8b",
"modified_at": "2024-01-15T06:00:00Z",
"size": 4920000000,
"digest": "sha256:42182419e950...",
"details": {
"parameter_size": "8.0B",
"quantization_level": "Q4_K_M",
"family": "llama",
"format": "gguf"
}
},
{
"name": "qwen2.5:7b",
"modified_at": "2024-01-15T07:00:00Z",
"size": 4400000000,
...
}
]
}
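For a model-selection dropdown, your application only needs the name (and perhaps the size) of each entry. Parsing that response in Python, with a static sample standing in for the live call:

```python
import json

# Static stand-in for the GET /api/tags response shown above
sample_response = json.loads("""
{"models": [
  {"name": "llama3.1:8b", "size": 4920000000},
  {"name": "qwen2.5:7b", "size": 4400000000}
]}
""")

def model_choices(tags_response):
    """Return (name, size-in-GB) pairs for a selection dropdown."""
    return [(m["name"], round(m["size"] / 1e9, 1))
            for m in tags_response["models"]]

print(model_choices(sample_response))
# [('llama3.1:8b', 4.9), ('qwen2.5:7b', 4.4)]
```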
Section 10 — Environment variables: Complete configuration reference
Ollama's behavior is controlled through environment variables. These are critical for production deployment. Every company deployment should configure these deliberately — default values are for development only.
OLLAMA_HOST
The IP address and port Ollama listens on. Use localhost for single-machine setups. Use 0.0.0.0 to allow LAN access from other computers.
Default: 127.0.0.1:11434
OLLAMA_MODELS
Directory where model files are stored. Change this to a large disk if your system drive is small. Models are 2–10GB each.
Default: ~/.ollama/models
OLLAMA_KEEP_ALIVE
How long a model stays loaded in memory after the last request. Set to "0" to unload immediately. Set to "-1" to keep forever. "5m" for 5 minutes.
Default: 5m
OLLAMA_NUM_PARALLEL
Maximum number of requests processed simultaneously. Higher = more concurrent users but more VRAM. On 24GB GPU, setting 2–3 is reasonable.
Default: 1
OLLAMA_MAX_LOADED_MODELS
Maximum number of models loaded in memory at once. Set to 1 if VRAM is limited. Set to 2–3 on high-VRAM servers to serve multiple models simultaneously.
Default: 1
OLLAMA_GPU_OVERHEAD
VRAM overhead reserved for the OS and other processes. Increase if getting out-of-memory errors. Measured in bytes.
Default: 0
CUDA_VISIBLE_DEVICES
Specify which GPU(s) Ollama uses (for multi-GPU servers). "0" = first GPU only. "0,1" = use both. "-1" = force CPU mode.
Default: all GPUs
OLLAMA_DEBUG
Enable verbose debug logging. Set to "1" to see detailed inference logs. Useful when troubleshooting slow performance or GPU detection issues.
Default: 0 (disabled)
Recommended production configuration
Production environment variable configuration — company server
# Linux — /etc/systemd/system/ollama.service.d/override.conf
# (Windows: set the same variables, without the [Service] wrapper, under
#  System Properties → Environment Variables, then Restart-Service Ollama)
[Service]
# Bind to all interfaces so LAN clients can reach the server
Environment="OLLAMA_HOST=0.0.0.0:11434"
# Store models on a dedicated data drive
Environment="OLLAMA_MODELS=/data/ollama/models"
# Keep model in memory for 10 minutes after last use
# Prevents slow reload between queries from different users
Environment="OLLAMA_KEEP_ALIVE=10m"
# Allow 2 parallel requests (good for 24GB VRAM)
Environment="OLLAMA_NUM_PARALLEL=2"
# Only allow 1 model loaded at a time (for VRAM conservation)
Environment="OLLAMA_MAX_LOADED_MODELS=1"
# Reserve 512MB VRAM for OS (prevents out-of-memory crashes)
Environment="OLLAMA_GPU_OVERHEAD=536870912"
# Apply changes:
sudo systemctl daemon-reload
sudo systemctl restart ollama
systemctl status ollama
Section 11 — Enabling LAN access: Let your entire team use the AI server
By default, Ollama only accepts connections from the same machine (localhost). To allow employees on your company network to send requests to the AI server, you need to bind Ollama to the server's network interface.
Security warning before enabling LAN access
Exposing Ollama on your LAN without authentication means ANY device on your network can send unlimited requests to your AI server. For initial testing this is fine. For production, always put NGINX with authentication in front of Ollama. We cover this fully in Part 10 (Security). For now, ensure your Ollama port is not exposed to the internet — only your internal company network.
Enable LAN access — Linux systemd
# Edit Ollama service environment:
sudo systemctl edit ollama
# Add this line in the [Service] section:
Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Get your server's LAN IP address:
ip addr show | grep "inet " | grep -v 127.0.0.1
# Example: inet 192.168.1.100/24
# Test from another machine on the same network:
curl http://192.168.1.100:11434
# Expected: Ollama is running
# Open firewall port (Ubuntu with ufw):
sudo ufw allow from 192.168.1.0/24 to any port 11434
sudo ufw status
Enable LAN access — Windows
# Set environment variable (as Administrator):
[System.Environment]::SetEnvironmentVariable(
"OLLAMA_HOST", "0.0.0.0:11434", "Machine")
# Restart Ollama service:
Restart-Service Ollama
# Allow through Windows Firewall:
New-NetFirewallRule -DisplayName "Ollama LAN Access" `
-Direction Inbound `
-Protocol TCP `
-LocalPort 11434 `
-RemoteAddress LocalSubnet `
-Action Allow
# Get your Windows server's IP:
ipconfig | findstr "IPv4"
# IPv4 Address: 192.168.1.100
# Test from another machine:
# Open browser → http://192.168.1.100:11434
# Should show: Ollama is running
Section 12 — Troubleshooting: 10 common errors and how to fix them
Error 1 — 'ollama' is not recognized as a command
Cause: Ollama binary not in system PATH. Usually means installer did not complete or PATH was not refreshed.
Fix (Windows): Close all terminal windows and open a NEW PowerShell window. PATH changes only apply to new sessions. If still failing: System Properties → Environment Variables → verify C:\Users\[Name]\AppData\Local\Programs\Ollama is in your PATH.
Fix (Linux): Run source ~/.bashrc or log out and back in. Then which ollama should return /usr/local/bin/ollama.
Error 2 — Model runs on CPU instead of GPU (very slow)
Cause: NVIDIA drivers not installed or CUDA not detected by Ollama.
Diagnose: Run ollama run llama3.1:8b, then in another terminal run nvidia-smi and watch GPU memory — if memory does not increase, GPU is not being used.
Fix: Install/update NVIDIA drivers, then restart the Ollama service. On Linux: sudo apt install nvidia-driver-535 -y && sudo reboot. After reboot, verify with nvidia-smi; ollama ps should then show "GPU" in the PROCESSOR column.
Error 3 — "out of memory" or model crashes immediately
Cause: Model requires more VRAM than available on your GPU.
Fix Option A: Switch to a smaller model. If 7B fails, try 2B or 3B. Run ollama pull gemma2:2b as a test.
Fix Option B: Force CPU mode temporarily by setting CUDA_VISIBLE_DEVICES="" on the Ollama server process — note that inference happens in the server, so setting the variable only on the client command has no effect when Ollama runs as a service (on systemd, add it via sudo systemctl edit ollama and restart). Slower, but works on any machine.
Fix Option C: Use a more aggressively quantized version: ollama pull llama3.1:8b-instruct-q4_0 instead of the default Q4_K_M.
Error 4 — API returns "connection refused" at port 11434
Cause: Ollama service is not running.
Fix (Windows): Start-Service Ollama in PowerShell as Administrator. Or search "Ollama" in Start Menu and click the app.
Fix (Linux): sudo systemctl start ollama. Then systemctl status ollama to verify. If it fails to start, check logs: journalctl -u ollama -f.
Error 5 — Model pull fails or download stalls
Cause: Network issue, insufficient disk space, or Ollama registry timeout.
Fix: Check disk space first: df -h (Linux) or Get-PSDrive C (Windows). Need at least 2x model size free during download. If stalled, press Ctrl+C and re-run ollama pull — it resumes from where it stopped. Check DNS: try ping ollama.com to verify connectivity.
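If you pull models from scripts, the disk-space check is easy to automate. A small Python sketch using the standard library (the function names are illustrative; the 2x free-space rule is the guideline stated above):

```python
import shutil

def free_gb(path: str = "/") -> float:
    """Free space on the drive holding `path`, in GB."""
    return shutil.disk_usage(path).free / 1e9

def enough_space_for_model(model_gb: float, path: str = "/") -> bool:
    """Apply the rule of thumb: keep ~2x the model size free during download."""
    return free_gb(path) >= 2 * model_gb

if __name__ == "__main__":
    # Example: a ~4.9 GB llama3.1:8b download needs ~10 GB free.
    print(enough_space_for_model(4.9))
```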
Error 6 — Responses are extremely slow (under 1 token/sec)
Cause: Model is running on CPU instead of GPU, or model is too large for available VRAM and is partially swapping to RAM.
Diagnose: Run ollama ps during inference — check PROCESSOR column. Should say "GPU". If it says "CPU", the GPU is not being used. Check nvidia-smi for VRAM usage.
Fix: Switch to smaller model, update GPU drivers, or add OLLAMA_GPU_OVERHEAD=536870912 environment variable to reserve VRAM for stability.
Error 7 — "context length exceeded" in API response
Cause: Your prompt + conversation history exceeds the model's context window.
Fix: Either reduce the amount of text in your prompt, truncate conversation history in multi-turn chats, or switch to a model with a larger context window (LLaMA 3.1 8B supports 128K tokens). In the API, set "num_ctx": 4096 explicitly in your request to avoid surprises.
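A pragmatic fix for multi-turn chats is to drop the oldest messages until the estimated prompt size fits the context window. Here is a sketch using a rough 4-characters-per-token heuristic; the function names and the heuristic are illustrative, not part of Ollama, and a real tokenizer will give more accurate counts:

```python
def approx_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def trim_history(messages, num_ctx=4096, reserve=512):
    """Keep the most recent messages whose estimated token count fits in
    num_ctx, leaving `reserve` tokens of room for the model's reply."""
    budget = num_ctx - reserve
    kept = []
    for msg in reversed(messages):       # walk newest-to-oldest
        cost = approx_tokens(msg["content"])
        if budget - cost < 0:            # this message would overflow: stop
            break
        budget -= cost
        kept.append(msg)
    return list(reversed(kept))          # restore chronological order
```

Call trim_history on your message list before every /api/chat request; if you use a system prompt that must always survive, re-prepend it after trimming.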
Error 8 — LAN clients cannot reach the Ollama server
Cause: OLLAMA_HOST not set to 0.0.0.0, or firewall blocking port 11434.
Fix: Verify the OLLAMA_HOST environment variable is set: systemctl show ollama | grep Environment. Check the firewall with sudo ufw status; port 11434 must be allowed from your LAN subnet. Test from the server itself first (curl http://localhost:11434), then from a client using the server's LAN IP (curl http://192.168.1.100:11434).
Error 9 — Bengali text appears garbled or as boxes
Cause: Terminal or application does not support Unicode UTF-8 encoding.
Fix (Windows terminal): Run chcp 65001 in Command Prompt to switch to UTF-8. Or use Windows Terminal app instead of CMD — it handles Unicode natively. In PowerShell: [Console]::OutputEncoding = [System.Text.Encoding]::UTF8.
Fix (API/app): Ensure your HTTP client sends Accept-Charset: utf-8 and your response parser handles UTF-8. In C#, use Encoding.UTF8 when reading the response stream.
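There is a second, subtler trap when consuming Ollama's streaming API: a chunk boundary can fall in the middle of a multi-byte Bengali character, so decoding each chunk independently garbles the text even though the encoding is correct. An incremental decoder handles split characters; this is a general-purpose sketch, not Ollama-specific code:

```python
import codecs

def decode_utf8_stream(chunks):
    """Decode byte chunks incrementally so multi-byte UTF-8 characters
    split across chunk boundaries are reassembled correctly."""
    dec = codecs.getincrementaldecoder("utf-8")()
    out = []
    for chunk in chunks:
        out.append(dec.decode(chunk))    # buffers incomplete trailing bytes
    out.append(dec.decode(b"", final=True))
    return "".join(out)
```

In C#, the equivalent is wrapping the response stream in a StreamReader with Encoding.UTF8, which performs the same incremental decoding for you.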
Error 10 — Model loaded but not using full GPU capacity
Cause: Model is partially offloaded to CPU because VRAM is insufficient for the full model at current quantization.
Diagnose: Run ollama ps while the model is loaded and check the PROCESSOR column. A split such as "40%/60% CPU/GPU" means the model does not fully fit in VRAM and is partially offloaded to CPU.
Fix: Use a more aggressively quantized version (Q4_0 instead of Q8_0). Or switch to a smaller model that fits fully in VRAM. A fully GPU-resident model is 3–10x faster than a split GPU/CPU model.
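You can predict whether a model will fit fully in VRAM with back-of-envelope arithmetic: the weights take roughly (parameters x bits-per-weight) / 8 bytes, plus runtime overhead for the KV cache and buffers. The bits-per-weight figures below (about 4.5 for Q4_K_M, about 8.5 for Q8_0) and the 1.5 GB overhead are rough working assumptions, not exact numbers:

```python
def est_vram_gb(params_b: float, bits_per_weight: float, overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate in GB: quantized weights plus runtime overhead.

    params_b        -- parameter count in billions (e.g. 8 for an 8B model)
    bits_per_weight -- effective bits per weight (~4.5 for Q4_K_M, ~8.5 for Q8_0)
    overhead_gb     -- assumed KV-cache/runtime headroom (a guess, not measured)
    """
    weights_gb = params_b * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)
```

By this estimate an 8B model at Q4_K_M needs about 6 GB of VRAM while Q8_0 needs about 10 GB, which is why dropping to Q4 often turns a split GPU/CPU model into a fully GPU-resident one on a 8 GB card.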
Section 13 — Do's, Don'ts, and limitations
Do — best practices
- Start with gemma2:2b to test your setup
- Upgrade to llama3.1:8b once hardware confirmed
- Pull nomic-embed-text now for later RAG use
- Create a Modelfile for your company persona
- Use Q4_K_M quantization as your default
- Monitor GPU memory with nvidia-smi
- Set OLLAMA_KEEP_ALIVE for production servers
- Store models on a dedicated large drive
- Test API with curl before writing application code
- Document which model version is deployed
Don't — avoid these mistakes
- Don't run 7B+ models on 8GB RAM without GPU
- Don't expose port 11434 to the internet directly
- Don't ignore GPU driver setup before install
- Don't use Q2 quantization for production tasks
- Don't use the same model for all use cases
- Don't skip testing the API before ERP integration
- Don't set OLLAMA_NUM_PARALLEL too high (VRAM risk)
- Don't store models on the OS system drive
- Don't assume AI output is always correct — validate
- Don't keep two large models loaded simultaneously
Honest limitations of local AI at this stage
What local AI cannot do well (yet)
// Speed comparison — local 7B vs GPT-4 (cloud)
Local LLaMA 3.1 8B on RTX 4090: ~80-120 tokens/sec
Local LLaMA 3.1 8B on RTX 3060: ~30-50 tokens/sec
Local LLaMA 3.1 8B on CPU only: ~3-8 tokens/sec
GPT-4 Turbo (cloud): ~50-100 tokens/sec
// Quality comparison — approximate task accuracy
Complex reasoning: Local 8B = ~75% of GPT-4 quality
Simple Q&A: Local 8B = ~90% of GPT-4 quality
Code generation: Local 8B = ~80% of GPT-4 quality
Bengali quality: Local 8B = ~65% of GPT-4 quality (improving)
Long documents: Local 8B = ~70% of GPT-4 quality
// What local AI is NOT suitable for (at this model size):
// - Real-time complex financial modeling
// - Advanced legal document analysis
// - Medical diagnosis support
// - Tasks requiring internet search (no built-in web access)
// - Vision/image analysis (standard LLaMA — use LLaVA for images)