Part 01 of 15 · Hands-On Setup · Windows & Linux · Step by Step · API Code Included · FreeLearning365
From Zero to Private AI System — Complete Series
Install Ollama & Run Your First Local AI Model: Complete Hands-On Guide
From downloading Ollama to making your first live API call — every command, every config file, every error fix, every model explained. Windows and Linux. GPU and CPU. This is your complete Day 1 local AI setup guide.
By @FreeLearning365 · Part 01 — Local AI Setup · Read time: ~35 min · Skill level: Beginner to Intermediate · Windows 10/11 & Ubuntu 20.04+
"In the next 35 minutes, you will go from zero local AI to a fully running language model on your own hardware — responding to your questions, processing your company data, and doing it all without a single byte leaving your network. This is the moment your company's private AI story begins."
What this post covers
- What Ollama is and how it works
- System requirements — CPU, RAM, GPU
- Install on Windows 10/11 (full walkthrough)
- Install on Ubuntu/Linux (full walkthrough)
- Every major model — detailed comparison
- Quantization explained simply
- Running your first model interactively
- Ollama CLI — all commands explained
- REST API deep dive with real examples
- Multi-turn conversation via API
- Streaming responses explained
- Environment variables — full config reference
- Network access — exposing to your LAN
- Troubleshooting — 10 common errors fixed
- Do's, Don'ts, Limitations
Section 1 — What is Ollama and how does it actually work?
Before installing anything, you need a mental model of what Ollama does — because this understanding will save you hours of confusion later.
Ollama is an open-source application runtime specifically designed to make running large language models on local hardware as simple as possible. Think of it as a service manager, model downloader, memory manager, and API server — all packaged into a single tool.
When you install Ollama, it runs as a background service on your machine. It listens on port 11434 by default. When you run a model, Ollama loads the model weights into RAM (or VRAM if you have a GPU), keeps it resident in memory for fast subsequent queries, and exposes a clean REST API that any application — your browser, your ERP system, your Python script, your C# service — can call.
The Ollama architecture in plain language
Your application sends an HTTP POST request to http://localhost:11434/api/chat with your message. Ollama receives the request, passes it to the loaded model, streams the response back token by token, and your application displays it. The model runs entirely in your computer's memory. No internet. No cloud. No API key. No cost per call.
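That round trip can be sketched in a few lines of Python. The helper below is purely illustrative (the endpoint and field names match the API we cover in detail in Section 9); it only builds the JSON body that /api/chat expects:

```python
import json

# Default Ollama endpoint (assumes a local install on the default port)
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

def build_chat_payload(model: str, user_message: str) -> str:
    """Build the JSON body for a POST to /api/chat."""
    payload = {
        "model": model,
        "stream": False,  # one complete response instead of a token stream
        "messages": [{"role": "user", "content": user_message}],
    }
    return json.dumps(payload)

print(build_chat_payload("llama3.1:8b", "Hello"))
```

Any HTTP client — curl, Python requests, C# HttpClient — can POST this body to that URL; Section 9 walks through the real calls.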
Ollama's internal component map
// What Ollama manages for you automatically:
Model Registry → Downloads models from ollama.com/library
Model Storage → Stores .gguf model files on your disk
Runtime Engine → Uses llama.cpp under the hood for inference
Memory Manager → Loads model into RAM or VRAM automatically
GPU Acceleration → Detects NVIDIA/AMD/Apple Silicon automatically
REST API Server → Exposes HTTP endpoints on port 11434
Model Keep-Alive → Keeps model warm in memory between requests
Concurrent Requests → Queues multiple requests to the loaded model
The underlying engine Ollama uses is llama.cpp — a highly optimized C++ inference runtime that can run quantized LLM models efficiently on both CPU and GPU. Ollama wraps llama.cpp with a user-friendly interface, model management, and a standardized API. You never need to touch llama.cpp directly.
Section 2 — System requirements: What hardware do you actually need?
One of the most common questions is "can my machine run this?" The answer depends on which model you want to run. Here is the honest breakdown.
Minimum (CPU only)
Basic workstation / old laptop
RAM: 8GB (runs 2B–3B models only)
CPU: Any Intel i5/i7 or AMD Ryzen 5+
Storage: 20GB free (model files are large)
GPU: Not required
OS: Windows 10, Ubuntu 20.04+
Speed: Slow (2–8 tokens/sec)
Best model: Gemma 2 2B, Phi-3 3.8B
Recommended (with GPU)
Developer workstation / company server
RAM: 16–32GB system RAM
CPU: Intel i7/i9 or Ryzen 7/9
Storage: 100GB free SSD
GPU: NVIDIA RTX 3060 (12GB VRAM)+
OS: Windows 10/11, Ubuntu 22.04
Speed: Fast (30–60 tokens/sec)
Best model: LLaMA 3.1 8B, Mistral 7B
Production server
Company AI server (10–50 users)
RAM: 32–64GB system RAM
CPU: Xeon or Threadripper
Storage: 500GB+ NVMe SSD
GPU: NVIDIA RTX 4090 (24GB VRAM)
OS: Ubuntu 22.04 LTS (recommended)
Speed: Very fast (80–120 tokens/sec)
Best model: LLaMA 3.1 8B Q8, Mistral 7B
Important — RAM rule for models
A model needs roughly 1GB of RAM per 1 billion parameters at 8-bit quantization. So a 7B model needs ~7GB VRAM or RAM. A 13B model needs ~13GB. Always leave 20% overhead for the OS. If your VRAM cannot fit the model, it spills into system RAM and becomes very slow. Match your model size to your hardware — we cover this in detail in the model selection section below.
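The rule above translates directly into arithmetic. A rough sanity-check sketch (illustrative figures only; a real deployment also needs room for the KV cache):

```python
def estimate_model_gb(params_billions: float, bits: int = 8) -> float:
    """Approximate weight memory: parameters x (bits / 8) bytes, in GB."""
    return params_billions * bits / 8

def fits_in_vram(params_billions: float, vram_gb: float, bits: int = 8) -> bool:
    """Apply the ~20% overhead rule from the text above."""
    return estimate_model_gb(params_billions, bits) * 1.2 <= vram_gb

print(estimate_model_gb(7, 8))   # 7.0 GB for a 7B model at 8-bit
print(estimate_model_gb(7, 4))   # 3.5 GB at 4-bit
print(fits_in_vram(7, 12, 8))    # True on a 12GB RTX 3060
```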
Checking your GPU before installation
Windows — check GPU
PowerShell or Command Prompt
nvidia-smi
# If you see your GPU listed with a VRAM amount,
# Ollama will detect and use it automatically.
# Example output:
#   | NVIDIA GeForce RTX 3090
#   | Memory-Usage: 0MiB / 24576MiB
# No NVIDIA? Check for AMD:
wmic path win32_VideoController get name
# Ollama's AMD (ROCm) support is most mature on Linux;
# on Windows, many AMD GPUs fall back to CPU mode
Linux — check GPU
Terminal
nvidia-smi
-- Check NVIDIA GPU and VRAM
lspci | grep -i vga
-- List all display adapters
free -h
-- Check total system RAM
-- Example:
-- total used free
-- Mem: 31Gi 4Gi 26Gi
df -h /
-- Check available disk space
Section 3 — Install Ollama on Windows 10/11: Full walkthrough
Windows installation is the most straightforward — a single installer handles everything including the background service, PATH configuration, and API server startup.
Step 1 — Download the Windows installer
Open your browser and navigate to ollama.com. Click the "Download for Windows" button. This downloads OllamaSetup.exe — typically 50–80MB. Do not run it yet.
Before running the installer — check this
If you have an NVIDIA GPU, install the latest NVIDIA drivers FIRST (from nvidia.com/drivers). Ollama auto-detects CUDA — but only if proper drivers are present at install time. Skipping this means CPU-only mode until you reinstall.
Step 2 — Run the installer
Installation process — what happens automatically
// OllamaSetup.exe does ALL of the following:
1. Installs Ollama binary to:
C:\Users\[YourName]\AppData\Local\Programs\Ollama\
2. Adds Ollama to your system PATH environment variable
3. Creates and starts a Windows Service:
"Ollama" — runs on system startup automatically
4. Creates model storage directory:
C:\Users\[YourName]\.ollama\models\
5. Opens firewall rule for port 11434 (localhost only)
6. Starts the Ollama background service immediately
// After installer completes, verify installation:
ollama --version
// Expected output: ollama version 0.x.x
ollama list
// Expected output: empty list (no models yet)
// NAME ID SIZE MODIFIED
Step 3 — Verify the Ollama service is running
PowerShell — verify service status
// Method 1 — Check via PowerShell
Get-Service -Name "Ollama"
// Expected: Status = Running
// Method 2 — Check via browser
// Open: http://localhost:11434
// Expected page text: "Ollama is running"
// Method 3 — Check via curl (PowerShell)
curl http://localhost:11434
// Expected: Ollama is running
// If service is not running, start it manually:
Start-Service Ollama
// Or from Start Menu → search "Ollama" → click the app
Step 4 — Configure model storage location (optional but recommended)
By default, Ollama stores model files in your user profile directory. On a company server with a dedicated data drive, you should change this to avoid filling your system drive with large model files (a 7B model is ~4–8GB).
Windows — change model storage path
// Method: Set environment variable OLLAMA_MODELS
// Go to: System Properties → Advanced → Environment Variables
// Add NEW System Variable:
Variable name: OLLAMA_MODELS
Variable value: D:\AI\OllamaModels
// Then restart the Ollama service:
Restart-Service Ollama
// Verify the new path is active:
ollama list
// Models will now download to D:\AI\OllamaModels\
Section 4 — Install Ollama on Ubuntu/Linux: Full walkthrough
Linux installation is even simpler — a single curl command handles everything. Ubuntu 20.04, 22.04, and 24.04 are all fully supported. The installer also handles NVIDIA CUDA detection automatically on Linux.
Step 1 — Prerequisites check
Terminal — verify prerequisites
# Check Ubuntu version
lsb_release -a
# Expected: Ubuntu 20.04 / 22.04 / 24.04
# Check available disk space (need at least 20GB)
df -h /
# Example: /dev/sda1 500G 120G 380G 24% /
# Check if curl is installed
curl --version
# If not installed: sudo apt install curl -y
# For NVIDIA GPU users — check driver status:
nvidia-smi
# If command not found, install NVIDIA drivers:
# sudo apt install nvidia-driver-535 -y
# sudo reboot
# Then run nvidia-smi again to verify
Step 2 — Install Ollama (one command)
Terminal — install Ollama
# The official one-line installer:
curl -fsSL https://ollama.com/install.sh | sh
# What this script does:
# 1. Detects your OS and CPU architecture
# 2. Downloads the correct Ollama binary
# 3. Installs it to /usr/local/bin/ollama
# 4. Creates a systemd service: ollama.service
# 5. Starts the service automatically
# 6. Creates user: ollama (runs the service)
# 7. Creates model directory: /usr/share/ollama/.ollama/models/
# 8. Detects NVIDIA GPU and configures CUDA automatically
# Expected output (with NVIDIA GPU):
>>> Downloading ollama...
>>> Installing ollama to /usr/local/bin...
>>> NVIDIA GPU driver detected. Using GPU mode.
>>> Creating ollama user...
>>> Adding ollama user to 'ollama' group...
>>> Adding current user to 'ollama' group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
Step 3 — Verify installation and service
Terminal — verify everything is working
# Check version
ollama --version
# ollama version 0.x.x
# Check systemd service status
systemctl status ollama
# Expected:
# ● ollama.service - Ollama Service
# Loaded: loaded (/etc/systemd/system/ollama.service)
# Active: active (running) since ...
# Main PID: 12345 (ollama)
# Test the API endpoint
curl http://localhost:11434
# Expected: Ollama is running
# Check if GPU is being used by Ollama
ollama ps
# (No models loaded yet — will show empty table)
Step 4 — Configure model storage on Linux
Linux — customize Ollama configuration
# Edit the systemd service to add environment variables
sudo systemctl edit ollama
# This opens a drop-in config file. Add these lines:
[Service]
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=5m"
Environment="OLLAMA_NUM_PARALLEL=2"
# Save and close, then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Verify config was applied:
systemctl show ollama | grep Environment
# Should list your environment variables
# Create the models directory with correct permissions:
sudo mkdir -p /data/ollama/models
sudo chown -R ollama:ollama /data/ollama/
Section 5 — Understanding every major available model
This is the section that saves you hours of trial and error. Choosing the wrong model for your hardware or use case is the most common beginner mistake. Read this carefully before pulling any model.
What are model parameters?
Parameters are the numerical weights inside the model — the values learned during training. More parameters generally means more capability, but also more memory needed and slower inference. A "7B model" has 7 billion parameters. The relationship is not purely linear — a well-trained 7B model (like Mistral 7B) can outperform a poorly trained 13B model on many tasks.
Quantization — how to run large models on small hardware
Full precision (FP32) stores each parameter as a 32-bit float: 4 bytes per parameter, so a 7B model would need roughly 28GB for weights alone. Quantization reduces precision to 4-bit or 8-bit, drastically shrinking memory requirements with minimal quality loss. This is what makes running a 7B model on 8GB VRAM possible.
Q2_K
~2.7GB for 7B
Lowest quality. Fastest. Avoid for production.
Q4_K_M
~4.1GB for 7B
Best balance. Default choice. Recommended.
Q5_K_M
~5.0GB for 7B
Very good quality. Use if VRAM allows.
Q8_0
~7.7GB for 7B
Near-full quality. Use for production servers.
Which quantization to choose?
For most company deployments: use Q4_K_M by default — it is what Ollama downloads unless you specify otherwise. If you have 24GB VRAM (RTX 4090 or A6000), use Q8_0 for noticeably better output quality. Never use Q2 or Q3 in production — the quality degradation is significant.
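One way to make that advice concrete is to pick the highest-quality level whose weights fit in VRAM with headroom. The sizes come from the comparison above; the helper itself is a hypothetical sketch, not an Ollama feature:

```python
# Approximate file sizes for a 7B model at each quantization level
# (figures from the table above; exact sizes vary per model)
QUANT_SIZES_7B_GB = {"Q2_K": 2.7, "Q4_K_M": 4.1, "Q5_K_M": 5.0, "Q8_0": 7.7}

def best_quant(vram_gb: float, headroom: float = 1.2) -> str:
    """Highest-quality level that fits with ~20% headroom.
    Q2/Q3 are excluded per the advice above (too much quality loss)."""
    for level in ("Q8_0", "Q5_K_M", "Q4_K_M"):
        if QUANT_SIZES_7B_GB[level] * headroom <= vram_gb:
            return level
    return "Q4_K_M"  # fall back to the default and let it spill to system RAM

print(best_quant(24))  # Q8_0 on an RTX 4090
print(best_quant(8))   # Q5_K_M on an 8GB card
```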
Complete model reference — every major model explained
LLaMA 3.1 8B — Best overall
Creator: Meta
Pull cmd: ollama pull llama3.1:8b
Size (Q4): ~4.9GB
RAM needed: 8GB VRAM or 16GB RAM
Context: 128K tokens
The sweet spot for company deployments. Excellent reasoning, multilingual support, instruction following, and long document handling. 128K context means it can process very large documents in one pass. Strong Bengali capability. The default recommendation for any company with 16GB+ RAM.
Mistral 7B Instruct — Top alternative
Creator: Mistral AI
Pull cmd: ollama pull mistral:7b-instruct
Size (Q4): ~4.1GB
RAM needed: 8GB VRAM or 12GB RAM
Context: 32K tokens
Extremely capable for its size. Best-in-class instruction following. Excellent for structured output, document summarization, and Q&A. Slightly faster than LLaMA 3.1 on equivalent hardware. Great for companies with 12–16GB RAM systems. Slightly weaker Bengali support than LLaMA 3.1.
Gemma 2 2B / 9B — Low resource
Creator: Google
Pull cmd: ollama pull gemma2:2b
Size (Q4): ~1.6GB (2B) / ~5.5GB (9B)
RAM needed: 4GB (2B) / 8GB (9B)
Context: 8K tokens
The 2B version is ideal for very low-spec hardware or edge devices. Surprisingly capable for basic Q&A and summarization. The 9B version is excellent and comparable to LLaMA 3.1 8B in many tasks. Best starting model for testing on minimal hardware.
Phi-3.5 / Phi-4 — Small but powerful
Creator: Microsoft
Pull cmd: ollama pull phi3.5 / ollama pull phi4
Size (Q4): ~2.2GB (3.8B) / ~9GB (14B)
RAM needed: 6GB (3.8B) / 16GB (14B)
Context: 128K tokens (Phi-3.5)
Microsoft's research triumph. Phi-3.5 Mini (3.8B) outperforms many 7B models on reasoning and coding tasks. Excellent SQL generation capability — ideal for ERP integrations. Phi-4 (14B) is one of the best models available at any size for analytical tasks.
Qwen2.5 7B / 14B — Best multilingual
Creator: Alibaba
Pull cmd: ollama pull qwen2.5:7b
Size (Q4): ~4.4GB (7B) / ~9GB (14B)
RAM needed: 8GB (7B) / 16GB (14B)
Context: 128K tokens
Best Bengali and multilingual support among all local models. Trained on significantly more multilingual data than Meta or Google models. If your team primarily communicates in Bengali, Qwen2.5 7B is the model to start with. Also excellent at coding and structured data tasks.
DeepSeek-R1 7B / 14B — Best reasoning
Creator: DeepSeek AI
Pull cmd: ollama pull deepseek-r1:7b
Size (Q4): ~4.7GB (7B) / ~9.3GB (14B)
RAM needed: 8GB (7B) / 16GB (14B)
Context: 128K tokens
Specifically trained for chain-of-thought reasoning. Shows its "thinking process" before giving an answer — excellent for complex analysis, financial reasoning, and multi-step problem solving. Responses are longer (due to reasoning steps) but noticeably more accurate on complex tasks.
CodeLlama 7B / 13B — Best for code
Creator: Meta
Pull cmd: ollama pull codellama:7b
Size (Q4): ~3.8GB (7B)
RAM needed: 8GB
Context: 16K tokens
Purpose-built for code generation. Excellent for generating SQL stored procedures, C# classes, ASP.NET controllers, and debugging code. Fine-tuned specifically on code from many languages. Pairs perfectly with your ERP development workflow for generating boilerplate and reviewing logic.
Nomic Embed Text — Embeddings only
Creator: Nomic AI
Pull cmd: ollama pull nomic-embed-text
Size: ~274MB
RAM needed: <1GB
Use case: RAG pipeline only
Not a chat model — purely for generating vector embeddings used in RAG pipelines (Part 05). Converts text into numerical vectors for semantic search. Required in Part 05 when we build document intelligence. Pull this now — you will need it later. Tiny and fast.
Section 6 — Pulling and running your first model
Pulling a model (downloading)
Terminal / PowerShell — pull your first model
# For low-spec machines (8GB RAM, no GPU):
ollama pull gemma2:2b
# pulling manifest
# pulling 879c7f77f9d6... 100% ████████████ 1.6 GB
# pulling 43070e2d4e53... 100% ████████████ 11 KB
# success
# For standard company workstation (16GB RAM, GPU):
ollama pull llama3.1:8b
# pulling manifest
# pulling 62fbfd9ed093... 100% ████████████ 4.9 GB
# success
# For Bengali-focused deployments:
ollama pull qwen2.5:7b
# Pull the embedding model (needed for Part 05):
ollama pull nomic-embed-text
# Check what you have downloaded:
ollama list
# NAME ID SIZE MODIFIED
# llama3.1:8b 42182419e950 4.9 GB 2 minutes ago
# gemma2:2b ff02c3702f32 1.6 GB 5 minutes ago
# nomic-embed-text:latest 0a109f422b47 274 MB 1 minute ago
Running a model interactively
Interactive terminal session — your first AI conversation
ollama run llama3.1:8b
>>> Send a message (/? for help)
>>> Summarize what Ollama is in 3 bullet points
• Ollama is a local AI runtime that lets you run large language
models directly on your own hardware without cloud dependency.
• It manages model downloads, GPU acceleration, and exposes a
REST API at localhost:11434 for application integration.
• It supports models like LLaMA, Mistral, Gemma, and Phi,
making private AI accessible without technical complexity.
>>> আমাদের কোম্পানির ডেটা নিরাপদ রাখতে আমরা কী করতে পারি?
আপনার কোম্পানির ডেটা নিরাপদ রাখার জন্য কিছু গুরুত্বপূর্ণ পদক্ষেপ:
১. স্থানীয় AI ব্যবহার করুন — ডেটা কখনো বাইরে যাবে না
২. কর্মচারীদের সচেতন করুন কোন তথ্য শেয়ার করা যাবে না
৩. এক্সেস কন্ট্রোল সিস্টেম তৈরি করুন...
# (The Bengali exchange above asks "What can we do to keep our company's
# data safe?"; the model replies: use local AI so data never leaves,
# train staff on what not to share, and build an access control system.)
# Exit the interactive session:
/bye
Useful interactive commands
Ollama interactive mode — all slash commands
/set system "You are an ERP assistant..." # Set system prompt
/set parameter temperature 0.1 # Lower = more focused
/set parameter num_ctx 4096 # Set context window size
/show info # Show model details
/show license # Show model license
/clear # Clear conversation history
/save my_session # Save conversation
/load my_session # Load saved conversation
/? # Show all available commands
/bye # Exit interactive mode
Section 7 — Ollama CLI: Every command you need to know
Complete Ollama CLI reference
# ── MODEL MANAGEMENT ──────────────────────────────────
ollama pull llama3.1:8b # Download a model
ollama pull llama3.1:8b-instruct-q8_0 # Pull specific quantization
ollama run llama3.1:8b # Run model interactively
ollama list # List all downloaded models
ollama show llama3.1:8b # Show model details + parameters
ollama rm llama3.1:8b # Delete a model (free disk space)
ollama cp llama3.1:8b mycompany-ai # Copy/rename a model
# ── RUNTIME MONITORING ────────────────────────────────
ollama ps # Show running models + memory usage
ollama --version # Show Ollama version
# Example 'ollama ps' output:
# NAME ID SIZE PROCESSOR UNTIL
# llama3.1:8b 42182419 6.0 GB 100% GPU 4 minutes from now
# ── CUSTOM MODELS (Modelfile) ─────────────────────────
ollama create mymodel -f ./Modelfile # Create from Modelfile
ollama push mymodel # Push to Ollama registry (if signed in)
# ── SERVE (for custom port/host) ──────────────────────
ollama serve # Start server manually (if not as service)
OLLAMA_HOST=0.0.0.0:11434 ollama serve # Bind to all interfaces
Section 8 — Creating a custom Modelfile: Give AI your company's personality
A Modelfile is Ollama's equivalent of a Dockerfile — it defines how a model behaves, what system prompt it uses, what parameters it runs with, and what name it gets. This is how you create a company-branded AI assistant with consistent behavior across all users.
Your first company Modelfile
File: /home/ubuntu/ai-setup/Modelfile.company
FROM llama3.1:8b
# System prompt — defines the AI's identity and behavior
SYSTEM """
You are an intelligent business assistant for Dhaka Traders Ltd.,
a wholesale trading company based in Bangladesh.
Your responsibilities:
- Answer questions about company policies, procedures, and ERP data
- Help employees draft professional emails and documents
- Summarize reports and explain data trends
- Respond in the same language the user writes in
(Bengali or English — never mix unless asked)
- For Bengali: always use formal business Bengali (চলিত ভাষা)
- For English: be concise, professional, and structured
Strict rules:
- Never reveal internal salary data or confidential pricing
- Never make up statistics — say "I don't have that data"
- Always recommend verifying critical decisions with management
- If unsure, say so honestly rather than guessing
"""
# Model parameters
PARAMETER temperature 0.3 # Lower = more consistent, factual
PARAMETER top_p 0.9 # Nucleus sampling — keeps responses focused
PARAMETER top_k 40 # Vocabulary breadth
PARAMETER num_ctx 8192 # Context window — 8K tokens
PARAMETER num_predict 2048 # Max response length
PARAMETER repeat_penalty 1.1 # Reduce repetition
# Optional: Add example conversation to guide behavior
MESSAGE user "আমাদের রিটার্ন পলিসি কী?"
MESSAGE assistant "আমাদের রিটার্ন পলিসি অনুযায়ী, পণ্য ক্রয়ের ৭ দিনের মধ্যে ফেরত দেওয়া যাবে, তবে পণ্যটি অক্ষত থাকতে হবে এবং মূল রসিদ দেখাতে হবে।"
# (Example exchange in Bengali: the user asks "What is our return policy?"
# and the assistant answers: returns accepted within 7 days if the product
# is intact and the original receipt is shown.)
Terminal — create and test your company model
# Create the custom model from your Modelfile
ollama create dhaka-traders-ai -f ./Modelfile.company
# transferring model data
# creating model layer
# creating template layer
# creating system layer
# creating parameters layer
# writing manifest
# success
# Verify it appears in your model list
ollama list
# NAME ID SIZE
# dhaka-traders-ai a1b2c3d4e5f6 4.9 GB
# llama3.1:8b 42182419e950 4.9 GB
# Run and test your company model
ollama run dhaka-traders-ai
>>> What is our return policy?
According to our policy, products can be returned within 7 days
of purchase, provided the item is undamaged and the original
receipt is presented. Refunds are processed within 3 business days.
Section 9 — The REST API: Making your first API call
The REST API is the most important part of this entire setup — it is how your ERP system, web applications, and custom tools will communicate with your local AI. Understand this section deeply before moving to Part 06 (ERP integration).
API endpoint reference
1
POST /api/generate — single turn completion
Send a prompt, get a response. No conversation history. Best for single-shot tasks like summarization, classification, or code generation.
2
POST /api/chat — multi-turn conversation
Send full message history array. Maintains context across multiple turns. Required for chatbot interfaces and conversational workflows.
3
POST /api/embeddings — generate vector embeddings
Convert text to numerical vectors for RAG pipelines. Uses embedding models like nomic-embed-text. Returns a float array. Used in Part 05.
4
GET /api/tags — list available models
Returns all models currently downloaded and available on your Ollama server. Use this in your app's model selection dropdown.
5
POST /api/pull — pull a model via API
Programmatically trigger a model download. Useful for admin dashboards that manage the AI server remotely.
6
DELETE /api/delete — remove a model
Delete a model from disk via API. Use in admin tools for model lifecycle management.
API call 1 — Simple generate (curl)
Terminal / PowerShell — your first API call
# Linux/Mac terminal:
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"prompt": "Summarize the top 3 benefits of using private AI for a company",
"stream": false
}'
# Response (formatted for readability):
{
"model": "llama3.1:8b",
"created_at": "2024-01-15T08:23:11Z",
"response": "1. Data Privacy: All processing happens locally...\n
2. Cost Efficiency: No per-query API fees...\n
3. Full Control: Customize system prompts...",
"done": true,
"total_duration": 8234567890,
"load_duration": 1234567,
"prompt_eval_count": 18,
"eval_count": 127,
"eval_duration": 7890123456
}
# Windows PowerShell equivalent:
Invoke-RestMethod -Uri "http://localhost:11434/api/generate" `
-Method POST `
-ContentType "application/json" `
-Body '{"model":"llama3.1:8b","prompt":"Hello","stream":false}'
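The eval_count and eval_duration fields let you compute real throughput: durations are reported in nanoseconds. A small helper, using the numbers from the sample response above:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports durations in nanoseconds; convert to tokens/sec."""
    return eval_count / (eval_duration_ns / 1_000_000_000)

# Values taken from the sample /api/generate response above:
print(round(tokens_per_second(127, 7_890_123_456), 1))  # 16.1 tokens/sec
```

This is the number to watch when comparing CPU vs GPU performance on your own hardware.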
API call 2 — Chat with message history
Terminal — multi-turn chat API call
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"stream": false,
"messages": [
{
"role": "system",
"content": "You are an ERP assistant for Dhaka Traders Ltd.
Always respond professionally.
Answer only work-related questions."
},
{
"role": "user",
"content": "What should I check first when a customer
complains about a delayed delivery?"
}
]
}'
# NOTE: the "content" strings above are wrapped for readability only.
# Real JSON strings cannot contain raw line breaks; in an actual command,
# keep each "content" value on one line (or escape breaks as \n).
# Response (wrapped for readability):
{
"model": "llama3.1:8b",
"message": {
"role": "assistant",
"content": "When investigating a delayed delivery complaint,
check these items in order:\n
1. Verify the order status in the ERP system...\n
2. Check the dispatch date vs promised delivery date...\n
3. Contact the logistics partner for tracking update...\n
4. Document everything in the customer interaction log."
},
"done": true
}
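Note that the API itself is stateless: your client owns the conversation and must resend the full messages array on every call, appending each assistant reply before the next user turn. A minimal sketch of that loop (send_chat is a stub standing in for the HTTP call, so the pattern runs without a live server):

```python
def send_chat(messages):
    """Stub for POST /api/chat. A real client would POST
    {"model": ..., "messages": messages} and read message.content."""
    return {"role": "assistant", "content": f"(reply to: {messages[-1]['content']})"}

history = [{"role": "system", "content": "You are an ERP assistant."}]

for user_text in ["What is our return policy?", "And for damaged goods?"]:
    history.append({"role": "user", "content": user_text})
    reply = send_chat(history)  # the FULL history goes with every request
    history.append(reply)       # append the reply so the next turn has context

print(len(history))  # 5: system prompt + 2 user turns + 2 assistant replies
```

Forgetting to append the assistant reply is the most common cause of a chatbot that "loses its memory" between turns.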
API call 3 — Streaming response (real-time token display)
Streaming API — tokens arrive one by one as generated
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"stream": true,
"messages": [
{"role":"user","content":"Explain what RAG means in 2 sentences"}
]
}'
# With stream:true, you receive multiple JSON objects:
# Each line carries the next fragment of the response as it is generated:
{"model":"llama3.1:8b","message":{"role":"assistant","content":"RAG"},"done":false}
{"model":"llama3.1:8b","message":{"role":"assistant","content":" stands"},"done":false}
{"model":"llama3.1:8b","message":{"role":"assistant","content":" for"},"done":false}
{"model":"llama3.1:8b","message":{"role":"assistant","content":" Retrieval"},"done":false}
...
{"model":"llama3.1:8b","message":{"role":"assistant","content":""},"done":true,
"total_duration":4567890123,"eval_count":89}
# This streaming pattern is what makes responses feel "live"
# in browser-based chat interfaces like Open WebUI
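A client consumes this stream by reading one JSON object per line and concatenating the message.content fragments until done is true. A sketch, fed abbreviated sample lines instead of a live connection:

```python
import json

def collect_stream(lines):
    """Accumulate message.content from newline-delimited JSON chunks."""
    text = ""
    for line in lines:
        chunk = json.loads(line)
        text += chunk["message"]["content"]
        if chunk["done"]:  # the final chunk also carries timing stats
            break
    return text

# Abbreviated versions of the streamed lines shown above:
sample = [
    '{"message":{"role":"assistant","content":"RAG"},"done":false}',
    '{"message":{"role":"assistant","content":" stands"},"done":false}',
    '{"message":{"role":"assistant","content":" for"},"done":false}',
    '{"message":{"role":"assistant","content":""},"done":true}',
]
print(collect_stream(sample))  # RAG stands for
```

In a real client you would print each fragment as it arrives instead of waiting for the full string; that is what produces the "live typing" effect.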
API call 4 — Generate embeddings (for RAG)
Embeddings API — used in Part 05 RAG pipeline
curl http://localhost:11434/api/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text",
"prompt": "Customer Karim placed an order for 500 units of Product A"
}'
# Response — a vector of 768 floating point numbers:
{
"embedding": [
0.0023064255, -0.014368402, -0.00983825,
0.035714887, -0.024517306, 0.05992762,
... (768 total values)
]
}
# This vector mathematically represents the meaning of the sentence
# Similar sentences produce mathematically similar vectors
# This is the foundation of the RAG system in Part 05
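What makes these vectors useful is that "similar meaning" becomes a measurable number. The standard comparison is cosine similarity, shown here in plain Python with tiny 3-dimensional vectors standing in for real 768-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """1.0 = same direction (same meaning), ~0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.1, 0.8, 0.3]   # embedding of sentence A
v2 = [0.1, 0.8, 0.3]   # identical meaning
v3 = [0.9, -0.2, 0.1]  # unrelated meaning
print(round(cosine_similarity(v1, v2), 3))                    # 1.0
print(cosine_similarity(v1, v3) < cosine_similarity(v1, v2))  # True
```

A RAG pipeline runs exactly this comparison between your query's embedding and every stored document chunk, then feeds the closest chunks to the chat model.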
API call 5 — List available models
Tags API — build a model selector in your application
curl http://localhost:11434/api/tags
# Response:
{
"models": [
{
"name": "llama3.1:8b",
"modified_at": "2024-01-15T06:00:00Z",
"size": 4920000000,
"digest": "sha256:42182419e950...",
"details": {
"parameter_size": "8.0B",
"quantization_level": "Q4_K_M",
"family": "llama",
"format": "gguf"
}
},
{
"name": "qwen2.5:7b",
"modified_at": "2024-01-15T07:00:00Z",
"size": 4400000000,
...
}
]
}
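For a model-selection dropdown, your application only needs the name (and perhaps the size) of each entry. Parsing that response in Python, with a static sample standing in for the live call:

```python
import json

# Static stand-in for the GET /api/tags response shown above
sample_response = json.loads("""
{"models": [
  {"name": "llama3.1:8b", "size": 4920000000},
  {"name": "qwen2.5:7b", "size": 4400000000}
]}
""")

def model_choices(tags_response):
    """Return (name, size-in-GB) pairs for a selection dropdown."""
    return [(m["name"], round(m["size"] / 1e9, 1))
            for m in tags_response["models"]]

print(model_choices(sample_response))
# [('llama3.1:8b', 4.9), ('qwen2.5:7b', 4.4)]
```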
Section 10 — Environment variables: Complete configuration reference
Ollama's behavior is controlled through environment variables. These are critical for production deployment. Every company deployment should configure these deliberately — default values are for development only.
OLLAMA_HOST
The IP address and port Ollama listens on. Use localhost for single-machine setups. Use 0.0.0.0 to allow LAN access from other computers.
Default: 127.0.0.1:11434
OLLAMA_MODELS
Directory where model files are stored. Change this to a large disk if your system drive is small. Models are 2–10GB each.
Default: ~/.ollama/models
OLLAMA_KEEP_ALIVE
How long a model stays loaded in memory after the last request. Set to "0" to unload immediately. Set to "-1" to keep forever. "5m" for 5 minutes.
Default: 5m
OLLAMA_NUM_PARALLEL
Maximum number of requests processed simultaneously. Higher = more concurrent users but more VRAM. On 24GB GPU, setting 2–3 is reasonable.
Default: 1
OLLAMA_MAX_LOADED_MODELS
Maximum number of models loaded in memory at once. Set to 1 if VRAM is limited. Set to 2–3 on high-VRAM servers to serve multiple models simultaneously.
Default: 1
OLLAMA_GPU_OVERHEAD
VRAM overhead reserved for the OS and other processes. Increase if getting out-of-memory errors. Measured in bytes.
Default: 0
CUDA_VISIBLE_DEVICES
Specify which GPU(s) Ollama uses (for multi-GPU servers). "0" = first GPU only. "0,1" = use both. "-1" = force CPU mode.
Default: all GPUs
OLLAMA_DEBUG
Enable verbose debug logging. Set to "1" to see detailed inference logs. Useful when troubleshooting slow performance or GPU detection issues.
Default: 0 (disabled)
Recommended production configuration
Production environment variable configuration — company server
# Linux — /etc/systemd/system/ollama.service.d/override.conf
# (Windows: set the same variables, without the [Service] wrapper, under
#  System Properties → Environment Variables, then Restart-Service Ollama)
[Service]
# Bind to all interfaces so LAN clients can reach the server
Environment="OLLAMA_HOST=0.0.0.0:11434"
# Store models on a dedicated data drive
Environment="OLLAMA_MODELS=/data/ollama/models"
# Keep model in memory for 10 minutes after last use
# Prevents slow reload between queries from different users
Environment="OLLAMA_KEEP_ALIVE=10m"
# Allow 2 parallel requests (good for 24GB VRAM)
Environment="OLLAMA_NUM_PARALLEL=2"
# Only allow 1 model loaded at a time (for VRAM conservation)
Environment="OLLAMA_MAX_LOADED_MODELS=1"
# Reserve 512MB VRAM for OS (prevents out-of-memory crashes)
Environment="OLLAMA_GPU_OVERHEAD=536870912"
# Apply changes:
sudo systemctl daemon-reload
sudo systemctl restart ollama
systemctl status ollama
Section 11 — Enabling LAN access: Let your entire team use the AI server
By default, Ollama only accepts connections from the same machine (localhost). To allow employees on your company network to send requests to the AI server, you need to bind Ollama to the server's network interface.
Security warning before enabling LAN access
Exposing Ollama on your LAN without authentication means ANY device on your network can send unlimited requests to your AI server. For initial testing this is fine. For production, always put NGINX with authentication in front of Ollama. We cover this fully in Part 10 (Security). For now, ensure your Ollama port is not exposed to the internet — only your internal company network.
Enable LAN access — Linux systemd
# Edit Ollama service environment:
sudo systemctl edit ollama
# Add this line in the [Service] section:
Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Get your server's LAN IP address:
ip addr show | grep "inet " | grep -v 127.0.0.1
# Example: inet 192.168.1.100/24
# Test from another machine on the same network:
curl http://192.168.1.100:11434
# Expected: Ollama is running
# Open firewall port (Ubuntu with ufw):
sudo ufw allow from 192.168.1.0/24 to any port 11434
sudo ufw status
Enable LAN access — Windows
# Set environment variable (as Administrator):
[System.Environment]::SetEnvironmentVariable(
"OLLAMA_HOST", "0.0.0.0:11434", "Machine")
# Restart Ollama service:
Restart-Service Ollama
# Allow through Windows Firewall:
New-NetFirewallRule -DisplayName "Ollama LAN Access" `
-Direction Inbound `
-Protocol TCP `
-LocalPort 11434 `
-RemoteAddress LocalSubnet `
-Action Allow
# Get your Windows server's IP:
ipconfig | findstr "IPv4"
# IPv4 Address: 192.168.1.100
# Test from another machine:
# Open browser → http://192.168.1.100:11434
# Should show: Ollama is running
Section 12 — Troubleshooting: 10 common errors and how to fix them
Error 1 — 'ollama' is not recognized as a command
Cause: Ollama binary not in system PATH. Usually means installer did not complete or PATH was not refreshed.
Fix (Windows): Close all terminal windows and open a NEW PowerShell window. PATH changes only apply to new sessions. If still failing: System Properties → Environment Variables → verify C:\Users\[Name]\AppData\Local\Programs\Ollama is in your PATH.
Fix (Linux): Run source ~/.bashrc or log out and back in. Then which ollama should return /usr/local/bin/ollama.
Error 2 — Model runs on CPU instead of GPU (very slow)
Cause: NVIDIA drivers not installed or CUDA not detected by Ollama.
Diagnose: Run ollama run llama3.1:8b, then in another terminal run nvidia-smi and watch GPU memory — if memory does not increase, GPU is not being used.
Fix: Install/update NVIDIA drivers, then restart the Ollama service. On Linux: sudo apt install nvidia-driver-535 -y && sudo reboot. After reboot, verify with nvidia-smi; ollama ps should then show "GPU" in the PROCESSOR column.
Error 3 — "out of memory" or model crashes immediately
Cause: Model requires more VRAM than available on your GPU.
Fix Option A: Switch to a smaller model. If 7B fails, try 2B or 3B. Run ollama pull gemma2:2b as a test.
Fix Option B: Force CPU mode temporarily by setting CUDA_VISIBLE_DEVICES="" on the Ollama server process — note that inference happens in the server, so setting the variable only on the client command has no effect when Ollama runs as a service (on systemd, add it via sudo systemctl edit ollama and restart). Slower, but works on any machine.
Fix Option C: Use a more aggressively quantized version: ollama pull llama3.1:8b-instruct-q4_0 instead of the default Q4_K_M.
Error 4 — API returns "connection refused" at port 11434
Cause: Ollama service is not running.
Fix (Windows): Start-Service Ollama in PowerShell as Administrator. Or search "Ollama" in Start Menu and click the app.
Fix (Linux): sudo systemctl start ollama. Then systemctl status ollama to verify. If it fails to start, check logs: journalctl -u ollama -f.
Error 5 — Model pull fails or download stalls
Cause: Network issue, insufficient disk space, or Ollama registry timeout.
Fix: Check disk space first: df -h (Linux) or Get-PSDrive C (Windows). Need at least 2x model size free during download. If stalled, press Ctrl+C and re-run ollama pull — it resumes from where it stopped. Check DNS: try ping ollama.com to verify connectivity.
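If you pull models from scripts, the disk-space check is easy to automate. A small Python sketch using the standard library (the function names are illustrative; the 2x free-space rule is the guideline stated above):

```python
import shutil

def free_gb(path: str = "/") -> float:
    """Free space on the drive holding `path`, in GB."""
    return shutil.disk_usage(path).free / 1e9

def enough_space_for_model(model_gb: float, path: str = "/") -> bool:
    """Apply the rule of thumb: keep ~2x the model size free during download."""
    return free_gb(path) >= 2 * model_gb

if __name__ == "__main__":
    # Example: a ~4.9 GB llama3.1:8b download needs ~10 GB free.
    print(enough_space_for_model(4.9))
```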
Error 6 — Responses are extremely slow (under 1 token/sec)
Cause: Model is running on CPU instead of GPU, or model is too large for available VRAM and is partially swapping to RAM.
Diagnose: Run ollama ps during inference — check PROCESSOR column. Should say "GPU". If it says "CPU", the GPU is not being used. Check nvidia-smi for VRAM usage.
Fix: Switch to smaller model, update GPU drivers, or add OLLAMA_GPU_OVERHEAD=536870912 environment variable to reserve VRAM for stability.
Error 7 — "context length exceeded" in API response
Cause: Your prompt + conversation history exceeds the model's context window.
Fix: Either reduce the amount of text in your prompt, truncate conversation history in multi-turn chats, or switch to a model with a larger context window (LLaMA 3.1 8B supports 128K tokens). In the API, set "num_ctx": 4096 explicitly in your request to avoid surprises.
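A pragmatic fix for multi-turn chats is to drop the oldest messages until the estimated prompt size fits the context window. Here is a sketch using a rough 4-characters-per-token heuristic; the function names and the heuristic are illustrative, not part of Ollama, and a real tokenizer will give more accurate counts:

```python
def approx_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def trim_history(messages, num_ctx=4096, reserve=512):
    """Keep the most recent messages whose estimated token count fits in
    num_ctx, leaving `reserve` tokens of room for the model's reply."""
    budget = num_ctx - reserve
    kept = []
    for msg in reversed(messages):       # walk newest-to-oldest
        cost = approx_tokens(msg["content"])
        if budget - cost < 0:            # this message would overflow: stop
            break
        budget -= cost
        kept.append(msg)
    return list(reversed(kept))          # restore chronological order
```

Call trim_history on your message list before every /api/chat request; if you use a system prompt that must always survive, re-prepend it after trimming.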
Error 8 — LAN clients cannot reach the Ollama server
Cause: OLLAMA_HOST not set to 0.0.0.0, or firewall blocking port 11434.
Fix: Verify the OLLAMA_HOST environment variable is set: systemctl show ollama | grep Environment. Check the firewall with sudo ufw status; port 11434 must be allowed from your LAN subnet. Test from the server itself first (curl http://localhost:11434), then from a client using the server's LAN IP (curl http://192.168.1.100:11434).
Error 9 — Bengali text appears garbled or as boxes
Cause: Terminal or application does not support Unicode UTF-8 encoding.
Fix (Windows terminal): Run chcp 65001 in Command Prompt to switch to UTF-8. Or use Windows Terminal app instead of CMD — it handles Unicode natively. In PowerShell: [Console]::OutputEncoding = [System.Text.Encoding]::UTF8.
Fix (API/app): Ensure your HTTP client sends Accept-Charset: utf-8 and your response parser handles UTF-8. In C#, use Encoding.UTF8 when reading the response stream.
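There is a second, subtler trap when consuming Ollama's streaming API: a chunk boundary can fall in the middle of a multi-byte Bengali character, so decoding each chunk independently garbles the text even though the encoding is correct. An incremental decoder handles split characters; this is a general-purpose sketch, not Ollama-specific code:

```python
import codecs

def decode_utf8_stream(chunks):
    """Decode byte chunks incrementally so multi-byte UTF-8 characters
    split across chunk boundaries are reassembled correctly."""
    dec = codecs.getincrementaldecoder("utf-8")()
    out = []
    for chunk in chunks:
        out.append(dec.decode(chunk))    # buffers incomplete trailing bytes
    out.append(dec.decode(b"", final=True))
    return "".join(out)
```

In C#, the equivalent is wrapping the response stream in a StreamReader with Encoding.UTF8, which performs the same incremental decoding for you.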
Error 10 — Model loaded but not using full GPU capacity
Cause: Model is partially offloaded to CPU because VRAM is insufficient for the full model at current quantization.
Diagnose: Run ollama ps while the model is loaded and check the PROCESSOR column. A split such as "40%/60% CPU/GPU" means the model does not fully fit in VRAM and is partially offloaded to CPU.
Fix: Use a more aggressively quantized version (Q4_0 instead of Q8_0). Or switch to a smaller model that fits fully in VRAM. A fully GPU-resident model is 3–10x faster than a split GPU/CPU model.
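You can predict whether a model will fit fully in VRAM with back-of-envelope arithmetic: the weights take roughly (parameters x bits-per-weight) / 8 bytes, plus runtime overhead for the KV cache and buffers. The bits-per-weight figures below (about 4.5 for Q4_K_M, about 8.5 for Q8_0) and the 1.5 GB overhead are rough working assumptions, not exact numbers:

```python
def est_vram_gb(params_b: float, bits_per_weight: float, overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate in GB: quantized weights plus runtime overhead.

    params_b        -- parameter count in billions (e.g. 8 for an 8B model)
    bits_per_weight -- effective bits per weight (~4.5 for Q4_K_M, ~8.5 for Q8_0)
    overhead_gb     -- assumed KV-cache/runtime headroom (a guess, not measured)
    """
    weights_gb = params_b * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)
```

By this estimate an 8B model at Q4_K_M needs about 6 GB of VRAM while Q8_0 needs about 10 GB, which is why dropping to Q4 often turns a split GPU/CPU model into a fully GPU-resident one on a 8 GB card.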
Section 13 — Do's, Don'ts, and limitations
Do — best practices
- Start with gemma2:2b to test your setup
- Upgrade to llama3.1:8b once hardware confirmed
- Pull nomic-embed-text now for later RAG use
- Create a Modelfile for your company persona
- Use Q4_K_M quantization as your default
- Monitor GPU memory with nvidia-smi
- Set OLLAMA_KEEP_ALIVE for production servers
- Store models on a dedicated large drive
- Test API with curl before writing application code
- Document which model version is deployed
Don't — avoid these mistakes
- Don't run 7B+ models on 8GB RAM without GPU
- Don't expose port 11434 to the internet directly
- Don't ignore GPU driver setup before install
- Don't use Q2 quantization for production tasks
- Don't use the same model for all use cases
- Don't skip testing the API before ERP integration
- Don't set OLLAMA_NUM_PARALLEL too high (VRAM risk)
- Don't store models on the OS system drive
- Don't assume AI output is always correct — validate
- Don't keep two large models loaded simultaneously
Honest limitations of local AI at this stage
What local AI cannot do well (yet)
// Speed comparison — local 7B vs GPT-4 (cloud)
Local LLaMA 3.1 8B on RTX 4090: ~80-120 tokens/sec
Local LLaMA 3.1 8B on RTX 3060: ~30-50 tokens/sec
Local LLaMA 3.1 8B on CPU only: ~3-8 tokens/sec
GPT-4 Turbo (cloud): ~50-100 tokens/sec
// Quality comparison — approximate task accuracy
Complex reasoning: Local 8B = ~75% of GPT-4 quality
Simple Q&A: Local 8B = ~90% of GPT-4 quality
Code generation: Local 8B = ~80% of GPT-4 quality
Bengali quality: Local 8B = ~65% of GPT-4 quality (improving)
Long documents: Local 8B = ~70% of GPT-4 quality
// What local AI is NOT suitable for (at this model size):
// - Real-time complex financial modeling
// - Advanced legal document analysis
// - Medical diagnosis support
// - Tasks requiring internet search (no built-in web access)
// - Vision/image analysis (standard LLaMA — use LLaVA for images)