Your In-Depth Guide to Building a Secure & Scalable Internal AI Platform (June 2025 Edition)

June 5, 2025 · 22 min read

The Vision: Your Private AI Powerhouse#

The desire for bespoke, secure, and scalable AI solutions within organizations is no longer a futuristic dream but a present-day necessity. This guide provides a detailed roadmap to construct a robust internal AI platform. Doing so grants your organization the transformative power of modern Large Language Models (LLMs) while ensuring complete data sovereignty, security, and control over your AI destiny.

Our goal is to build a company-controlled AI ecosystem that offers:

  • Secure, internal access to powerful LLMs.
  • An intuitive chat interface for employees.
  • The ability to "chat with your documents" using Retrieval Augmented Generation (RAG).
  • A solid foundation for developing and deploying custom fine-tuned AI models.

Who is this guide for? This guide is primarily for IT teams, DevOps engineers, and system administrators who have experience with server administration, Linux, networking, and ideally, some familiarity with containerization and orchestration concepts. While we aim for clarity, a foundational understanding of these areas will be highly beneficial.

Core Technology Stack & Rationale:

  • Foundation: Dedicated Server(s) with powerful GPU(s) – Essential for LLM performance.
  • Orchestration: Kubernetes (K3s recommended) – For robust, scalable, and resilient management of our AI services. K3s offers a lightweight yet fully compliant Kubernetes experience, ideal for on-premise and quicker setup.
  • Containerization: Docker – The industry standard for packaging applications and their dependencies.
  • Secure Access & Ingress: VPN (e.g., WireGuard) & Nginx Proxy Manager (NPM) – VPN for secure network-level access, NPM for user-friendly HTTPS subdomains and reverse proxying to internal services.
  • AI Model Serving: Ollama – A user-friendly tool for running open-source LLMs locally, with excellent GPU support.
  • Chat Interface: Open WebUI – A feature-rich, self-hosted web UI that integrates seamlessly with Ollama.
  • RAG Backend: Vector Database (e.g., Qdrant, ChromaDB) – To store document embeddings efficiently for fast retrieval in RAG systems.
  • Efficient Fine-Tuning: Unsloth – An innovative library to significantly speed up LLM fine-tuning and reduce memory usage, making custom model development more accessible.

Disclaimer: This is an advanced blueprint. Specific commands, configurations, and software versions will evolve. Always consult the official documentation for each component and adapt instructions to your specific environment and security policies.

Phase 1: Laying the Foundational Infrastructure#

A robust foundation is critical. This phase covers server preparation, OS setup, and core software installation.

1.1. Server(s) Preparation: The Bedrock of Your AI#

  • Hardware Selection – Don't Skimp Here!
    • CPU: Modern multi-core Intel Xeon or AMD EPYC/Ryzen. Needed for general system operations and supporting AI workloads.
    • RAM: Minimum 64GB, but 128GB+ is strongly recommended. LLMs can be memory-hungry, especially when running multiple models or handling large RAG databases. More RAM also helps if GPU VRAM is a constraint.
    • GPU (Graphics Processing Unit): The single most critical component for LLM performance.
      • NVIDIA: The preferred choice due to mature CUDA drivers and broad software support (Ollama, PyTorch, Unsloth, etc.). Aim for series like RTX 30xx/40xx, or professional A-series (A4000, A6000) / H-series (H100, L40S) for larger budgets.
      • VRAM is King: Ensure ample VRAM (Video RAM) on your GPUs. 16GB is a bare minimum for smaller models; 24GB-48GB+ per GPU is ideal for comfortably running larger, more capable models and for fine-tuning.
      • AMD: ROCm support is maturing but can require more configuration. Carefully check compatibility with Ollama, PyTorch, and Kubernetes device plugins.
    • Storage: Fast NVMe SSDs are crucial for OS responsiveness, quick container image loading, and fast model access. 1TB+ is a good starting point, but factor in space for multiple large models (some are 50GB+), RAG indexes, and system logs.
    • Networking: 1Gbps NIC is a minimum. For multi-node clusters or heavy usage, 10Gbps+ is recommended to avoid bottlenecks.
    • Initial Strategy: Starting with one powerful server can simplify initial setup. You can architect for multi-node expansion later.
  • Operating System:
    • A stable Linux distribution is key. Ubuntu Server 22.04 LTS or 24.04 LTS are excellent choices due to wide community support and compatibility. RHEL or Debian are also viable.
  • Initial Server Setup & Security Hardening:
    • Install your chosen OS.
    • Update all packages immediately: sudo apt update && sudo apt upgrade -y.
    • Configure static IP addresses for your server(s) for stable network identity.
    • Implement a firewall (e.g., ufw on Ubuntu). Only allow necessary ports: SSH (non-standard port recommended), HTTP/HTTPS for NPM, your VPN port, and Kubernetes-specific ports as required by your setup.
    • Secure SSH: Disable root login, enforce key-based authentication, and consider tools like Fail2ban.
  • NVIDIA Drivers & Toolkit (Crucial Prerequisite):
    • If using NVIDIA GPUs, install the proprietary NVIDIA drivers and the nvidia-container-toolkit before installing Docker or Kubernetes. This toolkit allows containers to access NVIDIA GPUs. Follow NVIDIA's official documentation meticulously.
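  • Quick Reference (Firewall & GPU Prerequisites): A hedged Ubuntu sketch of the hardening and NVIDIA steps above. The SSH port (2222) is an assumption, the VPN port matches the WireGuard setup in section 1.4, and the toolkit install assumes NVIDIA's apt repository has already been configured per their documentation.
    # Firewall baseline (adjust ports to your environment)
    sudo ufw default deny incoming
    sudo ufw default allow outgoing
    sudo ufw allow 2222/tcp    # SSH on a non-standard port
    sudo ufw allow 80,443/tcp  # HTTP/HTTPS for Nginx Proxy Manager
    sudo ufw allow 51820/udp   # WireGuard VPN
    sudo ufw allow 6443/tcp    # K3s API server
    sudo ufw enable

    # NVIDIA driver and container toolkit (reboot after the driver install)
    sudo ubuntu-drivers autoinstall
    sudo apt-get install -y nvidia-container-toolkit
    nvidia-smi   # Should list your GPU(s) once the driver is active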

1.2. Docker Installation: The Container Engine#

Docker packages our AI services into portable containers.

  • Install Docker Engine using the official guide: https://docs.docker.com/engine/install/
    # Example for Ubuntu (always verify with official Docker documentation)
    sudo apt-get update
    sudo apt-get install -y apt-transport-https ca-certificates curl gnupg software-properties-common
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
    echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
    sudo apt-get update
    sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
    sudo systemctl start docker
    sudo systemctl enable docker
    # Allow your user to run Docker commands without sudo (requires logout/login or 'newgrp docker')
    sudo usermod -aG docker $USER
    newgrp docker # Activates the group change for the current session

    Tip: After installation, verify Docker is running with docker --version and sudo systemctl status docker.
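  • Docker NVIDIA Runtime Configuration: Since K3s will reuse Docker as its container runtime in the next step, Docker itself must be able to hand GPUs to containers. A minimal sketch, assuming the driver and nvidia-container-toolkit from section 1.1 are installed (the CUDA image tag is an example; pick one compatible with your driver):
    # Register the NVIDIA runtime in /etc/docker/daemon.json and restart Docker
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
    # Smoke test: the container should print the same GPU table as the host's nvidia-smi
    docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi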

1.3. Kubernetes Cluster Setup (using K3s)#

K3s is a lightweight, certified Kubernetes distribution ideal for on-premise setups, edge computing, and development. It's simpler to install and manage than full K8s.

  • Why K3s? It bundles essential components, has a small footprint, and offers a straightforward path to a functional cluster.
  • Installing K3s (Single Master Node): We'll configure K3s to use Docker as its container runtime (ensure Docker is already NVIDIA-aware if you have GPUs) and disable K3s's built-in Traefik ingress, as we'll use Nginx Proxy Manager.
    # Ensure NVIDIA drivers and nvidia-container-toolkit are installed and Docker is configured for NVIDIA runtime if applicable.
    curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--docker --disable=traefik" sh -s - --write-kubeconfig-mode 644
    
  • Configure kubectl Access (Your Kubernetes Command-Line Tool):
    sudo mkdir -p $HOME/.kube
    sudo cp /etc/rancher/k3s/k3s.yaml $HOME/.kube/config
    sudo chown $(id -u):$(id -g) $HOME/.kube/config
    export KUBECONFIG=$HOME/.kube/config # Add this line to your shell profile (e.g., .bashrc, .zshrc)
    Verify access: kubectl get nodes -o wide (should show your master node).
  • Enabling NVIDIA GPU Support in Kubernetes: The NVIDIA GPU Operator automates the management of NVIDIA GPU resources in Kubernetes.
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
    # This command installs the operator which then deploys necessary components like device plugins.
    helm install gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator --create-namespace --wait
    Verify GPU availability to Kubernetes: kubectl describe nodes | grep nvidia.com/gpu. You should see allocatable GPU resources.
  • Persistent Storage:
    • K3s includes a Local Path Provisioner by default. This uses directories on the host node for persistent storage, which is fine for a single-node cluster.
    • For multi-node clusters or production setups: You'll need a robust network storage solution like NFS, Ceph, Rook, or a cloud-native option like Longhorn (https://longhorn.io/).
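  • Optional GPU Smoke Test: Once the GPU Operator reports allocatable GPUs, a throwaway pod confirms that scheduling and the device plugin work end to end. A minimal sketch (the CUDA image tag is an example):
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-smoke-test
    spec:
      restartPolicy: Never
      containers:
      - name: nvidia-smi
        image: nvidia/cuda:12.4.1-base-ubuntu22.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1
    Apply it with kubectl apply -f gpu-smoke-test.yaml, check the output with kubectl logs gpu-smoke-test, then delete the pod to free the GPU.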

1.4. VPN Setup: Secure Network Access#

A VPN (Virtual Private Network) is essential for secure administrative access and potentially for users to connect to your internal AI platform as if they were on the local network.

  • Options:
    • WireGuard: Modern, fast, and generally simpler to configure. Consider using wg-easy (a Docker image providing a web UI for WireGuard).
    • OpenVPN: A mature and robust option, though potentially more complex.
  • Setup Steps:
    1. Install VPN server software on a dedicated VM, a container, or an infrastructure server (ensure firewall rules are updated).
    2. Configure server settings (IP ranges, DNS for clients) and generate client profiles/keys.
    3. Configure your external firewall/router to forward the VPN port (e.g., UDP 51820 for WireGuard) to your VPN server.
    4. Ensure VPN clients are assigned IP addresses that can route to your Kubernetes nodes and internal service IPs.
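  • Example WireGuard Server Config: If you choose WireGuard, the server side reduces to one interface file plus a [Peer] block per client. A minimal sketch, assuming 10.8.0.0/24 as the VPN client range (generate real keys with wg genkey / wg pubkey):
    # /etc/wireguard/wg0.conf (server side)
    [Interface]
    Address    = 10.8.0.1/24
    ListenPort = 51820
    PrivateKey = <server-private-key>

    [Peer]
    # One block per client device
    PublicKey  = <client-public-key>
    AllowedIPs = 10.8.0.2/32
    Bring it up with sudo systemctl enable --now wg-quick@wg0, and remember that clients still need AllowedIPs on their side covering your internal service networks so traffic is routed through the tunnel.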

Phase 2: Deploying Core AI Services on Kubernetes#

We'll deploy Nginx Proxy Manager, Ollama, and Open WebUI. It's good practice to use separate Kubernetes namespaces for better organization and resource management.

General Kubernetes Workflow for Each Service:

  1. Create a Namespace: kubectl create namespace <namespace-name>
  2. Define a PersistentVolumeClaim (PVC) for any stateful data.
  3. Define a Deployment to manage the application pods.
  4. Define a Service to provide a stable internal network endpoint for the Deployment.

2.1. Nginx Proxy Manager (NPM): Your Secure Gateway#

NPM will handle SSL/TLS termination (HTTPS) and act as a reverse proxy, routing requests from user-friendly subdomains to the correct internal services.

  • Create Namespace: kubectl create namespace npm
  • NPM Persistent Storage (npm-pvc.yaml): Stores NPM configuration and Let's Encrypt certificates.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: npm-data-pvc
      namespace: npm
    spec:
      accessModes: ["ReadWriteOnce"] # Suitable for a single NPM pod
      resources:
        requests:
          storage: 10Gi # Adjust as needed
    Apply: kubectl apply -f npm-pvc.yaml -n npm
  • NPM Deployment & Service (npm-deployment.yaml):
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-proxy-manager
      namespace: npm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: npm
      template:
        metadata:
          labels:
            app: npm
        spec:
          containers:
          - name: npm-app
            image: 'jc21/nginx-proxy-manager:latest' # Consider pinning to a specific version tag for stability
            ports:
            - containerPort: 80  # HTTP
            - containerPort: 81  # Admin UI
            - containerPort: 443 # HTTPS
            volumeMounts:
            - name: npm-data
              mountPath: /data
              subPath: data          # Keep NPM config and certificates on one PVC,
            - name: npm-data
              mountPath: /etc/letsencrypt
              subPath: letsencrypt   # separated into subdirectories via subPath.
          volumes:
          - name: npm-data
            persistentVolumeClaim:
              claimName: npm-data-pvc
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: npm-service
      namespace: npm
    spec:
      type: LoadBalancer # On-prem K3s can satisfy this via its built-in ServiceLB (node IPs) or via MetalLB (https://metallb.universe.tf/) for a dedicated virtual IP; a MetalLB sketch appears at the end of this section. Alternatively, use NodePort.
      selector:
        app: npm
      ports:
      - name: http
        port: 80
        targetPort: 80
      - name: https
        port: 443
        targetPort: 443
      - name: admin
        port: 81 # Admin UI; consider restricting access to this port (e.g., VPN only)
        targetPort: 81
    Apply: kubectl apply -f npm-deployment.yaml -n npm
  • Initial NPM Setup:
    1. Access the NPM admin UI. If using LoadBalancer, find its External IP (kubectl get svc -n npm npm-service). If NodePort, use <NodeIP>:<NodePortFor81>.
    2. Default login: admin@example.com / changeme. Change this immediately!
  • DNS Configuration: In your internal (or public, if applicable) DNS, create A records for your desired subdomains (e.g., chat.internal.yourcompany.com) pointing to the external IP of the npm-service, or CNAME records pointing to a hostname that resolves to that IP.

    Tip: DNS changes can take time to propagate.
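  • MetalLB Address Pool (if using LoadBalancer): A minimal Layer-2 sketch for the MetalLB option referenced in the npm-service manifest above. It assumes MetalLB is already installed in the metallb-system namespace and that 192.168.1.240-250 is an unused range on your LAN:
    apiVersion: metallb.io/v1beta1
    kind: IPAddressPool
    metadata:
      name: npm-pool
      namespace: metallb-system
    spec:
      addresses:
      - 192.168.1.240-192.168.1.250
    ---
    apiVersion: metallb.io/v1beta1
    kind: L2Advertisement
    metadata:
      name: npm-l2
      namespace: metallb-system
    spec:
      ipAddressPools:
      - npm-pool
    After applying, kubectl get svc -n npm npm-service should show an External IP from this pool, which is the address your DNS records point at.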

2.2. Ollama: Serving Your LLMs#

Ollama makes running open-source LLMs straightforward.

  • Create Namespace: kubectl create namespace ollama
  • Ollama Models Storage (ollama-pvc.yaml): Downloaded models are large.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: ollama-models-pvc
      namespace: ollama
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 200Gi # Adjust based on how many models and their sizes
    Apply: kubectl apply -f ollama-pvc.yaml -n ollama
  • Ollama Deployment & Service (ollama-deployment.yaml):
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ollama
      namespace: ollama
    spec:
      replicas: 1 # For multiple GPUs/nodes, scaling requires careful model management or shared storage for models.
      strategy:
        type: Recreate # Often preferred for GPU workloads to ensure clean release/acquisition of GPU resources during updates.
      selector:
        matchLabels:
          app: ollama
      template:
        metadata:
          labels:
            app: ollama
        spec:
          containers:
          - name: ollama
            image: ollama/ollama:latest # Pin to a specific version for production
            ports:
            - containerPort: 11434
            volumeMounts:
            - name: ollama-models
              mountPath: /root/.ollama # Ollama's default model storage path
            resources: # CRITICAL for GPU access
              limits:
                nvidia.com/gpu: 1 # Request 1 GPU from Kubernetes
              # requests: # Optionally, set requests equal to limits for guaranteed allocation
              #   nvidia.com/gpu: 1
            # Consider adding liveness and readiness probes for better health management.
          volumes:
          - name: ollama-models
            persistentVolumeClaim:
              claimName: ollama-models-pvc
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: ollama-service
      namespace: ollama
    spec:
      type: ClusterIP # This service is internal; Open WebUI will access it via this ClusterIP.
      selector:
        app: ollama
      ports:
      - port: 11434
        targetPort: 11434
    
    Apply: kubectl apply -f ollama-deployment.yaml -n ollama
  • Pulling Your First Model:
    # Get the name of your Ollama pod
    OLLAMA_POD=$(kubectl get pods -n ollama -l app=ollama -o jsonpath='{.items[0].metadata.name}')
    # Execute the pull command inside the pod
    kubectl exec -it -n ollama $OLLAMA_POD -- ollama pull llama3:8b # Example: Llama 3 8B
    # Check Ollama logs for pull progress and GPU detection:
    kubectl logs -n ollama -f $OLLAMA_POD
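  • Quick API Check: Before wiring up Open WebUI, confirm the Ollama API responds. A short sketch using a temporary port-forward and Ollama's REST endpoints:
    # Forward the internal service to your workstation
    kubectl port-forward -n ollama svc/ollama-service 11434:11434 &
    # List the models Ollama has pulled
    curl http://localhost:11434/api/tags
    # Ask the model a question (non-streaming for readable output)
    curl http://localhost:11434/api/generate -d '{"model": "llama3:8b", "prompt": "Say hello in one sentence.", "stream": false}'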

2.3. Open WebUI: The Chat Interface#

Open WebUI provides a user-friendly interface to interact with models served by Ollama.

  • Create Namespace: kubectl create namespace open-webui
  • Open WebUI Data Storage (open-webui-pvc.yaml): For configuration, user data, RAG history etc.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: open-webui-data-pvc
      namespace: open-webui
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 20Gi # Adjust as needed
    Apply: kubectl apply -f open-webui-pvc.yaml -n open-webui
  • Open WebUI Deployment & Service (open-webui-deployment.yaml):
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: open-webui
      namespace: open-webui
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: open-webui
      template:
        metadata:
          labels:
            app: open-webui
        spec:
          containers:
          - name: open-webui
            image: ghcr.io/open-webui/open-webui:main # Pin to a specific version tag for production
            ports:
            - containerPort: 8080 # Default port for Open WebUI
            env:
            # This is crucial: points Open WebUI to your internal Ollama Kubernetes service.
            - name: OLLAMA_BASE_URL
              value: "[http://ollama-service.ollama.svc.cluster.local:11434](http://ollama-service.ollama.svc.cluster.local:11434)"
            # Add other Open WebUI environment variables as needed (e.g., for enabling RAG, authentication)
            volumeMounts:
            - name: open-webui-data
              mountPath: /app/backend/data # Verify this path in Open WebUI's official documentation
          volumes:
          - name: open-webui-data
            persistentVolumeClaim:
              claimName: open-webui-data-pvc
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: open-webui-service
      namespace: open-webui
    spec:
      type: ClusterIP # Internal service; will be exposed via Nginx Proxy Manager.
      selector:
        app: open-webui
      ports:
      - port: 8080 # The port this service listens on
        targetPort: 8080 # The port the container is listening on
    
    Apply: kubectl apply -f open-webui-deployment.yaml -n open-webui
  • Expose Open WebUI via Nginx Proxy Manager:
    1. Navigate to your NPM admin UI.
    2. Click "Proxy Hosts" then "Add Proxy Host".
    3. Domain Names: Enter your chosen subdomain (e.g., chat.internal.yourcompany.com).
    4. Scheme: http (NPM handles SSL/TLS).
    5. Forward Hostname / IP: open-webui-service.open-webui.svc.cluster.local (This is the Kubernetes internal DNS name for the Open WebUI service).
    6. Forward Port: 8080 (the port specified in the Open WebUI service).
    7. Enable: "Block Common Exploits" and crucially "Websockets Support" (WebUI uses websockets).
    8. Go to the SSL tab:
      • Select "Request a new SSL certificate".
      • Enable "Force SSL" and "HTTP/2 Support".
      • If internal.yourcompany.com is a subdomain of a public domain you own and can manage via a supported DNS provider, use the "DNS Challenge" for Let's Encrypt. This is the most robust way to get certs for internal names.
    9. Save. You should now be able to access Open WebUI at https://chat.internal.yourcompany.com.

Phase 3: Unlocking Advanced AI Capabilities#

With the core services running, let's explore advanced features.

3.1. Retrieval Augmented Generation (RAG): Chat with Your Company's Documents#

RAG allows your LLM to access and cite information from your private documents, providing contextually relevant answers.

  • Vector Database on Kubernetes (e.g., Qdrant): Vector databases store "embeddings" – numerical representations of your document content – allowing for fast similarity searches.
    • Create Namespace: kubectl create namespace ai-db
    • Deploy using its official Helm chart (recommended for ease of management):
      helm repo add qdrant https://qdrant.github.io/qdrant-helm
      helm install qdrant qdrant/qdrant -n ai-db \
        --set persistence.enabled=true \
        --set persistence.size=50Gi # Adjust storage based on expected document volume
      This creates a service like qdrant.ai-db.svc.cluster.local.
  • Integrating RAG with Open WebUI:
    1. Open WebUI has built-in RAG support. Access its admin settings panel.
    2. Document Preprocessing & Ingestion:
      • Preprocessing is key! Clean your documents, ensure good formatting, and consider how they will be split into chunks. Quality in, quality out.
      • Upload documents (PDF, TXT, MD, etc.) through the Open WebUI interface.
    3. Embedding Model:
      • An embedding model converts text chunks into vectors. Examples: nomic-embed-text, mxbai-embed-large. Ensure your chosen model is pulled to your Ollama instance:
        # OLLAMA_POD should be set as in section 2.2
        kubectl exec -it -n ollama $OLLAMA_POD -- ollama pull nomic-embed-text
      • In Open WebUI settings, configure it to use this embedding model via your Ollama service.
    4. Vector DB Connection: Configure Open WebUI to connect to your Qdrant service URL (e.g., http://qdrant.ai-db.svc.cluster.local:6333; Qdrant serves its HTTP/REST API on port 6333 and gRPC on 6334 by default). Check Qdrant's service details.
  • For More Control: Custom RAG Pipelines: For highly tailored RAG, consider building custom applications using frameworks like LangChain or LlamaIndex. These applications would also be deployed as services in Kubernetes, interacting with Ollama and your vector DB.
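  • Conceptual Custom RAG Sketch: To illustrate what such a pipeline does under the hood, here is a hedged Python sketch that embeds a question via Ollama's embeddings endpoint and retrieves the closest chunks from Qdrant. The collection name (company_docs), payload field (text), model names, and in-cluster URLs are assumptions; it presumes your documents were already chunked, embedded, and upserted into Qdrant.
    import requests
    from qdrant_client import QdrantClient

    OLLAMA_URL = "http://ollama-service.ollama.svc.cluster.local:11434"  # in-cluster Ollama service
    QDRANT_URL = "http://qdrant.ai-db.svc.cluster.local:6333"            # Qdrant HTTP/REST port
    COLLECTION = "company_docs"                                          # hypothetical collection name

    def embed(text: str) -> list[float]:
        """Convert text into a vector using the embedding model pulled in step 3."""
        resp = requests.post(f"{OLLAMA_URL}/api/embeddings",
                             json={"model": "nomic-embed-text", "prompt": text})
        resp.raise_for_status()
        return resp.json()["embedding"]

    def answer(question: str) -> str:
        # 1. Retrieve the most similar document chunks from the vector DB
        client = QdrantClient(url=QDRANT_URL)
        hits = client.search(collection_name=COLLECTION,
                             query_vector=embed(question), limit=3)
        context = "\n\n".join(hit.payload["text"] for hit in hits)
        # 2. Ask the LLM, grounding it in the retrieved context
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        resp = requests.post(f"{OLLAMA_URL}/api/generate",
                             json={"model": "llama3:8b", "prompt": prompt, "stream": False})
        resp.raise_for_status()
        return resp.json()["response"]

    print(answer("What is our travel reimbursement policy?"))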

3.2. Efficient Model Fine-Tuning with Unsloth#

Fine-tuning adapts a pre-trained LLM to your company's specific jargon, style, or tasks. Unsloth significantly optimizes this process, especially for LoRA (Low-Rank Adaptation) fine-tuning.

  • Fine-Tuning: A Separate, Resource-Intensive Workflow:
    • This is not done within Ollama but is a preparatory step to create a custom model.
    • It's iterative and requires experimentation.
  • Dataset Preparation: The Most Critical Part!
    • High-quality, domain-specific datasets are paramount. Format them appropriately (e.g., instruction-response pairs, conversational data). Garbage in, garbage out.
  • Environment for Fine-Tuning:
    • Create a dedicated Docker image containing: Python, PyTorch (GPU-enabled), Unsloth, Hugging Face libraries (transformers, datasets, peft, trl), and your training scripts.
    • Install Unsloth (refer to their GitHub for the latest commands for your CUDA/ROCm version):
      • NVIDIA CUDA example: pip install "unsloth[cu121-ampere-torch212]" (Tailor to your GPU architecture and PyTorch version)
      • AMD ROCm: official Unsloth support for ROCm is limited and may require experimental builds; check the Unsloth repository for current status before planning AMD-based fine-tuning.
  • Running Fine-Tuning Jobs in Kubernetes:
    • Define a Kubernetes Job that runs your fine-tuning script in a pod with dedicated GPU resources.
    • The job should:
      1. Mount your datasets (e.g., from a PVC or cloud storage).
      2. Execute the Unsloth-optimized training script.
      3. Save the resulting fine-tuned model adapter (e.g., LoRA weights) or the fully merged model to persistent storage.
  • Example Unsloth Training Snippet (Conceptual - see Unsloth docs for details):
    from unsloth import FastLanguageModel
    import torch
    
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/llama-3-8b-bnb-4bit", # Choose a model supported by Unsloth
        max_seq_length = 2048, # Or your desired sequence length
        load_in_4bit = True,   # Or load_in_8bit=True for better quality if VRAM allows
    )
    # Apply LoRA adapters
    model = FastLanguageModel.get_peft_model(
        model,
        r = 16, # LoRA rank (e.g., 8, 16, 32) - higher can mean more learnable parameters but more VRAM
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], # Common Llama modules
        lora_alpha = 16, # Often set equal to r
        lora_dropout = 0, # Or a small value like 0.05 or 0.1
        bias = "none",
        use_gradient_checkpointing = True, # Saves memory
    )
    # ... (Load your dataset using Hugging Face datasets)
    # ... (Configure and run Hugging Face Trainer with your Unsloth model)
    # model.save_pretrained("lora_model_output") # Saves LoRA adapters
  • Importing Fine-Tuned Models into Ollama:
    1. Merge LoRA Adapters (if applicable): If you trained with LoRA, you typically need to merge these adapters into the base model to create a single, deployable model. Unsloth provides methods for this, or you can use Hugging Face PEFT.
    2. Convert to GGUF Format: Ollama primarily uses the GGUF model format. Tools like llama.cpp provide scripts to convert Hugging Face models (especially after merging LoRA) to GGUF. This step often involves quantization, which reduces model size and can speed up inference, sometimes with a small quality trade-off. A command-line sketch of this conversion appears after this list.
    3. Create an Ollama Modelfile: This file tells Ollama how to load and use your custom GGUF model.
      # Example for a Llama 3 fine-tune
      # Path to the GGUF file (relative to the Modelfile location when creating)
      FROM ./your-finetuned-company-model.gguf
      
      # Define the prompt template your model was trained with or expects
      TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
      
      {{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>
      
      {{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
      
      {{ .Response }}<|eot_id|>"""
      
      # Set default parameters
      PARAMETER temperature 0.6
      PARAMETER top_k 40
      PARAMETER top_p 0.9
      
      # Optionally, set a default system prompt
      SYSTEM """You are a helpful AI assistant specialized in [Your Company's Domain]."""
    4. Make Model Available to Ollama:
      • Copy the GGUF file and the Modelfile into a directory accessible by your Ollama instance (e.g., onto its PVC, or exec into the pod and place it in /root/.ollama/models or a temporary location).
    5. Create the Model in Ollama:
      # OLLAMA_POD should be set as in section 2.2
      # Assume Modelfile and GGUF are in /tmp inside the pod for this example:
      kubectl cp ./Modelfile ${OLLAMA_POD}:/tmp/Modelfile -n ollama
      kubectl cp ./your-finetuned-company-model.gguf ${OLLAMA_POD}:/tmp/your-finetuned-company-model.gguf -n ollama
      
      kubectl exec -it -n ollama $OLLAMA_POD -- ollama create your-company-model -f /tmp/Modelfile
      Your custom model your-company-model is now available via Ollama and Open WebUI!
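  • Example GGUF Conversion Commands: A hedged command-line sketch of step 2 above (conversion and optional quantization), assuming step 1's merge already produced a full Hugging Face model directory. Script and binary names (convert_hf_to_gguf.py, llama-quantize) reflect recent llama.cpp versions and may differ in yours:
    # Grab llama.cpp for its conversion and quantization tooling
    git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
    pip install -r requirements.txt

    # Convert the merged Hugging Face model directory to GGUF (FP16)
    python convert_hf_to_gguf.py /path/to/merged-model \
      --outfile your-finetuned-company-model-f16.gguf

    # Optionally quantize to shrink the file and speed up inference (Q4_K_M is a common balance)
    # Build llama.cpp first (e.g. cmake -B build && cmake --build build) to get llama-quantize
    ./build/bin/llama-quantize your-finetuned-company-model-f16.gguf \
      your-finetuned-company-model.gguf Q4_K_M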

Phase 4: Security, Operations, and Best Practices – The Long Game#

Building is one thing; maintaining and securing is another.

  • Kubernetes Security Best Practices:
    • RBAC (Role-Based Access Control): Implement least privilege. Define specific Roles and RoleBindings for users and service accounts.
    • Network Policies: Restrict pod-to-pod communication. For example, only allow Open WebUI pods to connect to Ollama pods, and only specific services to connect to the vector DB (an example policy is sketched at the end of this phase).
    • Secrets Management: Use Kubernetes Secrets for all sensitive data like API keys, passwords, and certificates. Consider integrating with Vault for advanced secret management.
    • Pod Security Standards / PodSecurityPolicies (or their successor): Apply appropriate security contexts to your pods to limit their capabilities.
    • Image Scanning: Integrate tools to scan your container images for vulnerabilities.
  • HTTPS Everywhere: Enforced by Nginx Proxy Manager using valid SSL/TLS certificates. Regularly renew certificates.
  • VPN Access: Continue to use strong VPN practices for administrative tasks and secure remote access.
  • Consistent Updates & Patch Management: CRITICAL!
    • Regularly update: OS, Kubernetes (K3s), Docker, NVIDIA drivers, and all deployed applications (NPM, Ollama, Open WebUI, Vector DB, etc.).
    • Pin image versions: In your Kubernetes manifests, use specific version tags for all container images (e.g., ollama/ollama:0.1.40 instead of ollama/ollama:latest). This prevents unexpected breaking changes from :latest tag updates.
    • Subscribe to security advisories for all software components.
  • Ollama API Security: If you choose to expose the Ollama API directly via NPM (not just through Open WebUI), implement an authentication layer in NPM (e.g., Basic Auth, Forward Auth to an IdP) or use stricter Kubernetes network policies.
  • Monitoring and Logging: Your Eyes and Ears
    • Kubernetes Cluster & Application Monitoring: Deploy Prometheus and Grafana (the kube-prometheus-stack Helm chart is excellent) for metrics on nodes, pods, deployments, and services.
    • GPU Monitoring: The NVIDIA GPU Operator often includes DCGM (Data Center GPU Manager) exporter, or you can deploy dcgm-exporter separately to feed detailed GPU metrics (utilization, memory, temperature) into Prometheus.
    • Centralized Logging: Implement a logging stack like the EFK Stack (Elasticsearch, Fluentd/Fluentbit, Kibana) or Grafana Loki to aggregate logs from all Kubernetes pods for easier troubleshooting and auditing.
  • Data Backup and Recovery Strategy:
    • Kubernetes State:
      • For K3s using embedded SQLite: Regularly back up the K3s server data directory (e.g., /var/lib/rancher/k3s/server/db/).
      • For any K8s: Back up etcd if it's external.
      • Velero (https://velero.io/) is an excellent tool for backing up and restoring Kubernetes cluster resources and persistent volumes.
      • Keep all your YAML manifests version-controlled (e.g., in Git).
    • Persistent Volumes (PVCs): Implement robust backup strategies for data on your PVCs (NPM config, Ollama models, Open WebUI data, Vector DB data). This might involve:
      • Storage solution snapshot capabilities.
      • Volume-level backup tools.
      • Rsync-based backups for critical data.
  • Namespace Management: Continue using distinct Kubernetes namespaces (e.g., npm, ollama, open-webui, ai-db, monitoring, fine-tuning-jobs) to logically separate resources, improve organization, apply resource quotas, and implement fine-grained network policies.
  • Resource Management:
    • Define resource requests and limits (CPU, memory) for your Kubernetes deployments to ensure fair resource distribution and prevent noisy neighbor problems.
    • Monitor resource usage closely to identify bottlenecks or needs for scaling.
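  • Example NetworkPolicy (Ollama ingress): As a concrete instance of the network-policy guidance above, this sketch only admits traffic to Ollama pods from Open WebUI pods on port 11434. It assumes the labels used in Phase 2 and relies on the kubernetes.io/metadata.name namespace label that current Kubernetes versions set automatically:
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-openwebui-to-ollama
      namespace: ollama
    spec:
      podSelector:
        matchLabels:
          app: ollama
      policyTypes: ["Ingress"]
      ingress:
      - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: open-webui
          podSelector:
            matchLabels:
              app: open-webui
        ports:
        - protocol: TCP
          port: 11434
    K3s ships an embedded network policy controller, so policies like this are enforced without installing additional CNI components.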

Phase 5: Scaling, Iteration, and Future Considerations#

  • Scaling Your Platform:
    • Stateless Services (Open WebUI, NPM): Scale by increasing replica counts in their Deployments.
    • Ollama: Scaling Ollama instances across multiple nodes with GPUs requires consideration for model distribution or using a shared, network-accessible model cache.
    • Vector Database: Most vector databases can be scaled; refer to their specific documentation (e.g., Qdrant clustering).
    • Kubernetes Cluster: Add more worker nodes to your K3s cluster as needed.
  • Iterative Improvement:
    • The AI field is evolving rapidly. Regularly evaluate new models, tools, and techniques.
    • Gather feedback from users of your internal AI platform to guide improvements.
  • Future Enhancements:
    • Authentication & Authorization: Integrate Open WebUI and other services with your company's SSO/IdP (e.g., LDAP, OIDC).
    • Model Management & Versioning: Implement more sophisticated model lifecycle management if you start fine-tuning many models.
    • Automated CI/CD Pipelines: Automate the deployment and update process for your AI services and fine-tuning jobs.
    • Cost Optimization: Monitor resource usage to optimize hardware and potentially explore spot instances or reserved instances if in a cloud environment.

Conclusion: Empowering Your Organization with AI#

Building your own secure and scalable internal AI platform is a significant but profoundly rewarding endeavor. This guide has provided a comprehensive blueprint, but the journey requires ongoing dedication to maintenance, security, and adaptation.

Key Benefits Re-emphasized:

  • True Data Sovereignty: Your sensitive company data stays within your control.
  • Customized AI Capabilities: Tailor models and applications to your specific business needs.
  • Enhanced Security: Reduce reliance on third-party AI services for confidential tasks.
  • Potential Long-Term Cost Savings: Compared to per-API call costs of commercial services, especially at scale.
  • Innovation Catalyst: Provide a sandbox for your teams to experiment and innovate with AI.

Start with a minimal viable product (MVP), iterate based on feedback and needs, and embrace continuous learning. The power to shape your company's AI future is now firmly in your hands. Good luck!