Your In-Depth Guide to Building a Secure & Scalable Internal AI Platform (June 2025 Edition)

June 5, 2025 · 22 min read

The Vision: Your Private AI Powerhouse#

The desire for bespoke, secure, and scalable AI solutions within organizations is no longer a futuristic dream but a present-day necessity. This guide provides a detailed roadmap to construct a robust internal AI platform. Doing so grants your organization the transformative power of modern Large Language Models (LLMs) while ensuring complete data sovereignty, security, and control over your AI destiny.

Our goal is to build a company-controlled AI ecosystem that offers:

  • Secure, internal access to powerful LLMs.
  • An intuitive chat interface for employees.
  • The ability to "chat with your documents" using Retrieval Augmented Generation (RAG).
  • A solid foundation for developing and deploying custom fine-tuned AI models.

Who is this guide for? This guide is primarily for IT teams, DevOps engineers, and system administrators who have experience with server administration, Linux, networking, and ideally, some familiarity with containerization and orchestration concepts. While we aim for clarity, a foundational understanding of these areas will be highly beneficial.

Core Technology Stack & Rationale:

  • Foundation: Dedicated Server(s) with powerful GPU(s) – Essential for LLM performance.
  • Orchestration: Kubernetes (K3s recommended) – For robust, scalable, and resilient management of our AI services. K3s offers a lightweight yet fully compliant Kubernetes experience, ideal for on-premise and quicker setup.
  • Containerization: Docker – The industry standard for packaging applications and their dependencies.
  • Secure Access & Ingress: VPN (e.g., WireGuard) & Nginx Proxy Manager (NPM) – VPN for secure network-level access, NPM for user-friendly HTTPS subdomains and reverse proxying to internal services.
  • AI Model Serving: Ollama – A user-friendly tool for running open-source LLMs locally, with excellent GPU support.
  • Chat Interface: Open WebUI – A feature-rich, self-hosted web UI that integrates seamlessly with Ollama.
  • RAG Backend: Vector Database (e.g., Qdrant, ChromaDB) – To store document embeddings efficiently for fast retrieval in RAG systems.
  • Efficient Fine-Tuning: Unsloth – An innovative library to significantly speed up LLM fine-tuning and reduce memory usage, making custom model development more accessible.

Disclaimer: This is an advanced blueprint. Specific commands, configurations, and software versions will evolve. Always consult the official documentation for each component and adapt instructions to your specific environment and security policies.

Phase 1: Laying the Foundational Infrastructure#

A robust foundation is critical. This phase covers server preparation, OS setup, and core software installation.

1.1. Server(s) Preparation: The Bedrock of Your AI#

  • Hardware Selection – Don't Skimp Here!
    • CPU: Modern multi-core Intel Xeon or AMD EPYC/Ryzen. Needed for general system operations and supporting AI workloads.
    • RAM: Minimum 64GB, but 128GB+ is strongly recommended. LLMs can be memory-hungry, especially when running multiple models or handling large RAG databases. More RAM also helps if GPU VRAM is a constraint.
    • GPU (Graphics Processing Unit): The single most critical component for LLM performance.
      • NVIDIA: The preferred choice due to mature CUDA drivers and broad software support (Ollama, PyTorch, Unsloth, etc.). Aim for series like RTX 30xx/40xx, or professional A-series (A4000, A6000) / H-series (H100, L40S) for larger budgets.
      • VRAM is King: Ensure ample VRAM (Video RAM) on your GPUs. 16GB is a bare minimum for smaller models; 24GB-48GB+ per GPU is ideal for comfortably running larger, more capable models and for fine-tuning.
      • AMD: ROCm support is maturing but can require more configuration. Carefully check compatibility with Ollama, PyTorch, and Kubernetes device plugins.
    • Storage: Fast NVMe SSDs are crucial for OS responsiveness, quick container image loading, and fast model access. 1TB+ is a good starting point, but factor in space for multiple large models (some are 50GB+), RAG indexes, and system logs.
    • Networking: 1Gbps NIC is a minimum. For multi-node clusters or heavy usage, 10Gbps+ is recommended to avoid bottlenecks.
    • Initial Strategy: Starting with one powerful server can simplify initial setup. You can architect for multi-node expansion later.
  • Operating System:
    • A stable Linux distribution is key. Ubuntu Server 22.04 LTS or 24.04 LTS are excellent choices due to wide community support and compatibility. RHEL or Debian are also viable.
  • Initial Server Setup & Security Hardening:
    • Install your chosen OS.
    • Update all packages immediately: sudo apt update && sudo apt upgrade -y.
    • Configure static IP addresses for your server(s) for stable network identity.
    • Implement a firewall (e.g., ufw on Ubuntu). Only allow necessary ports: SSH (non-standard port recommended), HTTP/HTTPS for NPM, your VPN port, and Kubernetes-specific ports as required by your setup.
    • Secure SSH: Disable root login, enforce key-based authentication, and consider tools like Fail2ban.
  • NVIDIA Drivers & Toolkit (Crucial Prerequisite):
    • If using NVIDIA GPUs, install the proprietary NVIDIA drivers and the nvidia-container-toolkit before installing Docker or Kubernetes. This toolkit allows containers to access NVIDIA GPUs. Follow NVIDIA's official documentation meticulously.
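  • Quick Reference (Firewall & GPU Prerequisites): A hedged Ubuntu sketch of the hardening and NVIDIA steps above. The SSH port (2222) is an assumption, the VPN port matches the WireGuard setup in section 1.4, and the toolkit install assumes NVIDIA's apt repository has already been configured per their documentation.
    # Firewall baseline (adjust ports to your environment)
    sudo ufw default deny incoming
    sudo ufw default allow outgoing
    sudo ufw allow 2222/tcp    # SSH on a non-standard port
    sudo ufw allow 80,443/tcp  # HTTP/HTTPS for Nginx Proxy Manager
    sudo ufw allow 51820/udp   # WireGuard VPN
    sudo ufw allow 6443/tcp    # K3s API server
    sudo ufw enable

    # NVIDIA driver and container toolkit (reboot after the driver install)
    sudo ubuntu-drivers autoinstall
    sudo apt-get install -y nvidia-container-toolkit
    nvidia-smi   # Should list your GPU(s) once the driver is active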

1.2. Docker Installation: The Container Engine#

Docker packages our AI services into portable containers.

  • Install Docker Engine using the official guide: https://docs.docker.com/engine/install/
    # Example for Ubuntu (always verify with official Docker documentation)
    sudo apt-get update
    sudo apt-get install -y apt-transport-https ca-certificates curl gnupg software-properties-common
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
    echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
    sudo apt-get update
    sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
    sudo systemctl start docker
    sudo systemctl enable docker
    # Allow your user to run Docker commands without sudo (requires logout/login or 'newgrp docker')
    sudo usermod -aG docker $USER
    newgrp docker # Activates the group change for the current session

    Tip: After installation, verify Docker is running with docker --version and sudo systemctl status docker.
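  • Docker NVIDIA Runtime Configuration: Since K3s will reuse Docker as its container runtime in the next step, Docker itself must be able to hand GPUs to containers. A minimal sketch, assuming the driver and nvidia-container-toolkit from section 1.1 are installed (the CUDA image tag is an example; pick one compatible with your driver):
    # Register the NVIDIA runtime in /etc/docker/daemon.json and restart Docker
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
    # Smoke test: the container should print the same GPU table as the host's nvidia-smi
    docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi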

1.3. Kubernetes Cluster Setup (using K3s)#

K3s is a lightweight, certified Kubernetes distribution ideal for on-premise setups, edge computing, and development. It's simpler to install and manage than full K8s.

  • Why K3s? It bundles essential components, has a small footprint, and offers a straightforward path to a functional cluster.
  • Installing K3s (Single Master Node): We'll configure K3s to use Docker as its container runtime (ensure Docker is already NVIDIA-aware if you have GPUs) and disable K3s's built-in Traefik ingress, as we'll use Nginx Proxy Manager.
    # Ensure NVIDIA drivers and nvidia-container-toolkit are installed and Docker is configured for NVIDIA runtime if applicable.
    curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--docker --disable=traefik" sh -s - --write-kubeconfig-mode 644
    
  • Configure kubectl Access (Your Kubernetes Command-Line Tool):
    sudo mkdir -p $HOME/.kube
    sudo cp /etc/rancher/k3s/k3s.yaml $HOME/.kube/config
    sudo chown $(id -u):$(id -g) $HOME/.kube/config
    export KUBECONFIG=$HOME/.kube/config # Add this line to your shell profile (e.g., .bashrc, .zshrc)
    Verify access: kubectl get nodes -o wide (should show your master node).
  • Enabling NVIDIA GPU Support in Kubernetes: The NVIDIA GPU Operator automates the management of NVIDIA GPU resources in Kubernetes.
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
    # This command installs the operator which then deploys necessary components like device plugins.
    helm install gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator --create-namespace --wait
    Verify GPU availability to Kubernetes: kubectl describe nodes | grep nvidia.com/gpu. You should see allocatable GPU resources.
  • Persistent Storage:
    • K3s includes a Local Path Provisioner by default. This uses directories on the host node for persistent storage, which is fine for a single-node cluster.
    • For multi-node clusters or production setups: You'll need a robust network storage solution like NFS, Ceph, Rook, or a cloud-native option like Longhorn (https://longhorn.io/).
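  • Optional GPU Smoke Test: Once the GPU Operator reports allocatable GPUs, a throwaway pod confirms that scheduling and the device plugin work end to end. A minimal sketch (the CUDA image tag is an example):
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-smoke-test
    spec:
      restartPolicy: Never
      containers:
      - name: nvidia-smi
        image: nvidia/cuda:12.4.1-base-ubuntu22.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1
    Apply it with kubectl apply -f gpu-smoke-test.yaml, check the output with kubectl logs gpu-smoke-test, then delete the pod to free the GPU.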

1.4. VPN Setup: Secure Network Access#

A VPN (Virtual Private Network) is essential for secure administrative access and potentially for users to connect to your internal AI platform as if they were on the local network.

  • Options:
    • WireGuard: Modern, fast, and generally simpler to configure. Consider using wg-easy (a Docker image providing a web UI for WireGuard).
    • OpenVPN: A mature and robust option, though potentially more complex.
  • Setup Steps:
    1. Install VPN server software on a dedicated VM, a container, or an infrastructure server (ensure firewall rules are updated).
    2. Configure server settings (IP ranges, DNS for clients) and generate client profiles/keys.
    3. Configure your external firewall/router to forward the VPN port (e.g., UDP 51820 for WireGuard) to your VPN server.
    4. Ensure VPN clients are assigned IP addresses that can route to your Kubernetes nodes and internal service IPs.
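  • Example WireGuard Server Config: If you choose WireGuard, the server side reduces to one interface file plus a [Peer] block per client. A minimal sketch, assuming 10.8.0.0/24 as the VPN client range (generate real keys with wg genkey / wg pubkey):
    # /etc/wireguard/wg0.conf (server side)
    [Interface]
    Address    = 10.8.0.1/24
    ListenPort = 51820
    PrivateKey = <server-private-key>

    [Peer]
    # One block per client device
    PublicKey  = <client-public-key>
    AllowedIPs = 10.8.0.2/32
    Bring it up with sudo systemctl enable --now wg-quick@wg0, and remember that clients still need AllowedIPs on their side covering your internal service networks so traffic is routed through the tunnel.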

Phase 2: Deploying Core AI Services on Kubernetes#

We'll deploy Nginx Proxy Manager, Ollama, and Open WebUI. It's good practice to use separate Kubernetes namespaces for better organization and resource management.

General Kubernetes Workflow for Each Service:

  1. Create a Namespace: kubectl create namespace <namespace-name>
  2. Define a PersistentVolumeClaim (PVC) for any stateful data.
  3. Define a Deployment to manage the application pods.
  4. Define a Service to provide a stable internal network endpoint for the Deployment.

2.1. Nginx Proxy Manager (NPM): Your Secure Gateway#

NPM will handle SSL/TLS termination (HTTPS) and act as a reverse proxy, routing requests from user-friendly subdomains to the correct internal services.

  • Create Namespace: kubectl create namespace npm
  • NPM Persistent Storage (npm-pvc.yaml): Stores NPM configuration and Let's Encrypt certificates.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: npm-data-pvc
      namespace: npm
    spec:
      accessModes: ["ReadWriteOnce"] # Suitable for a single NPM pod
      resources:
        requests:
          storage: 10Gi # Adjust as needed
    Apply: kubectl apply -f npm-pvc.yaml -n npm
  • NPM Deployment & Service (npm-deployment.yaml):
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-proxy-manager
      namespace: npm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: npm
      template:
        metadata:
          labels:
            app: npm
        spec:
          containers:
          - name: npm-app
            image: 'jc21/nginx-proxy-manager:latest' # Consider pinning to a specific version tag for stability
            ports:
            - containerPort: 80  # HTTP
            - containerPort: 81  # Admin UI
            - containerPort: 443 # HTTPS
            volumeMounts:
            - name: npm-data
              mountPath: /data
              subPath: data          # Keep NPM config and certificates on one PVC,
            - name: npm-data
              mountPath: /etc/letsencrypt
              subPath: letsencrypt   # separated into subdirectories via subPath.
          volumes:
          - name: npm-data
            persistentVolumeClaim:
              claimName: npm-data-pvc
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: npm-service
      namespace: npm
    spec:
      type: LoadBalancer # On-prem K3s can satisfy this via its built-in ServiceLB (node IPs) or via MetalLB (https://metallb.universe.tf/) for a dedicated virtual IP; a MetalLB sketch appears at the end of this section. Alternatively, use NodePort.
      selector:
        app: npm
      ports:
      - name: http
        port: 80
        targetPort: 80
      - name: https
        port: 443
        targetPort: 443
      - name: admin
        port: 81 # Admin UI; consider restricting access to this port (e.g., VPN only)
        targetPort: 81
    Apply: kubectl apply -f npm-deployment.yaml -n npm
  • Initial NPM Setup:
    1. Access the NPM admin UI. If using LoadBalancer, find its External IP (kubectl get svc -n npm npm-service). If NodePort, use <NodeIP>:<NodePortFor81>.
    2. Default login: admin@example.com / changeme. Change this immediately!
  • DNS Configuration: In your internal (or public, if applicable) DNS, create A records for your desired subdomains (e.g., chat.internal.yourcompany.com) pointing to the external IP of the npm-service, or CNAME records pointing to a hostname that resolves to that IP.

    Tip: DNS changes can take time to propagate.
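  • MetalLB Address Pool (if using LoadBalancer): A minimal Layer-2 sketch for the MetalLB option referenced in the npm-service manifest above. It assumes MetalLB is already installed in the metallb-system namespace and that 192.168.1.240-250 is an unused range on your LAN:
    apiVersion: metallb.io/v1beta1
    kind: IPAddressPool
    metadata:
      name: npm-pool
      namespace: metallb-system
    spec:
      addresses:
      - 192.168.1.240-192.168.1.250
    ---
    apiVersion: metallb.io/v1beta1
    kind: L2Advertisement
    metadata:
      name: npm-l2
      namespace: metallb-system
    spec:
      ipAddressPools:
      - npm-pool
    After applying, kubectl get svc -n npm npm-service should show an External IP from this pool, which is the address your DNS records point at.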

2.2. Ollama: Serving Your LLMs#

Ollama makes running open-source LLMs straightforward.

  • Create Namespace: kubectl create namespace ollama
  • Ollama Models Storage (ollama-pvc.yaml): Downloaded models are large.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: ollama-models-pvc
      namespace: ollama
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 200Gi # Adjust based on how many models and their sizes
    Apply: kubectl apply -f ollama-pvc.yaml -n ollama
  • Ollama Deployment & Service (ollama-deployment.yaml):
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ollama
      namespace: ollama
    spec:
      replicas: 1 # For multiple GPUs/nodes, scaling requires careful model management or shared storage for models.
      strategy:
        type: Recreate # Often preferred for GPU workloads to ensure clean release/acquisition of GPU resources during updates.
      selector:
        matchLabels:
          app: ollama
      template:
        metadata:
          labels:
            app: ollama
        spec:
          containers:
          - name: ollama
            image: ollama/ollama:latest # Pin to a specific version for production
            ports:
            - containerPort: 11434
            volumeMounts:
            - name: ollama-models
              mountPath: /root/.ollama # Ollama's default model storage path
            resources: # CRITICAL for GPU access
              limits:
                nvidia.com/gpu: 1 # Request 1 GPU from Kubernetes
              # requests: # Optionally, set requests equal to limits for guaranteed allocation
              #   nvidia.com/gpu: 1
            # Consider adding liveness and readiness probes for better health management.
          volumes:
          - name: ollama-models
            persistentVolumeClaim:
              claimName: ollama-models-pvc
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: ollama-service
      namespace: ollama
    spec:
      type: ClusterIP # This service is internal; Open WebUI will access it via this ClusterIP.
      selector:
        app: ollama
      ports:
      - port: 11434
        targetPort: 11434
    
    Apply: kubectl apply -f ollama-deployment.yaml -n ollama
  • Pulling Your First Model:
    # Get the name of your Ollama pod
    OLLAMA_POD=$(kubectl get pods -n ollama -l app=ollama -o jsonpath='{.items[0].metadata.name}')
    # Execute the pull command inside the pod
    kubectl exec -it -n ollama $OLLAMA_POD -- ollama pull llama3:8b # Example: Llama 3 8B
    # Check Ollama logs for pull progress and GPU detection:
    kubectl logs -n ollama -f $OLLAMA_POD
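  • Quick API Check: Before wiring up Open WebUI, confirm the Ollama API responds. A short sketch using a temporary port-forward and Ollama's REST endpoints:
    # Forward the internal service to your workstation
    kubectl port-forward -n ollama svc/ollama-service 11434:11434 &
    # List the models Ollama has pulled
    curl http://localhost:11434/api/tags
    # Ask the model a question (non-streaming for readable output)
    curl http://localhost:11434/api/generate -d '{"model": "llama3:8b", "prompt": "Say hello in one sentence.", "stream": false}'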

2.3. Open WebUI: The Chat Interface#

Open WebUI provides a user-friendly interface to interact with models served by Ollama.

  • Create Namespace: kubectl create namespace open-webui
  • Open WebUI Data Storage (open-webui-pvc.yaml): For configuration, user data, RAG history etc.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: open-webui-data-pvc
      namespace: open-webui
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 20Gi # Adjust as needed
    Apply: kubectl apply -f open-webui-pvc.yaml -n open-webui
  • Open WebUI Deployment & Service (open-webui-deployment.yaml):
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: open-webui
      namespace: open-webui
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: open-webui
      template:
        metadata:
          labels:
            app: open-webui
        spec:
          containers:
          - name: open-webui
            image: ghcr.io/open-webui/open-webui:main # Pin to a specific version tag for production
            ports:
            - containerPort: 8080 # Default port for Open WebUI
            env:
            # This is crucial: points Open WebUI to your internal Ollama Kubernetes service.
            - name: OLLAMA_BASE_URL
              value: "[http://ollama-service.ollama.svc.cluster.local:11434](http://ollama-service.ollama.svc.cluster.local:11434)"
            # Add other Open WebUI environment variables as needed (e.g., for enabling RAG, authentication)
            volumeMounts:
            - name: open-webui-data
              mountPath: /app/backend/data # Verify this path in Open WebUI's official documentation
          volumes:
          - name: open-webui-data
            persistentVolumeClaim:
              claimName: open-webui-data-pvc
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: open-webui-service
      namespace: open-webui
    spec:
      type: ClusterIP # Internal service; will be exposed via Nginx Proxy Manager.
      selector:
        app: open-webui
      ports:
      - port: 8080 # The port this service listens on
        targetPort: 8080 # The port the container is listening on
    
    Apply: kubectl apply -f open-webui-deployment.yaml -n open-webui
  • Expose Open WebUI via Nginx Proxy Manager:
    1. Navigate to your NPM admin UI.
    2. Click "Proxy Hosts" then "Add Proxy Host".
    3. Domain Names: Enter your chosen subdomain (e.g., chat.internal.yourcompany.com).
    4. Scheme: http (NPM handles SSL/TLS).
    5. Forward Hostname / IP: open-webui-service.open-webui.svc.cluster.local (This is the Kubernetes internal DNS name for the Open WebUI service).
    6. Forward Port: 8080 (the port specified in the Open WebUI service).
    7. Enable: "Block Common Exploits" and crucially "Websockets Support" (WebUI uses websockets).
    8. Go to the SSL tab:
      • Select "Request a new SSL certificate".
      • Enable "Force SSL" and "HTTP/2 Support".
      • If internal.yourcompany.com is a subdomain of a public domain you own and can manage via a supported DNS provider, use the "DNS Challenge" for Let's Encrypt. This is the most robust way to get certs for internal names.
    9. Save. You should now be able to access Open WebUI at https://chat.internal.yourcompany.com.

Phase 3: Unlocking Advanced AI Capabilities#

With the core services running, let's explore advanced features.

3.1. Retrieval Augmented Generation (RAG): Chat with Your Company's Documents#

RAG allows your LLM to access and cite information from your private documents, providing contextually relevant answers.

  • Vector Database on Kubernetes (e.g., Qdrant): Vector databases store "embeddings" – numerical representations of your document content – allowing for fast similarity searches.
    • Create Namespace: kubectl create namespace ai-db
    • Deploy using its official Helm chart (recommended for ease of management):
      helm repo add qdrant https://qdrant.github.io/qdrant-helm
      helm install qdrant qdrant/qdrant -n ai-db \
        --set persistence.enabled=true \
        --set persistence.size=50Gi # Adjust storage based on expected document volume
      This creates a service like qdrant.ai-db.svc.cluster.local.
  • Integrating RAG with Open WebUI:
    1. Open WebUI has built-in RAG support. Access its admin settings panel.
    2. Document Preprocessing & Ingestion:
      • Preprocessing is key! Clean your documents, ensure good formatting, and consider how they will be split into chunks. Quality in, quality out.
      • Upload documents (PDF, TXT, MD, etc.) through the Open WebUI interface.
    3. Embedding Model:
      • An embedding model converts text chunks into vectors. Examples: nomic-embed-text, mxbai-embed-large. Ensure your chosen model is pulled to your Ollama instance:
        # OLLAMA_POD should be set as in section 2.2
        kubectl exec -it -n ollama $OLLAMA_POD -- ollama pull nomic-embed-text
      • In Open WebUI settings, configure it to use this embedding model via your Ollama service.
    4. Vector DB Connection: Configure Open WebUI to connect to your Qdrant service URL (e.g., http://qdrant.ai-db.svc.cluster.local:6333; Qdrant serves its HTTP/REST API on port 6333 and gRPC on 6334 by default). Check Qdrant's service details.
  • For More Control: Custom RAG Pipelines: For highly tailored RAG, consider building custom applications using frameworks like LangChain or LlamaIndex. These applications would also be deployed as services in Kubernetes, interacting with Ollama and your vector DB.
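  • Conceptual Custom RAG Sketch: To illustrate what such a pipeline does under the hood, here is a hedged Python sketch that embeds a question via Ollama's embeddings endpoint and retrieves the closest chunks from Qdrant. The collection name (company_docs), payload field (text), model names, and in-cluster URLs are assumptions; it presumes your documents were already chunked, embedded, and upserted into Qdrant.
    import requests
    from qdrant_client import QdrantClient

    OLLAMA_URL = "http://ollama-service.ollama.svc.cluster.local:11434"  # in-cluster Ollama service
    QDRANT_URL = "http://qdrant.ai-db.svc.cluster.local:6333"            # Qdrant HTTP/REST port
    COLLECTION = "company_docs"                                          # hypothetical collection name

    def embed(text: str) -> list[float]:
        """Convert text into a vector using the embedding model pulled in step 3."""
        resp = requests.post(f"{OLLAMA_URL}/api/embeddings",
                             json={"model": "nomic-embed-text", "prompt": text})
        resp.raise_for_status()
        return resp.json()["embedding"]

    def answer(question: str) -> str:
        # 1. Retrieve the most similar document chunks from the vector DB
        client = QdrantClient(url=QDRANT_URL)
        hits = client.search(collection_name=COLLECTION,
                             query_vector=embed(question), limit=3)
        context = "\n\n".join(hit.payload["text"] for hit in hits)
        # 2. Ask the LLM, grounding it in the retrieved context
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        resp = requests.post(f"{OLLAMA_URL}/api/generate",
                             json={"model": "llama3:8b", "prompt": prompt, "stream": False})
        resp.raise_for_status()
        return resp.json()["response"]

    print(answer("What is our travel reimbursement policy?"))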

3.2. Efficient Model Fine-Tuning with Unsloth#

Fine-tuning adapts a pre-trained LLM to your company's specific jargon, style, or tasks. Unsloth significantly optimizes this process, especially for LoRA (Low-Rank Adaptation) fine-tuning.

  • Fine-Tuning: A Separate, Resource-Intensive Workflow:
    • This is not done within Ollama but is a preparatory step to create a custom model.
    • It's iterative and requires experimentation.
  • Dataset Preparation: The Most Critical Part!
    • High-quality, domain-specific datasets are paramount. Format them appropriately (e.g., instruction-response pairs, conversational data). Garbage in, garbage out.
  • Environment for Fine-Tuning:
    • Create a dedicated Docker image containing: Python, PyTorch (GPU-enabled), Unsloth, Hugging Face libraries (transformers, datasets, peft, trl), and your training scripts.
    • Install Unsloth (refer to their GitHub for the latest commands for your CUDA/ROCm version):
      • NVIDIA CUDA example: pip install "unsloth[cu121-ampere-torch212]" (Tailor to your GPU architecture and PyTorch version)
      • AMD ROCm: official Unsloth support for ROCm is limited and may require experimental builds; check the Unsloth repository for current status before planning AMD-based fine-tuning.
  • Running Fine-Tuning Jobs in Kubernetes:
    • Define a Kubernetes Job that runs your fine-tuning script in a pod with dedicated GPU resources.
    • The job should:
      1. Mount your datasets (e.g., from a PVC or cloud storage).
      2. Execute the Unsloth-optimized training script.
      3. Save the resulting fine-tuned model adapter (e.g., LoRA weights) or the fully merged model to persistent storage.
  • Example Unsloth Training Snippet (Conceptual - see Unsloth docs for details):
    from unsloth import FastLanguageModel
    import torch
    
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/llama-3-8b-bnb-4bit", # Choose a model supported by Unsloth
        max_seq_length = 2048, # Or your desired sequence length
        load_in_4bit = True,   # Or load_in_8bit=True for better quality if VRAM allows
    )
    # Apply LoRA adapters
    model = FastLanguageModel.get_peft_model(
        model,
        r = 16, # LoRA rank (e.g., 8, 16, 32) - higher can mean more learnable parameters but more VRAM
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], # Common Llama modules
        lora_alpha = 16, # Often set equal to r
        lora_dropout = 0, # Or a small value like 0.05 or 0.1
        bias = "none",
        use_gradient_checkpointing = True, # Saves memory
    )
    # ... (Load your dataset using Hugging Face datasets)
    # ... (Configure and run Hugging Face Trainer with your Unsloth model)
    # model.save_pretrained("lora_model_output") # Saves LoRA adapters
  • Importing Fine-Tuned Models into Ollama:
    1. Merge LoRA Adapters (if applicable): If you trained with LoRA, you typically need to merge these adapters into the base model to create a single, deployable model. Unsloth provides methods for this, or you can use Hugging Face PEFT.
    2. Convert to GGUF Format: Ollama primarily uses the GGUF model format. Tools like llama.cpp provide scripts to convert Hugging Face models (especially after merging LoRA) to GGUF. This step often involves quantization, which reduces model size and can speed up inference, sometimes with a small quality trade-off. A command-line sketch of this conversion appears after this list.
    3. Create an Ollama Modelfile: This file tells Ollama how to load and use your custom GGUF model.
      # Example for a Llama 3 fine-tune
      # Path to the GGUF file (relative to the Modelfile location when creating)
      FROM ./your-finetuned-company-model.gguf
      
      # Define the prompt template your model was trained with or expects
      TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
      
      {{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>
      
      {{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
      
      {{ .Response }}<|eot_id|>"""
      
      # Set default parameters
      PARAMETER temperature 0.6
      PARAMETER top_k 40
      PARAMETER top_p 0.9
      
      # Optionally, set a default system prompt
      SYSTEM """You are a helpful AI assistant specialized in [Your Company's Domain]."""
    4. Make Model Available to Ollama:
      • Copy the GGUF file and the Modelfile into a directory accessible by your Ollama instance (e.g., onto its PVC, or exec into the pod and place it in /root/.ollama/models or a temporary location).
    5. Create the Model in Ollama:
      # OLLAMA_POD should be set as in section 2.2
      # Assume Modelfile and GGUF are in /tmp inside the pod for this example:
      kubectl cp ./Modelfile ${OLLAMA_POD}:/tmp/Modelfile -n ollama
      kubectl cp ./your-finetuned-company-model.gguf ${OLLAMA_POD}:/tmp/your-finetuned-company-model.gguf -n ollama
      
      kubectl exec -it -n ollama $OLLAMA_POD -- ollama create your-company-model -f /tmp/Modelfile
      Your custom model your-company-model is now available via Ollama and Open WebUI!
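  • Example GGUF Conversion Commands: A hedged command-line sketch of step 2 above (conversion and optional quantization), assuming step 1's merge already produced a full Hugging Face model directory. Script and binary names (convert_hf_to_gguf.py, llama-quantize) reflect recent llama.cpp versions and may differ in yours:
    # Grab llama.cpp for its conversion and quantization tooling
    git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
    pip install -r requirements.txt

    # Convert the merged Hugging Face model directory to GGUF (FP16)
    python convert_hf_to_gguf.py /path/to/merged-model \
      --outfile your-finetuned-company-model-f16.gguf

    # Optionally quantize to shrink the file and speed up inference (Q4_K_M is a common balance)
    # Build llama.cpp first (e.g. cmake -B build && cmake --build build) to get llama-quantize
    ./build/bin/llama-quantize your-finetuned-company-model-f16.gguf \
      your-finetuned-company-model.gguf Q4_K_M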

Phase 4: Security, Operations, and Best Practices – The Long Game#

Building is one thing; maintaining and securing is another.

  • Kubernetes Security Best Practices:
    • RBAC (Role-Based Access Control): Implement least privilege. Define specific Roles and RoleBindings for users and service accounts.
    • Network Policies: Restrict pod-to-pod communication. For example, only allow Open WebUI pods to connect to Ollama pods, and only specific services to connect to the vector DB (an example policy is sketched at the end of this phase).
    • Secrets Management: Use Kubernetes Secrets for all sensitive data like API keys, passwords, and certificates. Consider integrating with Vault for advanced secret management.
    • Pod Security Standards / PodSecurityPolicies (or their successor): Apply appropriate security contexts to your pods to limit their capabilities.
    • Image Scanning: Integrate tools to scan your container images for vulnerabilities.
  • HTTPS Everywhere: Enforced by Nginx Proxy Manager using valid SSL/TLS certificates. Regularly renew certificates.
  • VPN Access: Continue to use strong VPN practices for administrative tasks and secure remote access.
  • Consistent Updates & Patch Management: CRITICAL!
    • Regularly update: OS, Kubernetes (K3s), Docker, NVIDIA drivers, and all deployed applications (NPM, Ollama, Open WebUI, Vector DB, etc.).
    • Pin image versions: In your Kubernetes manifests, use specific version tags for all container images (e.g., ollama/ollama:0.1.40 instead of ollama/ollama:latest). This prevents unexpected breaking changes from :latest tag updates.
    • Subscribe to security advisories for all software components.
  • Ollama API Security: If you choose to expose the Ollama API directly via NPM (not just through Open WebUI), implement an authentication layer in NPM (e.g., Basic Auth, Forward Auth to an IdP) or use stricter Kubernetes network policies.
  • Monitoring and Logging: Your Eyes and Ears
    • Kubernetes Cluster & Application Monitoring: Deploy Prometheus and Grafana (the kube-prometheus-stack Helm chart is excellent) for metrics on nodes, pods, deployments, and services.
    • GPU Monitoring: The NVIDIA GPU Operator often includes DCGM (Data Center GPU Manager) exporter, or you can deploy dcgm-exporter separately to feed detailed GPU metrics (utilization, memory, temperature) into Prometheus.
    • Centralized Logging: Implement a logging stack like the EFK Stack (Elasticsearch, Fluentd/Fluentbit, Kibana) or Grafana Loki to aggregate logs from all Kubernetes pods for easier troubleshooting and auditing.
  • Data Backup and Recovery Strategy:
    • Kubernetes State:
      • For K3s using embedded SQLite: Regularly back up the K3s server data directory (e.g., /var/lib/rancher/k3s/server/db/).
      • For any K8s: Back up etcd if it's external.
      • Velero (https://velero.io/) is an excellent tool for backing up and restoring Kubernetes cluster resources and persistent volumes.
      • Keep all your YAML manifests version-controlled (e.g., in Git).
    • Persistent Volumes (PVCs): Implement robust backup strategies for data on your PVCs (NPM config, Ollama models, Open WebUI data, Vector DB data). This might involve:
      • Storage solution snapshot capabilities.
      • Volume-level backup tools.
      • Rsync-based backups for critical data.
  • Namespace Management: Continue using distinct Kubernetes namespaces (e.g., npm, ollama, open-webui, ai-db, monitoring, fine-tuning-jobs) to logically separate resources, improve organization, apply resource quotas, and implement fine-grained network policies.
  • Resource Management:
    • Define resource requests and limits (CPU, memory) for your Kubernetes deployments to ensure fair resource distribution and prevent noisy neighbor problems.
    • Monitor resource usage closely to identify bottlenecks or needs for scaling.
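  • Example NetworkPolicy (Ollama ingress): As a concrete instance of the network-policy guidance above, this sketch only admits traffic to Ollama pods from Open WebUI pods on port 11434. It assumes the labels used in Phase 2 and relies on the kubernetes.io/metadata.name namespace label that current Kubernetes versions set automatically:
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-openwebui-to-ollama
      namespace: ollama
    spec:
      podSelector:
        matchLabels:
          app: ollama
      policyTypes: ["Ingress"]
      ingress:
      - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: open-webui
          podSelector:
            matchLabels:
              app: open-webui
        ports:
        - protocol: TCP
          port: 11434
    K3s ships an embedded network policy controller, so policies like this are enforced without installing additional CNI components.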

Phase 5: Scaling, Iteration, and Future Considerations#

  • Scaling Your Platform:
    • Stateless Services (Open WebUI, NPM): Scale by increasing replica counts in their Deployments.
    • Ollama: Scaling Ollama instances across multiple nodes with GPUs requires consideration for model distribution or using a shared, network-accessible model cache.
    • Vector Database: Most vector databases can be scaled; refer to their specific documentation (e.g., Qdrant clustering).
    • Kubernetes Cluster: Add more worker nodes to your K3s cluster as needed.
  • Iterative Improvement:
    • The AI field is evolving rapidly. Regularly evaluate new models, tools, and techniques.
    • Gather feedback from users of your internal AI platform to guide improvements.
  • Future Enhancements:
    • Authentication & Authorization: Integrate Open WebUI and other services with your company's SSO/IdP (e.g., LDAP, OIDC).
    • Model Management & Versioning: Implement more sophisticated model lifecycle management if you start fine-tuning many models.
    • Automated CI/CD Pipelines: Automate the deployment and update process for your AI services and fine-tuning jobs.
    • Cost Optimization: Monitor resource usage to optimize hardware and potentially explore spot instances or reserved instances if in a cloud environment.

Conclusion: Empowering Your Organization with AI#

Building your own secure and scalable internal AI platform is a significant but profoundly rewarding endeavor. This guide has provided a comprehensive blueprint, but the journey requires ongoing dedication to maintenance, security, and adaptation.

Key Benefits Re-emphasized:

  • True Data Sovereignty: Your sensitive company data stays within your control.
  • Customized AI Capabilities: Tailor models and applications to your specific business needs.
  • Enhanced Security: Reduce reliance on third-party AI services for confidential tasks.
  • Potential Long-Term Cost Savings: Compared to per-API call costs of commercial services, especially at scale.
  • Innovation Catalyst: Provide a sandbox for your teams to experiment and innovate with AI.

Start with a minimal viable product (MVP), iterate based on feedback and needs, and embrace continuous learning. The power to shape your company's AI future is now firmly in your hands. Good luck!