Proxmox Talos Kubernetes Cluster

This guide covers deploying a production-ready Talos Kubernetes cluster on Proxmox VE using Infrastructure as Code (OpenTofu/Terraform).

Overview

The testing environment uses:

  • Proxmox VE: Virtualization platform
  • Talos Linux: Immutable, minimal OS designed for Kubernetes
  • OpenTofu/Terraform: Infrastructure as Code for repeatable deployments
  • Cilium: eBPF-based CNI with kube-proxy replacement

Cluster Architecture

┌─────────────────────────────────────────────────────────────┐
│                   Proxmox VE Host (pve2)                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐    │
│  │     CP-01     │  │     CP-02     │  │     CP-03     │    │
│  │ 192.168.30.21 │  │ 192.168.30.22 │  │ 192.168.30.23 │    │
│  │  VM ID: 501   │  │  VM ID: 502   │  │  VM ID: 503   │    │
│  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘    │
│          │                  │                  │            │
│          └──────────────────┼──────────────────┘            │
│                             │                               │
│                    VIP: 192.168.30.20                       │
│                     (Kubernetes API)                        │
│                             │                               │
│          ┌──────────────────┼──────────────────┐            │
│          │                  │                  │            │
│  ┌───────┴───────┐  ┌───────┴───────┐  ┌───────┴───────┐    │
│  │     WN-01     │  │     WN-02     │  │     WN-03     │    │
│  │ 192.168.30.24 │  │ 192.168.30.25 │  │ 192.168.30.26 │    │
│  │  VM ID: 504   │  │  VM ID: 505   │  │  VM ID: 506   │    │
│  │  +100GB data  │  │  +100GB data  │  │  +100GB data  │    │
│  └───────────────┘  └───────────────┘  └───────────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Prerequisites

Software Requirements

Tool                 Version   Purpose
OpenTofu/Terraform   >= 1.5    Infrastructure provisioning
talosctl             v1.12.0   Talos cluster management
kubectl              >= 1.32   Kubernetes management
helm                 >= 3.x    Cilium installation
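
A quick way to confirm the tools are installed (exact output formats vary by release):

# Verify the required tools are on the PATH
tofu version        # or: terraform version
talosctl version --client
kubectl version --client
helm version --short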

Proxmox Requirements

  1. Proxmox VE version 8.x or later
  2. Storage pool with sufficient space (the configuration below allocates 600GB of virtual disks)
  3. Network bridge (vmbr0) configured
  4. Talos nocloud template (VM ID 9001)

Network Requirements

Resource              IP/Range           Purpose
Control Plane VIP     192.168.30.20      Kubernetes API HA endpoint
Control Plane Nodes   192.168.30.21-23   etcd, API server, scheduler
Worker Nodes          192.168.30.24-26   Application workloads
Gateway               192.168.30.1       Network gateway
Pod CIDR              100.64.0.0/16      Pod networking
Service CIDR          172.20.0.0/16      Service networking

Proxmox Setup

Step 1: Download Talos Image

Download the Talos nocloud image for Proxmox:

# Download the latest Talos nocloud image
wget https://github.com/siderolabs/talos/releases/download/v1.12.0/nocloud-amd64.raw.xz

# Extract the image
xz -d nocloud-amd64.raw.xz
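
Optionally, verify the checksum of the compressed image before extracting; Talos releases publish a sha256sum.txt next to the other assets (worth double-checking if the release layout changes):

# Verify the download against the published checksums (run before xz -d)
wget https://github.com/siderolabs/talos/releases/download/v1.12.0/sha256sum.txt
sha256sum -c sha256sum.txt --ignore-missing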

Step 2: Create Template VM

Create a template VM in Proxmox that will be cloned for all nodes:

# SSH to Proxmox host
ssh root@192.168.30.10

# Create a new VM for the template
qm create 9001 --name talos-template --memory 2048 --cores 2 --net0 virtio,bridge=vmbr0

# Import the Talos disk
qm importdisk 9001 /path/to/nocloud-amd64.raw local-lvm

# Attach the disk to the VM
qm set 9001 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9001-disk-0

# Set boot order
qm set 9001 --boot order=scsi0

# Enable QEMU guest agent
qm set 9001 --agent enabled=1

# Add cloud-init drive
qm set 9001 --ide2 local-lvm:cloudinit

# Convert to template
qm template 9001

Step 3: Verify Template

# List templates
qm list | grep template

# Check template configuration
qm config 9001

Terraform Deployment

Directory Structure

terraform/
├── environments/
│   └── testing/
│       ├── main.tf              # Main configuration
│       ├── variables.tf         # Variable definitions
│       ├── outputs.tf           # Output definitions
│       ├── terraform.tfvars     # Variable values
│       └── terraform.tfvars.example
└── modules/
    ├── proxmox/
    │   └── vm/                  # Proxmox VM module
    └── talos/
        └── proxmox/             # Talos configuration module

Step 1: Configure Variables

Create or update terraform.tfvars:

# Proxmox Connection
proxmox_endpoint     = "https://192.168.30.10:8006"
proxmox_username     = "root@pam"
proxmox_password     = "your-password"  # Use TF_VAR_proxmox_password
proxmox_insecure     = true
proxmox_ssh_username = "root"
proxmox_node         = "pve2"

# Cluster Configuration
cluster_name        = "rciis-testing"
control_plane_count = 3
worker_count        = 3

# VM IDs
control_plane_vm_id_start = 501
worker_vm_id_start        = 504
template_vm_id            = 9001

# Control Plane Resources
control_plane_cpu_cores    = 2
control_plane_memory_mb    = 4096
control_plane_disk_size_gb = 50

# Worker Resources
worker_cpu_cores         = 2
worker_memory_mb         = 4096
worker_disk_size_gb      = 50
worker_data_disk_size_gb = 100  # Additional data disk

# Network
network_bridge = "vmbr0"
ipv4_gateway   = "192.168.30.1"
dns_servers    = ["1.1.1.1", "192.168.10.17"]

control_plane_ips = [
  "192.168.30.21/24",
  "192.168.30.22/24",
  "192.168.30.23/24",
]

worker_ips = [
  "192.168.30.24/24",
  "192.168.30.25/24",
  "192.168.30.26/24",
]

# High Availability
control_plane_vip           = "192.168.30.20"
control_plane_vip_interface = "eth0"

# Versions
talos_version      = "v1.12.0"
kubernetes_version = "1.32.0"

# CNI Configuration
cni_name           = "none"   # Install Cilium manually
disable_kube_proxy = true     # Cilium replaces kube-proxy

tags = ["rciis", "testing"]
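
Rather than committing the password, supply it through the environment; OpenTofu/Terraform automatically reads variables prefixed with TF_VAR_:

# Keep the Proxmox password out of terraform.tfvars
export TF_VAR_proxmox_password='your-password'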

Step 2: Initialize and Plan

cd terraform/environments/testing

# Initialize Terraform
tofu init

# Review the plan
tofu plan

Expected output summary:

Plan: 21 to add, 0 to change, 0 to destroy.

Resources to create:
- 3 control plane VMs (501-503)
- 3 worker VMs (504-506) with data disks
- Talos machine secrets
- Talos machine configurations
- Talos bootstrap
- Kubeconfig and talosconfig

Step 3: Apply Configuration

# Deploy the cluster
tofu apply

# Type 'yes' when prompted

Deployment Time

Initial deployment takes 10-15 minutes and proceeds as follows:

  1. Clones VMs from template (~2 min per VM)
  2. Configures cloud-init networking
  3. Applies Talos machine configuration
  4. Bootstraps the cluster
  5. Waits for kubeconfig

Step 4: Save Credentials

After successful deployment, save the cluster credentials:

# Create config directories
mkdir -p ~/.talos ~/.kube

# Save talosconfig
tofu output -raw talosconfig > ~/.talos/config

# Save kubeconfig
tofu output -raw kubeconfig > ~/.kube/config

# Backup machine secrets (store securely!)
tofu output -raw machine_secrets > talos-secrets-backup.yaml
chmod 600 talos-secrets-backup.yaml
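
As a quick sanity check, confirm both clients see the new contexts:

# Verify the saved credentials are usable
talosctl config contexts
kubectl config current-context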

Post-Deployment Configuration

Bootstrap with Helmfile

The recommended approach is the bootstrap helmfile, which installs both Cilium CNI and ArgoCD in a single command:

cd scripts

# Bootstrap the cluster (install Cilium + ArgoCD)
helmfile -f helmfile-bootstrap.yaml -e testing sync

This will:

  1. Install Cilium CNI with Talos-specific settings
  2. Apply L2 IP Pool for LoadBalancer services
  3. Install ArgoCD in HA mode with KSOPS support for SOPS secrets
  4. Apply ArgoCD repository secrets (git, OCI registry)

Verify Cluster Health

# Check Talos cluster health
talosctl health --nodes 192.168.30.21

# View Talos dashboard
talosctl dashboard --nodes 192.168.30.21

# Check Kubernetes nodes
kubectl get nodes -o wide

# Verify Cilium is running
kubectl get pods -n kube-system -l app.kubernetes.io/name=cilium

# Check Cilium status
cilium status

# Verify all pods are running
kubectl get pods -A

# Check for pending CSRs
kubectl get csr

Expected output of kubectl get nodes -o wide (extra columns omitted):

NAME          STATUS   ROLES           AGE   VERSION   INTERNAL-IP
rciis-cp-01   Ready    control-plane   10m   v1.32.0   192.168.30.21
rciis-cp-02   Ready    control-plane   10m   v1.32.0   192.168.30.22
rciis-cp-03   Ready    control-plane   10m   v1.32.0   192.168.30.23
rciis-wn-01   Ready    <none>          10m   v1.32.0   192.168.30.24
rciis-wn-02   Ready    <none>          10m   v1.32.0   192.168.30.25
rciis-wn-03   Ready    <none>          10m   v1.32.0   192.168.30.26

Access ArgoCD

Get the initial admin password:

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

Access ArgoCD UI via port-forward:

kubectl port-forward svc/argocd-server -n argocd 8080:443

Then open https://localhost:8080 and log in with the username admin and the password retrieved above.
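
If you also use the argocd CLI (not part of this guide's tooling), the equivalent login through the port-forward looks like this:

# Capture the initial admin password, then log in via the argocd CLI
ARGOCD_PWD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
argocd login localhost:8080 --username admin --password "$ARGOCD_PWD" --insecure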

Manual Installation (Alternative)

If you prefer to install components manually without helmfile:

# Install Cilium
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium \
  --namespace kube-system \
  -f apps/infra/cilium/testing/values.yaml

# Apply L2 pool
kubectl apply -f apps/infra/cilium/testing/l2-pool.yaml

# Install ArgoCD
helm repo add argo https://argoproj.github.io/argo-helm
kubectl create namespace argocd
helm install argocd argo/argo-cd \
  --namespace argocd \
  -f apps/infra/argocd/testing/values.yaml

# Apply secrets (requires SOPS)
kustomize build --enable-alpha-plugins --enable-exec apps/infra/secrets/testing | kubectl apply -f -
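
Afterwards, confirm the releases deployed:

# Confirm the Helm releases installed cleanly
helm list -n kube-system
helm list -n argocd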

Manual Talos Configuration

For manual configuration without Terraform, use talosctl directly.

Generate Talos Configuration

# Generate secrets
talosctl gen secrets -o secrets.yaml

# Generate configuration
talosctl gen config rciis-testing https://192.168.30.20:6443 \
  --with-secrets secrets.yaml \
  --output-dir ./talos-config
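
The generated configs can be validated before applying them; cloud mode matches the nocloud platform used here (adjust if your platform differs):

# Validate the generated machine configs
talosctl validate --config talos-config/controlplane.yaml --mode cloud
talosctl validate --config talos-config/worker.yaml --mode cloud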

Create Node Patches

Create patch files for each node in talos/testing/patches/. The control plane patch (e.g. cp1.yaml) sets the node address, the API VIP, and cluster-wide settings:

machine:
  kubelet:
    extraConfig:
      serverTLSBootstrap: true
  network:
    interfaces:
      - interface: eth0
        addresses:
          - 192.168.30.21/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.30.1
        vip:
          ip: 192.168.30.20
    nameservers:
      - 1.1.1.1
      - 192.168.10.17
  install:
    disk: /dev/sda
    image: ghcr.io/siderolabs/installer:v1.12.0
    wipe: true
  features:
    kubePrism:
      enabled: true
      port: 7445
cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true
  extraManifests:
    - https://raw.githubusercontent.com/alex1989hu/kubelet-serving-cert-approver/main/deploy/standalone-install.yaml
    - https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

The worker patch (e.g. wk1.yaml) is the same minus the VIP and the cluster section:

machine:
  kubelet:
    extraConfig:
      serverTLSBootstrap: true
  network:
    interfaces:
      - interface: eth0
        addresses:
          - 192.168.30.24/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.30.1
    nameservers:
      - 1.1.1.1
      - 192.168.10.17
  install:
    disk: /dev/sda
    image: ghcr.io/siderolabs/installer:v1.12.0
    wipe: true
  features:
    kubePrism:
      enabled: true
      port: 7445

Apply Configuration

# Apply to control plane nodes (repeat for 192.168.30.22 and 192.168.30.23 with their patches)
talosctl apply-config --insecure \
  --nodes 192.168.30.21 \
  --file talos-config/controlplane.yaml \
  --config-patch @talos/testing/patches/cp1.yaml

# Apply to worker nodes (repeat for 192.168.30.25 and 192.168.30.26 with their patches)
talosctl apply-config --insecure \
  --nodes 192.168.30.24 \
  --file talos-config/worker.yaml \
  --config-patch @talos/testing/patches/wk1.yaml

# Bootstrap the cluster (run once on first control plane)
talosctl bootstrap --nodes 192.168.30.21

# Get kubeconfig
talosctl kubeconfig --nodes 192.168.30.21

Cluster Management

Talos Commands

# View cluster members
talosctl get members --nodes 192.168.30.21

# Check etcd status
talosctl etcd status --nodes 192.168.30.21

# View service status
talosctl services --nodes 192.168.30.21

# Stream kubelet logs
talosctl logs -f kubelet --nodes 192.168.30.21

# Interactive dashboard
talosctl dashboard --nodes 192.168.30.21

Upgrading Talos

# Check current version
talosctl version --nodes 192.168.30.21

# Upgrade control plane nodes (one at a time)
talosctl upgrade \
  --nodes 192.168.30.21 \
  --image ghcr.io/siderolabs/installer:v1.13.0

# Upgrade worker nodes
talosctl upgrade \
  --nodes 192.168.30.24 \
  --image ghcr.io/siderolabs/installer:v1.13.0
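
Before moving to the next node, confirm the upgraded one has rejoined and etcd is healthy:

# Check cluster and etcd health between upgrades
talosctl health --nodes 192.168.30.21
talosctl etcd status --nodes 192.168.30.21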

Scaling the Cluster

To add more nodes, update terraform.tfvars:

# Increase worker count
worker_count = 5

# Add new IPs
worker_ips = [
  "192.168.30.24/24",
  "192.168.30.25/24",
  "192.168.30.26/24",
  "192.168.30.27/24",  # New
  "192.168.30.28/24",  # New
]

Then apply:

tofu plan
tofu apply

Troubleshooting

Node Not Ready

# Check node status
kubectl describe node <node-name>

# Check kubelet logs
talosctl logs kubelet --nodes <node-ip>

# Check CNI pods
kubectl get pods -n kube-system -l app.kubernetes.io/name=cilium

VIP Not Responding

The VIP only becomes active after bootstrap. Before bootstrap, connect directly to a control plane IP:

# Direct connection before VIP is active
talosctl --nodes 192.168.30.21 health

etcd Issues

# Check etcd status
talosctl etcd status --nodes 192.168.30.21

# View etcd members
talosctl etcd members --nodes 192.168.30.21

# Check etcd logs
talosctl logs etcd --nodes 192.168.30.21

Metrics Server TLS Errors

If metrics-server shows certificate errors, ensure the kubelet patch includes:

machine:
  kubelet:
    extraConfig:
      serverTLSBootstrap: true

And the kubelet-serving-cert-approver is deployed:

cluster:
  extraManifests:
    - https://raw.githubusercontent.com/alex1989hu/kubelet-serving-cert-approver/main/deploy/standalone-install.yaml

Reset a Node

Data Loss Warning

This will wipe all data on the node.

# Reset a specific node
talosctl reset --nodes 192.168.30.24 --graceful=false
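
After the reset, the node reboots into maintenance mode. For a manually provisioned node, re-apply its machine config to rejoin the cluster; a Terraform-managed node can instead be recreated with tofu apply:

# Re-apply the node's machine config after the reset (manual workflow)
talosctl apply-config --insecure \
  --nodes 192.168.30.24 \
  --file talos-config/worker.yaml \
  --config-patch @talos/testing/patches/wk1.yaml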

Terraform State Management

Configure S3 backend in main.tf:

terraform {
  backend "s3" {
    bucket         = "talos-terraform-state"
    key            = "testing/terraform.tfstate"
    region         = "af-south-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
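
After adding or changing the backend, re-initialize and migrate any existing local state:

# Migrate existing local state into the S3 backend
tofu init -migrate-state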

Importing Existing Resources

# Import existing VM
tofu import 'module.control_plane[0].proxmox_virtual_environment_vm.this' pve2/qemu/501

Destroy Cluster

# Destroy all resources
tofu destroy

# Destroy specific resources
tofu destroy -target=module.worker

Security Considerations

  1. Proxmox Credentials: Use the TF_VAR_proxmox_password environment variable instead of storing the password in terraform.tfvars
  2. Machine Secrets: Backup and secure talos-secrets-backup.yaml - required for cluster recovery
  3. Network Isolation: Consider VLANs for production environments
  4. API Access: The VIP (192.168.30.20) should be accessible only from trusted networks

Reference

Terraform Providers

Provider           Version     Purpose
bpg/proxmox        ~> 0.86.0   Proxmox VM management
siderolabs/talos   ~> 0.9.0    Talos configuration