Proxmox Talos Kubernetes Cluster¶
This guide covers deploying a production-ready Talos Kubernetes cluster on Proxmox VE using Infrastructure as Code (OpenTofu/Terraform).
Overview¶
The testing environment uses:
- Proxmox VE: Virtualization platform
- Talos Linux: Immutable, minimal OS designed for Kubernetes
- OpenTofu/Terraform: Infrastructure as Code for repeatable deployments
- Cilium: eBPF-based CNI with kube-proxy replacement
Cluster Architecture¶
┌────────────────────────────────────────────────────────────────────┐
│                       Proxmox VE Host (pve2)                       │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│    ┌──────────────┐     ┌──────────────┐     ┌──────────────┐      │
│    │     CP-01    │     │     CP-02    │     │     CP-03    │      │
│    │ 192.168.30.21│     │ 192.168.30.22│     │ 192.168.30.23│      │
│    │  VM ID: 501  │     │  VM ID: 502  │     │  VM ID: 503  │      │
│    └──────┬───────┘     └──────┬───────┘     └──────┬───────┘      │
│           │                    │                    │              │
│           └────────────────────┼────────────────────┘              │
│                                │                                   │
│                      VIP: 192.168.30.20                            │
│                      (Kubernetes API)                              │
│                                │                                   │
│           ┌────────────────────┼────────────────────┐              │
│           │                    │                    │              │
│    ┌──────┴───────┐     ┌──────┴───────┐     ┌──────┴───────┐      │
│    │     WN-01    │     │     WN-02    │     │     WN-03    │      │
│    │ 192.168.30.24│     │ 192.168.30.25│     │ 192.168.30.26│      │
│    │  VM ID: 504  │     │  VM ID: 505  │     │  VM ID: 506  │      │
│    │ +100GB data  │     │ +100GB data  │     │ +100GB data  │      │
│    └──────────────┘     └──────────────┘     └──────────────┘      │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
Prerequisites¶
Software Requirements¶
| Tool | Version | Purpose |
|---|---|---|
| OpenTofu/Terraform | >= 1.5 | Infrastructure provisioning |
| talosctl | v1.12.0 | Talos cluster management |
| kubectl | >= 1.32 | Kubernetes management |
| helm | >= 3.x | Cilium installation |
Proxmox Requirements¶
- Proxmox VE version 8.x or later
- Storage pool with sufficient space (recommended: 500GB+)
- Network bridge (vmbr0) configured
- Talos nocloud template (VM ID 9001)
Network Requirements¶
| Resource | IP/Range | Purpose |
|---|---|---|
| Control Plane VIP | 192.168.30.20 | Kubernetes API HA endpoint |
| Control Plane Nodes | 192.168.30.21-23 | etcd, API server, scheduler |
| Worker Nodes | 192.168.30.24-26 | Application workloads |
| Gateway | 192.168.30.1 | Network gateway |
| Pod CIDR | 100.64.0.0/16 | Pod networking |
| Service CIDR | 172.20.0.0/16 | Service networking |
Proxmox Setup¶
Step 1: Download Talos Image¶
Download the Talos nocloud image for Proxmox:
# Download the latest Talos nocloud image
wget https://github.com/siderolabs/talos/releases/download/v1.12.0/nocloud-amd64.raw.xz
# Extract the image
xz -d nocloud-amd64.raw.xz
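The template is created on the Proxmox host, so the extracted image needs to be available there. If you downloaded it on your workstation, one option is to copy it across (the destination path below is only an example):
# Copy the extracted image to the Proxmox host (example destination path)
scp nocloud-amd64.raw [email protected]:/var/lib/vz/template/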
Step 2: Create Template VM¶
Create a template VM in Proxmox that will be cloned for all nodes:
# SSH to Proxmox host
ssh [email protected]
# Create a new VM for the template
qm create 9001 --name talos-template --memory 2048 --cores 2 --net0 virtio,bridge=vmbr0
# Import the Talos disk
qm importdisk 9001 /path/to/nocloud-amd64.raw local-lvm
# Attach the disk to the VM
qm set 9001 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9001-disk-0
# Set boot order
qm set 9001 --boot order=scsi0
# Enable QEMU guest agent
qm set 9001 --agent enabled=1
# Add cloud-init drive
qm set 9001 --ide2 local-lvm:cloudinit
# Convert to template
qm template 9001
Step 3: Verify Template¶
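Confirm the template exists and review its configuration before cloning, for example:
# Confirm the template VM is present
qm list | grep 9001
# Review the template configuration (disk, cloud-init drive, guest agent)
qm config 9001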
Terraform Deployment¶
Directory Structure¶
terraform/
├── environments/
│   └── testing/
│       ├── main.tf              # Main configuration
│       ├── variables.tf         # Variable definitions
│       ├── outputs.tf           # Output definitions
│       ├── terraform.tfvars     # Variable values
│       └── terraform.tfvars.example
└── modules/
    ├── proxmox/
    │   └── vm/                  # Proxmox VM module
    └── talos/
        └── proxmox/             # Talos configuration module
Step 1: Configure Variables¶
Create or update terraform.tfvars:
# Proxmox Connection
proxmox_endpoint = "https://192.168.30.10:8006"
proxmox_username = "root@pam"
proxmox_password = "your-password" # Use TF_VAR_proxmox_password
proxmox_insecure = true
proxmox_ssh_username = "root"
proxmox_node = "pve2"
# Cluster Configuration
cluster_name = "rciis-testing"
control_plane_count = 3
worker_count = 3
# VM IDs
control_plane_vm_id_start = 501
worker_vm_id_start = 504
template_vm_id = 9001
# Control Plane Resources
control_plane_cpu_cores = 2
control_plane_memory_mb = 4096
control_plane_disk_size_gb = 50
# Worker Resources
worker_cpu_cores = 2
worker_memory_mb = 4096
worker_disk_size_gb = 50
worker_data_disk_size_gb = 100 # Additional data disk
# Network
network_bridge = "vmbr0"
ipv4_gateway = "192.168.30.1"
dns_servers = ["1.1.1.1", "192.168.10.17"]
control_plane_ips = [
"192.168.30.21/24",
"192.168.30.22/24",
"192.168.30.23/24",
]
worker_ips = [
"192.168.30.24/24",
"192.168.30.25/24",
"192.168.30.26/24",
]
# High Availability
control_plane_vip = "192.168.30.20"
control_plane_vip_interface = "eth0"
# Versions
talos_version = "v1.12.0"
kubernetes_version = "1.32.0"
# CNI Configuration
cni_name = "none" # Install Cilium manually
disable_kube_proxy = true # Cilium replaces kube-proxy
tags = ["rciis", "testing"]
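Rather than keeping the Proxmox password in terraform.tfvars, you can supply it through the standard TF_VAR_ environment variable mechanism:
# Supplies var.proxmox_password without writing it to disk
export TF_VAR_proxmox_password='your-password'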
Step 2: Initialize and Plan¶
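From the testing environment directory, initialize the providers and review the plan (tofu is shown; terraform works identically):
cd terraform/environments/testing
# Download the bpg/proxmox and siderolabs/talos providers
tofu init
# Preview the resources to be created
tofu plan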
Expected output summary:
Plan: 21 to add, 0 to change, 0 to destroy.
Resources to create:
- 3 control plane VMs (501-503)
- 3 worker VMs (504-506) with data disks
- Talos machine secrets
- Talos machine configurations
- Talos bootstrap
- Kubeconfig and talosconfig
Step 3: Apply Configuration¶
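Apply the plan once it matches the summary above:
# Provision the VMs, apply Talos machine configs, and bootstrap the cluster
tofu apply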
Deployment Time
Initial deployment takes 10-15 minutes. The process:
- Clones VMs from template (~2 min per VM)
- Configures cloud-init networking
- Applies Talos machine configuration
- Bootstraps the cluster
- Waits for kubeconfig
Step 4: Save Credentials¶
After successful deployment, save the cluster credentials:
# Create config directories
mkdir -p ~/.talos ~/.kube
# Save talosconfig
tofu output -raw talosconfig > ~/.talos/config
# Save kubeconfig
tofu output -raw kubeconfig > ~/.kube/config
# Backup machine secrets (store securely!)
tofu output -raw machine_secrets > talos-secrets-backup.yaml
chmod 600 talos-secrets-backup.yaml
Post-Deployment Configuration¶
Bootstrap with Helmfile¶
The recommended approach is to use the bootstrap helmfile which installs both Cilium CNI and ArgoCD in a single command:
cd scripts
# Bootstrap the cluster (install Cilium + ArgoCD)
helmfile -f helmfile-bootstrap.yaml -e testing sync
This will:
- Install Cilium CNI with Talos-specific settings
- Apply L2 IP Pool for LoadBalancer services
- Install ArgoCD in HA mode with KSOPS support for SOPS secrets
- Apply ArgoCD repository secrets (git, OCI registry)
Verify Cluster Health¶
# Check Talos cluster health
talosctl health --nodes 192.168.30.21
# View Talos dashboard
talosctl dashboard --nodes 192.168.30.21
# Check Kubernetes nodes
kubectl get nodes -o wide
# Verify Cilium is running
kubectl get pods -n kube-system -l app.kubernetes.io/name=cilium
# Check Cilium status
cilium status
# Verify all pods are running
kubectl get pods -A
# Check for pending CSRs
kubectl get csr
Expected kubectl get nodes output:
NAME          STATUS   ROLES           AGE   VERSION   INTERNAL-IP
rciis-cp-01   Ready    control-plane   10m   v1.32.0   192.168.30.21
rciis-cp-02   Ready    control-plane   10m   v1.32.0   192.168.30.22
rciis-cp-03   Ready    control-plane   10m   v1.32.0   192.168.30.23
rciis-wn-01   Ready    <none>          10m   v1.32.0   192.168.30.24
rciis-wn-02   Ready    <none>          10m   v1.32.0   192.168.30.25
rciis-wn-03   Ready    <none>          10m   v1.32.0   192.168.30.26
Access ArgoCD¶
Get the initial admin password:
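A typical way to read it, assuming the chart's default argocd-initial-admin-secret:
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d; echo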
Access ArgoCD UI via port-forward:
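For example (the service name may differ depending on the chart's name overrides):
kubectl -n argocd port-forward svc/argocd-server 8080:443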
Then open https://localhost:8080 and login with username admin and the password retrieved above.
Manual Installation (Alternative)¶
If you prefer to install components manually without helmfile:
# Install Cilium
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium \
--namespace kube-system \
-f apps/infra/cilium/testing/values.yaml
# Apply L2 pool
kubectl apply -f apps/infra/cilium/testing/l2-pool.yaml
# Install ArgoCD
helm repo add argo https://argoproj.github.io/argo-helm
kubectl create namespace argocd
helm install argocd argo/argo-cd \
--namespace argocd \
-f apps/infra/argocd/testing/values.yaml
# Apply secrets (requires SOPS)
kustomize build --enable-alpha-plugins --enable-exec apps/infra/secrets/testing | kubectl apply -f -
Manual Talos Configuration¶
For manual configuration without Terraform, use talosctl directly.
Generate Talos Configuration¶
# Generate secrets
talosctl gen secrets -o secrets.yaml
# Generate configuration
talosctl gen config rciis-testing https://192.168.30.20:6443 \
--with-secrets secrets.yaml \
--output-dir ./talos-config
Create Node Patches¶
Create patch files for each node in talos/testing/patches/. For example, a control plane patch (cp1.yaml) that sets the node's static IP and the shared VIP:
machine:
  kubelet:
    extraConfig:
      serverTLSBootstrap: true
  network:
    interfaces:
      - interface: eth0
        addresses:
          - 192.168.30.21/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.30.1
        vip:
          ip: 192.168.30.20
    nameservers:
      - 1.1.1.1
      - 192.168.10.17
  install:
    disk: /dev/sda
    image: ghcr.io/siderolabs/installer:v1.12.0
    wipe: true
  features:
    kubePrism:
      enabled: true
      port: 7445
cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true
  extraManifests:
    - https://raw.githubusercontent.com/alex1989hu/kubelet-serving-cert-approver/main/deploy/standalone-install.yaml
    - https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
And a worker patch (wk1.yaml), which omits the VIP:
machine:
  kubelet:
    extraConfig:
      serverTLSBootstrap: true
  network:
    interfaces:
      - interface: eth0
        addresses:
          - 192.168.30.24/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.30.1
    nameservers:
      - 1.1.1.1
      - 192.168.10.17
  install:
    disk: /dev/sda
    image: ghcr.io/siderolabs/installer:v1.12.0
    wipe: true
  features:
    kubePrism:
      enabled: true
      port: 7445
Apply Configuration¶
# Apply to control plane nodes
talosctl apply-config --insecure \
--nodes 192.168.30.21 \
--file talos-config/controlplane.yaml \
--config-patch @talos/testing/patches/cp1.yaml
# Apply to worker nodes
talosctl apply-config --insecure \
--nodes 192.168.30.24 \
--file talos-config/worker.yaml \
--config-patch @talos/testing/patches/wk1.yaml
# Bootstrap the cluster (run once on first control plane)
talosctl bootstrap --nodes 192.168.30.21
# Get kubeconfig
talosctl kubeconfig --nodes 192.168.30.21
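The bootstrap and kubeconfig commands above assume talosctl has a client configuration for the new cluster. One way to set that up, as a sketch, is to merge the generated talosconfig and point it at the control plane:
# Merge the generated client config into ~/.talos/config
talosctl config merge ./talos-config/talosconfig
# Default endpoints and node for subsequent talosctl commands
talosctl config endpoint 192.168.30.21 192.168.30.22 192.168.30.23
talosctl config node 192.168.30.21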
Cluster Management¶
Talos Commands¶
# View cluster members
talosctl get members --nodes 192.168.30.21
# Check etcd status
talosctl etcd status --nodes 192.168.30.21
# View service status
talosctl services --nodes 192.168.30.21
# Stream kubelet logs
talosctl logs -f kubelet --nodes 192.168.30.21
# Interactive dashboard
talosctl dashboard --nodes 192.168.30.21
Upgrading Talos¶
# Check current version
talosctl version --nodes 192.168.30.21
# Upgrade control plane nodes (one at a time)
talosctl upgrade \
--nodes 192.168.30.21 \
--image ghcr.io/siderolabs/installer:v1.13.0
# Upgrade worker nodes
talosctl upgrade \
--nodes 192.168.30.24 \
--image ghcr.io/siderolabs/installer:v1.13.0
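Between upgrades, confirm the node has rejoined and reports healthy before moving to the next one:
# Verify the upgraded node before continuing
talosctl health --nodes 192.168.30.21
kubectl get nodes -o wide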
Scaling the Cluster¶
To add more nodes, update terraform.tfvars:
# Increase worker count
worker_count = 5
# Add new IPs
worker_ips = [
"192.168.30.24/24",
"192.168.30.25/24",
"192.168.30.26/24",
"192.168.30.27/24", # New
"192.168.30.28/24", # New
]
Then apply:
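# Review and apply the change
tofu plan
tofu apply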
Troubleshooting¶
Node Not Ready¶
# Check node status
kubectl describe node <node-name>
# Check kubelet logs
talosctl logs kubelet --nodes <node-ip>
# Check CNI pods
kubectl get pods -n kube-system -l app.kubernetes.io/name=cilium
VIP Not Responding¶
The VIP only becomes active after bootstrap. Before bootstrap, connect directly to a control plane IP:
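For example, target the node's own address with talosctl and kubectl:
# Talos API directly on the first control plane
talosctl --nodes 192.168.30.21 --endpoints 192.168.30.21 version
# Kubernetes API on the node address (add --insecure-skip-tls-verify if the
# node IP is not in the API server certificate SANs)
kubectl --server https://192.168.30.21:6443 get nodes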
etcd Issues¶
# Check etcd status
talosctl etcd status --nodes 192.168.30.21
# View etcd members
talosctl etcd members --nodes 192.168.30.21
# Check etcd logs
talosctl logs etcd --nodes 192.168.30.21
Metrics Server TLS Errors¶
If metrics-server shows certificate errors, ensure the kubelet patch includes:
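machine:
  kubelet:
    extraConfig:
      serverTLSBootstrap: true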
And the kubelet-serving-cert-approver is deployed:
cluster:
  extraManifests:
    - https://raw.githubusercontent.com/alex1989hu/kubelet-serving-cert-approver/main/deploy/standalone-install.yaml
Reset a Node¶
Data Loss Warning
This will wipe all data on the node.
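A typical invocation, targeting a worker node as an example:
# Wipe the node's disks and reboot it (destructive)
talosctl reset --nodes 192.168.30.24 --graceful=false --reboot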
Terraform State Management¶
Remote State (Recommended for Production)¶
Configure S3 backend in main.tf:
terraform {
  backend "s3" {
    bucket         = "talos-terraform-state"
    key            = "testing/terraform.tfstate"
    region         = "af-south-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
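After adding the backend block, re-initialize so any existing local state is migrated:
tofu init -migrate-state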
Importing Existing Resources¶
# Import existing VM
tofu import 'module.control_plane[0].proxmox_virtual_environment_vm.this' pve2/qemu/501
Destroy Cluster¶
# Destroy all resources
tofu destroy
# Destroy specific resources
tofu destroy -target=module.worker
Security Considerations¶
- Proxmox Credentials: Use the TF_VAR_proxmox_password environment variable instead of storing the password in tfvars
- Machine Secrets: Back up and secure talos-secrets-backup.yaml - it is required for cluster recovery
- Network Isolation: Consider VLANs for production environments
- API Access: The VIP (192.168.30.20) should be accessible only from trusted networks
Reference¶
Terraform Providers¶
| Provider | Version | Purpose |
|---|---|---|
| bpg/proxmox | ~> 0.86.0 | Proxmox VM management |
| siderolabs/talos | ~> 0.9.0 | Talos configuration |