Troubleshooting Guide

Comprehensive troubleshooting guide for common issues in the RCIIS DevOps platform.

General Troubleshooting Approach

Diagnostic Methodology

  1. Identify symptoms: Gather error messages and logs
  2. Isolate the problem: Narrow down the scope
  3. Check recent changes: Review recent deployments or configurations
  4. Verify dependencies: Ensure all required services are running
  5. Apply fixes: Implement solutions systematically
  6. Verify resolution: Confirm the issue is resolved
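The steps above can be sketched as a small shell helper that gathers a first diagnostic pass for one namespace. This is a hypothetical `triage` function, not part of the platform tooling; adapt the output sections to taste.

```shell
# Hypothetical first-pass triage for a namespace: pod status, recent
# events, and logs from any pod that is not Running.
triage() {
  ns="$1"
  echo "== Pods in $ns =="
  kubectl get pods -n "$ns"
  echo "== Recent events in $ns =="
  kubectl get events -n "$ns" --sort-by=.metadata.creationTimestamp | tail -20
  echo "== Logs from non-running pods =="
  kubectl get pods -n "$ns" --field-selector=status.phase!=Running -o name |
  while read -r pod; do
    echo "--- $pod ---"
    kubectl logs "$pod" -n "$ns" --tail=50
  done
}

# Usage: triage <namespace>
```

Running this first usually narrows the problem to one of the sections below.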

Essential Commands

# Check cluster status
kubectl get nodes
kubectl get pods --all-namespaces
kubectl top nodes
kubectl top pods --all-namespaces

# Check specific namespace
kubectl get all -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100

# Check events
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp

Application Issues

Pod Not Starting

Symptoms:

  - Pod stuck in Pending, CrashLoopBackOff, or ImagePullBackOff state
  - Application not responding to health checks

Diagnosis:

# Check pod status and events
kubectl describe pod <pod-name> -n <namespace>

# Check logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

# Check resource availability
kubectl describe node <node-name>
kubectl top nodes

Common Causes and Solutions:

  1. Insufficient Resources

    # Check resource requests vs available
    kubectl describe node <node-name>
    
    # Solution: Reduce resource requests or add nodes
    kubectl patch deployment <deployment> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"requests":{"memory":"256Mi","cpu":"100m"}}}]}}}}'
    

  2. Image Pull Issues

    # Check image name and registry access
    kubectl describe pod <pod-name> -n <namespace>
    
    # Solution: Verify image exists and credentials are correct
    kubectl create secret docker-registry harbor-registry \
      --docker-server=harbor.devops.africa \
      --docker-username=<username> \
      --docker-password=<password> \
      -n <namespace>
    

  3. Configuration Issues

    # Check configmaps and secrets
    kubectl get configmap -n <namespace>
    kubectl get secret -n <namespace>
    
    # Solution: Verify configuration exists and is correctly mounted
    kubectl describe configmap <configmap-name> -n <namespace>
    

Service Connection Issues

Symptoms:

  - Services unable to communicate
  - DNS resolution failures
  - Connection timeouts

Diagnosis:

# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Test DNS resolution
kubectl run debug --image=busybox -i --tty --rm -- /bin/sh
# Inside pod:
nslookup <service-name>.<namespace>.svc.cluster.local
wget -qO- http://<service-name>.<namespace>:8080/health

# Check network policies
kubectl get networkpolicy -n <namespace>

Common Solutions:

  1. Service Selector Mismatch

    # Check service selector matches pod labels
    kubectl get service <service-name> -o yaml
    kubectl get pods -l <label-selector> -n <namespace>
    

  2. Network Policy Blocking

    # Temporarily disable network policies for testing
    # (caution: this removes ALL policies in the namespace; avoid in production)
    kubectl delete networkpolicy --all -n <namespace>
    
    # Re-apply with correct rules
    kubectl apply -f correct-network-policy.yaml
    

  3. Port Configuration

    # Verify service ports match container ports
    kubectl describe service <service-name> -n <namespace>
    kubectl describe pod <pod-name> -n <namespace>
    

Infrastructure Issues

ArgoCD Sync Failures

Symptoms:

  - Applications stuck in OutOfSync state
  - Sync operations failing
  - Resource conflicts

Diagnosis:

# Check application status
argocd app list
argocd app get <app-name>

# Check sync history
argocd app history <app-name>

# Check resource differences
argocd app diff <app-name>

Common Solutions:

  1. Resource Conflicts

    # Force sync with pruning
    argocd app sync <app-name> --force --prune
    
    # Delete conflicting resources manually
    kubectl delete <resource-type> <resource-name> -n <namespace>
    

  2. RBAC Issues

    # Check ArgoCD service account permissions
    kubectl describe clusterrolebinding argocd-application-controller
    
    # Verify project permissions
    argocd proj get <project-name>
    

  3. Repository Access

    # Check repository connection
    argocd repo list
    argocd repo get <repo-url>
    
    # Update repository credentials
    argocd repo add <repo-url> --ssh-private-key-path ~/.ssh/id_rsa
    

Certificate Issues

Symptoms:

  - SSL/TLS connection failures
  - Certificate not found errors
  - Expired certificate warnings

Diagnosis:

# Check certificate status
kubectl get certificate -A
kubectl describe certificate <cert-name> -n <namespace>

# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager

# Check certificate details
kubectl get secret <cert-secret> -o yaml -n <namespace>
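The secret's data fields are base64-encoded, so decode a field before inspecting it. The jsonpath key below assumes a standard TLS secret layout (`tls.crt`); a quick sketch:

```shell
# Secret values are base64-encoded; decode before inspecting, e.g.:
#   kubectl get secret <cert-secret> -n <namespace> \
#     -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
# The same decoding applies to any secret field:
printf '%s' 'cGFzc3dvcmQ=' | base64 -d && echo   # decodes to: password
```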

Common Solutions:

  1. Certificate Not Issued

    # Check certificate challenges
    kubectl get challenge -A
    kubectl describe challenge <challenge-name> -n <namespace>
    
    # Check DNS resolution
    nslookup <domain-name>
    
    # Force certificate renewal; cert-manager re-issues once the Certificate
    # object is re-created (e.g. by ArgoCD or the ingress-shim annotation)
    kubectl delete certificate <cert-name> -n <namespace>
    

  2. ClusterIssuer Issues

    # Check cluster issuer status
    kubectl describe clusterissuer <issuer-name>
    
    # Verify ACME configuration
    kubectl get secret <issuer-secret> -o yaml -n cert-manager
    

Storage Issues

Symptoms:

  - Pods stuck in Pending with volume mount errors
  - Database connection failures
  - File system errors

Diagnosis:

# Check persistent volumes and claims
kubectl get pv,pvc -A

# Check storage class
kubectl get storageclass

# Check volume mount issues
kubectl describe pod <pod-name> -n <namespace>

Common Solutions:

  1. Volume Not Available

    # Check PVC status
    kubectl describe pvc <pvc-name> -n <namespace>
    
    # Verify storage class exists
    kubectl get storageclass <storage-class>
    
    # Create missing storage class
    kubectl apply -f storage-class.yaml
    

  2. Permission Issues

    # Fix volume permissions
    kubectl exec <pod-name> -n <namespace> -- chown -R 1001:1001 /data
    
    # Use init container for permission fix
    kubectl patch deployment <deployment> -p '{"spec":{"template":{"spec":{"initContainers":[{"name":"fix-permissions","image":"busybox","command":["chown","-R","1001:1001","/data"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}}}'
    

Database Issues

SQL Server Connection Problems

Symptoms:

  - Application unable to connect to database
  - Login failures
  - Connection timeout errors

Diagnosis:

# Check SQL Server pod status
kubectl get pods -l app=mssql -n database

# Check SQL Server logs
kubectl logs <mssql-pod> -n database

# Test connection
kubectl exec -it <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password>

Common Solutions:

  1. Connection String Issues

    # Verify connection string secret
    kubectl get secret <db-secret> -o yaml -n <namespace>
    
    # Test connectivity from application pod
    kubectl exec <app-pod> -n <namespace> -- telnet <mssql-service> 1433
    

  2. Authentication Issues

    # Reset SA password
    kubectl exec <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <old-password> -Q "ALTER LOGIN sa WITH PASSWORD='<new-password>'"
    
    # Update secret with new password
    kubectl patch secret <db-secret> -p '{"data":{"password":"<base64-encoded-password>"}}' -n <namespace>
    

  3. Database Not Ready

    # Check database initialization
    kubectl exec <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT name FROM sys.databases"
    
    # Run database migrations
    kubectl exec <app-pod> -n <namespace> -- dotnet ef database update
    
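The secret patch in step 2 expects the password base64-encoded with no trailing newline; a quick sketch (the literal value here is only an example):

```shell
# Encode a new password for the secret patch; printf avoids the trailing
# newline that echo would add (which would corrupt the credential).
NEW_PASSWORD='S3cret!'                          # example value only
ENCODED=$(printf '%s' "$NEW_PASSWORD" | base64)
echo "$ENCODED"                                 # prints: UzNjcmV0IQ==
# kubectl patch secret <db-secret> -n <namespace> \
#   -p "{\"data\":{\"password\":\"$ENCODED\"}}"
```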

Message Queue Issues

Kafka Connection Problems

Symptoms:

  - Producers unable to send messages
  - Consumers not receiving messages
  - Broker connection failures

Diagnosis:

# Check Kafka cluster status
kubectl get kafka kafka-cluster -n kafka

# Check Kafka pods
kubectl get pods -l strimzi.io/cluster=kafka-cluster -n kafka

# Check topic status
kubectl get kafkatopic -n kafka

# Test producer/consumer
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test-topic

Common Solutions:

  1. Broker Not Ready

    # Check broker logs
    kubectl logs kafka-cluster-kafka-0 -n kafka
    
    # Restart brokers if needed
    kubectl delete pod kafka-cluster-kafka-0 -n kafka
    

  2. Topic Issues

    # List topics
    kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
    
    # Create missing topic
    kubectl apply -f - <<EOF
    apiVersion: kafka.strimzi.io/v1beta2
    kind: KafkaTopic
    metadata:
      name: <topic-name>
      namespace: kafka
    spec:
      partitions: 3
      replicas: 2
    EOF
    

  3. Consumer Group Issues

    # Check consumer group status
    kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>
    
    # Reset consumer group offset
    kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group <group-id> --reset-offsets --to-earliest --topic <topic-name> --execute
    

Security Issues

SOPS Decryption Failures

Symptoms:

  - Secrets not decrypted in pods
  - KSOPS plugin errors
  - Age key issues

Diagnosis:

# Test SOPS decryption manually
sops --decrypt apps/rciis/secrets/staging/nucleus/appsettings.yaml

# Check Age key
echo $SOPS_AGE_KEY_FILE
cat $SOPS_AGE_KEY_FILE

# Check KSOPS plugin
kustomize build --enable-alpha-plugins apps/rciis/nucleus/staging/

Common Solutions:

  1. Missing Age Key

    # Generate new Age key
    age-keygen -o ~/.age/key.txt
    
    # Update SOPS configuration
    export SOPS_AGE_KEY_FILE=~/.age/key.txt
    
    # Re-encrypt secrets with new key
    sops updatekeys apps/rciis/secrets/staging/nucleus/appsettings.yaml
    

  2. KSOPS Plugin Issues

    # Install KSOPS plugin
    curl -Lo ksops https://github.com/viaduct-ai/kustomize-sops/releases/latest/download/ksops_linux_amd64
    chmod +x ksops
    sudo mv ksops /usr/local/bin/
    
    # Verify the plugin by building a kustomization that uses it
    kustomize build --enable-alpha-plugins apps/rciis/nucleus/staging/
    

RBAC Permission Issues

Symptoms:

  - Access denied errors
  - ServiceAccount permission failures
  - Unauthorized API calls

Diagnosis:

# Check current permissions
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<serviceaccount>

# Check role bindings
kubectl get rolebinding,clusterrolebinding -A | grep <serviceaccount>

# Check role definitions
kubectl describe role <role-name> -n <namespace>

Common Solutions:

  1. Missing Permissions

    # Create role with required permissions
    kubectl create role <role-name> --verb=get,list,watch --resource=pods,services -n <namespace>
    
    # Bind role to service account
    kubectl create rolebinding <binding-name> --role=<role-name> --serviceaccount=<namespace>:<serviceaccount> -n <namespace>
    

  2. ClusterRole Issues

    # Check cluster role
    kubectl describe clusterrole <clusterrole-name>
    
    # Update cluster role
    kubectl patch clusterrole <clusterrole-name> --type='json' -p='[{"op":"add","path":"/rules/-","value":{"apiGroups":[""],"resources":["secrets"],"verbs":["get","list"]}}]'
    
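Inline JSON patches like the one above fail with an opaque error when the JSON is malformed; validating the string locally first can save a round trip. A sketch, assuming `python3` is available on the workstation:

```shell
# Validate a JSON patch string before handing it to kubectl patch.
PATCH='[{"op":"add","path":"/rules/-","value":{"apiGroups":[""],"resources":["secrets"],"verbs":["get","list"]}}]'
if printf '%s' "$PATCH" | python3 -m json.tool >/dev/null 2>&1; then
  echo "patch is valid JSON"
else
  echo "patch is NOT valid JSON" >&2
fi
```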

Performance Issues

High Resource Usage

Symptoms:

  - Pods being OOMKilled
  - High CPU usage
  - Slow response times

Diagnosis:

# Check resource usage
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu

# Check resource limits
kubectl describe pod <pod-name> -n <namespace>

# Check node resources
kubectl describe node <node-name>

Common Solutions:

  1. Memory Issues

    # Increase memory limits
    kubectl patch deployment <deployment> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"2Gi"}}}]}}}}'
    
    # Enable horizontal pod autoscaling (requires CPU requests on the pods)
    kubectl autoscale deployment <deployment> --cpu-percent=70 --min=2 --max=10 -n <namespace>
    

  2. CPU Issues

    # Increase CPU limits
    kubectl patch deployment <deployment> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"cpu":"1000m"}}}]}}}}'
    
    # Check for CPU throttling
    kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu/cpu.stat
    

Network Performance

Symptoms:

  - Slow network communication
  - High latency between services
  - Packet loss

Diagnosis:

# Test network connectivity
kubectl exec <pod-a> -n <namespace> -- ping <pod-b-ip>
kubectl exec <pod-a> -n <namespace> -- iperf3 -c <service-name> -p 5201

# Check Cilium status
cilium status
cilium connectivity test

# Monitor network traffic
kubectl exec <pod-name> -n <namespace> -- tcpdump -i eth0

Solutions:

# Restart Cilium agents
kubectl delete pods -l k8s-app=cilium -n kube-system

# Check CNI configuration
kubectl describe node <node-name>

# Optimize network policies
kubectl get networkpolicy -A

Emergency Procedures

Complete Service Outage

  1. Immediate Response

    # Check cluster status
    kubectl get nodes
    kubectl get pods --all-namespaces | grep -v Running
    
    # Check critical services
    kubectl get pods -n argocd
    kubectl get pods -n ingress-nginx
    kubectl get pods -n cert-manager
    

  2. Rollback Procedures

    # Rollback ArgoCD application
    argocd app rollback <app-name> <revision-id>
    
    # Rollback Kubernetes deployment
    kubectl rollout undo deployment/<deployment-name> -n <namespace>
    
    # Scale down problematic deployment
    kubectl scale deployment <deployment-name> --replicas=0 -n <namespace>
    

  3. Communication

    # Post incident status
    curl -X POST $SLACK_WEBHOOK -d '{"text":"🚨 INCIDENT: <description> - Investigating"}'
    
    # Update status page
    # (Update external status page if available)
    

Data Recovery

  1. Database Recovery

    # Stop application
    kubectl scale deployment <app-deployment> --replicas=0 -n <namespace>
    
    # Restore from backup
    kubectl exec <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "RESTORE DATABASE [DB] FROM DISK = '/backup/latest.bak' WITH REPLACE"
    
    # Verify restoration
    kubectl exec <mssql-pod> -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT COUNT(*) FROM [Table]"
    
    # Restart application
    kubectl scale deployment <app-deployment> --replicas=2 -n <namespace>
    

Monitoring and Alerting

Setting Up Alerts

# Critical alert rules
groups:
- name: critical.rules
  rules:
  - alert: PodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.pod }} is crash looping"

  - alert: HighMemoryUsage
    expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.pod }}"

Log Analysis

# Aggregate error logs
kubectl logs -l app=nucleus -n nucleus --tail=1000 | grep ERROR

# Export logs for analysis
kubectl logs <pod-name> -n <namespace> --since=1h > /tmp/pod-logs.txt

# Search for specific patterns
kubectl logs -l app=nucleus -n nucleus | grep -E "(Exception|Error|Failed)"

For specific component troubleshooting, refer to the individual service documentation.