Operations Troubleshooting

Operational troubleshooting procedures for the RCIIS DevOps platform, focusing on deployment, infrastructure, and service issues.

Overview

This guide provides systematic troubleshooting approaches for operational issues, including deployment failures, service disruptions, and infrastructure problems.

Troubleshooting Methodology

Standard Operating Procedure

  1. Assess Impact: Determine severity and affected systems
  2. Gather Information: Collect logs, metrics, and status information
  3. Identify Root Cause: Systematic elimination of potential causes
  4. Implement Fix: Apply corrective measures
  5. Verify Resolution: Confirm issue is resolved
  6. Document: Record findings and preventive measures
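Step 1 (Assess Impact) can be codified as a small helper for consistent triage. The severity labels and thresholds below are illustrative assumptions, not an RCIIS standard:

```shell
#!/bin/sh
# Hypothetical severity classifier: maps the number of affected services and
# whether production is impacted to a severity level. Thresholds are illustrative.
severity() {
  affected=$1   # count of affected services
  prod=$2       # "yes" if production is impacted
  if [ "$prod" = "yes" ] && [ "$affected" -gt 3 ]; then
    echo "SEV1"
  elif [ "$prod" = "yes" ]; then
    echo "SEV2"
  elif [ "$affected" -gt 3 ]; then
    echo "SEV3"
  else
    echo "SEV4"
  fi
}

severity 5 yes   # → SEV1
severity 1 no    # → SEV4
```

A consistent classifier makes the later escalation decision mechanical rather than judgment-based under pressure.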

Escalation Process

  • Level 1: Automated alerts and monitoring
  • Level 2: On-call engineer response
  • Level 3: Team lead and subject matter experts
  • Level 4: Management and external vendors
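The mapping from alert severity to escalation level can be sketched as a lookup; the severity names (`info`, `warning`, `critical`, `disaster`) are assumptions for illustration:

```shell
#!/bin/sh
# Hypothetical mapping from alert severity to the escalation levels above.
escalation_target() {
  case "$1" in
    info)     echo "Level 1: automated alerts and monitoring" ;;
    warning)  echo "Level 2: on-call engineer" ;;
    critical) echo "Level 3: team lead and subject matter experts" ;;
    disaster) echo "Level 4: management and external vendors" ;;
    *)        echo "unknown severity: $1" ;;
  esac
}

escalation_target critical   # → Level 3: team lead and subject matter experts
```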

Deployment Troubleshooting

ArgoCD Sync Failures

Symptoms:

  • Applications stuck in "OutOfSync" state
  • Sync operations failing or timing out
  • Resource conflicts preventing deployment

Diagnosis:

# Check application status
argocd app list | grep -v Synced
argocd app get <app-name>

# View sync history
argocd app history <app-name>

# Check resource differences
argocd app diff <app-name>

# Check ArgoCD logs
kubectl logs -n argocd deployment/argocd-application-controller --tail=100

Common Solutions:

  1. Resource Conflicts:

    # Force sync with pruning
    argocd app sync <app-name> --force --prune
    
    # Delete conflicting resources
    kubectl delete <resource-type> <resource-name> -n <namespace> --wait=true
    
    # Refresh and retry
    argocd app refresh <app-name>
    argocd app sync <app-name>
    

  2. Permission Issues:

    # Check service account permissions
    kubectl describe clusterrolebinding argocd-application-controller
    
    # Verify namespace access
    kubectl auth can-i "*" "*" --as=system:serviceaccount:argocd:argocd-application-controller -n <namespace>
    

  3. Repository Access:

    # Check repository connection
    argocd repo list
    argocd repo get <repo-url>
    
    # Test SSH key
    ssh -T [email protected]
    
    # Update repository credentials
    kubectl patch secret <repo-secret> -n argocd -p '{"data":{"sshPrivateKey":"<base64-encoded-key>"}}'
    

Helm Deployment Issues

Chart Installation Failures:

# Check Helm release status
helm list -A --all
helm status <release-name> -n <namespace>

# View release history
helm history <release-name> -n <namespace>

# Check for conflicts (server-side dry run without installing)
helm install <release-name> <chart> --values values.yaml --dry-run

# Debug template rendering
helm template <release-name> <chart> --values values.yaml --debug

Resolution Steps:

# Rollback failed release
helm rollback <release-name> <revision> -n <namespace>

# Uninstall and reinstall
helm uninstall <release-name> -n <namespace>
helm install <release-name> <chart> -n <namespace> --values values.yaml

# Force upgrade
helm upgrade <release-name> <chart> -n <namespace> --values values.yaml --force

Kustomize Build Failures

SOPS/KSOPS Issues:

# Test Kustomize build
kustomize build --enable-alpha-plugins --enable-exec <path>

# Check the KSOPS plugin binary exists on the expected plugin path
which ksops
ls -l "${XDG_CONFIG_HOME:-$HOME/.config}/kustomize/plugin/viaduct.ai/v1/ksops/ksops"

# Verify Age key
echo $SOPS_AGE_KEY_FILE
sops --decrypt <secret-file>

# Test decryption manually
export SOPS_AGE_KEY_FILE=~/.age/key.txt
sops --decrypt apps/rciis/secrets/staging/nucleus/appsettings.yaml

Service Troubleshooting

Pod Startup Issues

CrashLoopBackOff:

# Check pod status and restarts
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

# Check logs from previous container
kubectl logs <pod-name> -n <namespace> --previous

# Check resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits

# Check liveness/readiness probes
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Liveness

Resolution Strategies:

# Increase resource limits
kubectl patch deployment <deployment> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"1Gi","cpu":"500m"}}}]}}}}'

# Adjust probe timings
kubectl patch deployment <deployment> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","livenessProbe":{"initialDelaySeconds":60}}]}}}}'

# Debug with an ephemeral container (leaves the app image untouched)
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container>

Database Connection Issues

SQL Server Connectivity:

# Check SQL Server pod status
kubectl get pods -l app=mssql -n database

# Test connectivity from application pod
kubectl exec deployment/nucleus -n nucleus -- telnet mssql-service 1433

# Check connection string
kubectl get secret nucleus-database -n nucleus -o jsonpath='{.data.connection-string}' | base64 -d

# Test database query
kubectl exec -it mssql-0 -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT 1"

Common Database Issues:

# Database not ready
kubectl logs mssql-0 -n database | grep "SQL Server is now ready"

# Connection pool exhaustion
kubectl exec deployment/nucleus -n nucleus -- netstat -an | grep 1433 | wc -l

# Lock issues
kubectl exec -it mssql-0 -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT * FROM sys.dm_tran_locks"
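The connection-count check above can be turned into a simple threshold alert. The pool limit of 100 is an assumption; match it to the `Max Pool Size` in your connection string:

```shell
#!/bin/sh
# Hypothetical pool-exhaustion check: warn when established connections to the
# database reach 80% of an assumed pool limit.
check_pool() {
  count=$1
  limit=${2:-100}   # assumed Max Pool Size
  if [ "$count" -ge $(( limit * 80 / 100 )) ]; then
    echo "WARN: $count/$limit connections in use"
  else
    echo "OK: $count/$limit connections in use"
  fi
}

# Feed it the netstat count from the command above, e.g.:
# check_pool "$(kubectl exec deployment/nucleus -n nucleus -- netstat -an | grep 1433 | wc -l)"
check_pool 85   # → WARN: 85/100 connections in use
```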

Message Queue Issues

Kafka Connectivity Problems:

# Check Kafka cluster status
kubectl get kafka kafka-cluster -n kafka

# Check broker pods
kubectl get pods -l strimzi.io/cluster=kafka-cluster -n kafka

# Test producer
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test-topic

# Check consumer lag
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>

Kafka Troubleshooting:

# Check topic status
kubectl get kafkatopic -n kafka
kubectl describe kafkatopic <topic-name> -n kafka

# Check user permissions
kubectl get kafkauser -n kafka
kubectl describe kafkauser <user-name> -n kafka

# View broker logs
kubectl logs kafka-cluster-kafka-0 -n kafka | tail -100

Network Troubleshooting

Ingress and Load Balancer Issues

Service Unavailable (503) Errors:

# Check ingress controller status
kubectl get pods -n ingress-nginx
kubectl logs deployment/ingress-nginx-controller -n ingress-nginx --tail=100

# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Check backend pod health
kubectl get pods -l app=<app-label> -n <namespace>
kubectl exec <pod-name> -n <namespace> -- curl localhost:8080/health

DNS Resolution Issues:

# Test DNS from pod
kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>.<namespace>.svc.cluster.local

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs deployment/coredns -n kube-system

# Test external DNS
kubectl exec -it <pod-name> -n <namespace> -- nslookup google.com

Network Policy Debugging

Connection Blocked by Policy:

# Check network policies
kubectl get networkpolicy -A
kubectl describe networkpolicy <policy-name> -n <namespace>

# Test connectivity
kubectl exec <source-pod> -n <source-namespace> -- nc -zv <target-service> <port>

# Monitor Cilium policy drops (if using Cilium)
cilium monitor --type policy-verdict

# Temporarily disable network policies (last resort: non-production only,
# and re-apply the policies immediately after testing)
kubectl delete networkpolicy --all -n <namespace>

Storage Troubleshooting

Persistent Volume Issues

PVC Pending State:

# Check PVC status
kubectl get pvc -A
kubectl describe pvc <pvc-name> -n <namespace>

# Check storage class
kubectl get storageclass
kubectl describe storageclass <storage-class>

# Check available PVs
kubectl get pv

Volume Mount Failures:

# Check pod events
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events

# Check volume permissions
kubectl exec <pod-name> -n <namespace> -- ls -la /mount/path

# Check node disk space
kubectl describe node <node-name> | grep -A 5 Capacity

Security Troubleshooting

Certificate Issues

TLS Certificate Problems:

# Check certificate status
kubectl get certificates -A
kubectl describe certificate <cert-name> -n <namespace>

# Check cert-manager logs
kubectl logs deployment/cert-manager -n cert-manager --tail=100

# Test certificate
openssl s_client -connect <domain>:443 -servername <domain> < /dev/null

RBAC Permission Denied:

# Check current permissions
kubectl auth can-i <verb> <resource> --as=<user> -n <namespace>

# Check role bindings
kubectl get rolebinding,clusterrolebinding -A | grep <user-or-group>

# Describe role
kubectl describe role <role-name> -n <namespace>
kubectl describe clusterrole <clusterrole-name>

Performance Troubleshooting

High Resource Usage

Memory Issues:

# Check memory usage
kubectl top pods -A --sort-by=memory
kubectl top nodes

# Check memory limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits

# Check for memory leaks
kubectl exec <pod-name> -n <namespace> -- ps aux --sort=-%mem

CPU Issues:

# Check CPU usage
kubectl top pods -A --sort-by=cpu

# Check for CPU throttling (cgroup v1 path; on cgroup v2 use /sys/fs/cgroup/cpu.stat)
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu/cpu.stat

# Monitor process CPU usage
kubectl exec <pod-name> -n <namespace> -- top -p <pid>
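The `nr_periods` and `nr_throttled` counters from `cpu.stat` can be combined into a throttling percentage; a minimal sketch:

```shell
#!/bin/sh
# Hypothetical throttling calculation: percentage of scheduler periods in which
# the container hit its CPU quota, from the cpu.stat counters above.
throttle_pct() {
  nr_periods=$1
  nr_throttled=$2
  if [ "$nr_periods" -eq 0 ]; then
    echo 0
  else
    echo $(( nr_throttled * 100 / nr_periods ))
  fi
}

throttle_pct 1000 250   # → 25
```

A sustained value above roughly 25% usually means the CPU limit is too low for the workload.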

Slow Response Times

Application Performance:

# Check application metrics
curl http://<pod-ip>:8080/metrics | grep http_request_duration

# Monitor database queries
kubectl exec -it mssql-0 -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT * FROM sys.dm_exec_query_stats"

# Check network latency
kubectl exec <pod-name> -n <namespace> -- ping <target-service>

Emergency Procedures

Service Outage Response

Immediate Actions:

# Check overall cluster health
kubectl get nodes
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'

# Scale up critical services
kubectl scale deployment <critical-service> --replicas=5 -n <namespace>

# Check ingress controller
kubectl get pods -n ingress-nginx
kubectl logs deployment/ingress-nginx-controller -n ingress-nginx --tail=50

Communication:

# Post status update
curl -X POST $SLACK_WEBHOOK -d '{"text":"🚨 SERVICE OUTAGE: Investigating connectivity issues"}'

# Update status page
# (Update external status page if available)

# Notify stakeholders
# Send email/SMS to key stakeholders

Data Recovery

Database Recovery:

# Stop application to prevent data corruption
kubectl scale deployment nucleus --replicas=0 -n nucleus

# Check backup status
kubectl get cronjob -n database
kubectl get job -l app=database-backup -n database

# Restore from latest backup
kubectl exec -it mssql-0 -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "RESTORE DATABASE [NucleusDB] FROM DISK = '/backup/latest.bak' WITH REPLACE"

# Restart application
kubectl scale deployment nucleus --replicas=2 -n nucleus

Monitoring and Alerting

Alert Triage

Critical Alert Response:

# Check alert details
kubectl get prometheusrule -A
kubectl describe prometheusrule <rule-name> -n <namespace>

# Check AlertManager
kubectl get pods -n monitoring | grep alertmanager
kubectl logs alertmanager-0 -n monitoring

# Silence alerts temporarily
amtool silence add alertname="<alert-name>" --duration=1h --comment="Investigating issue"

Log Analysis

Centralized Logging:

# Search application logs
kubectl logs -l app=nucleus -n nucleus --tail=1000 | grep ERROR

# Export logs for analysis
kubectl logs deployment/nucleus -n nucleus --since=1h > /tmp/nucleus-logs.txt

# Search across namespaces (kubectl logs is namespace-scoped, so iterate)
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  kubectl logs --all-containers=true -l app=nucleus -n "$ns"
done

Preventive Measures

Health Monitoring

Proactive Monitoring:

# Regular health checks
curl -f https://nucleus-staging.devops.africa/health
kubectl get --raw='/readyz?verbose'   # componentstatuses is deprecated

# Resource monitoring
kubectl top nodes
kubectl top pods -A

# Certificate expiry monitoring
kubectl get certificates -A -o custom-columns=NAME:.metadata.name,READY:.status.conditions[0].status,EXPIRY:.status.notAfter
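The `notAfter` timestamp from the command above can feed a days-remaining check; a minimal sketch (requires GNU `date` for the `-d` flag):

```shell
#!/bin/sh
# Hypothetical expiry helper: days remaining until a notAfter timestamp
# as reported by cert-manager.
days_left() {
  expiry=$(date -d "$1" +%s)
  now=$(date +%s)
  echo $(( (expiry - now) / 86400 ))
}

# e.g. flag certificates expiring within 30 days:
# [ "$(days_left "$NOT_AFTER")" -lt 30 ] && echo "renew soon"
```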

Maintenance Windows

Planned Maintenance:

# Drain nodes for maintenance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Update system components
helm upgrade <release> <chart> --values values.yaml

# Uncordon nodes
kubectl uncordon <node-name>

# Verify cluster health
kubectl get nodes
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'

Documentation and Runbooks

Incident Documentation

Post-Incident Review:

  1. Timeline: Detailed timeline of events
  2. Root Cause: Identified root cause analysis
  3. Impact: Assessment of impact and affected services
  4. Resolution: Steps taken to resolve the issue
  5. Prevention: Measures to prevent recurrence
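A review document covering those five sections can be scaffolded with a small helper; the function name and output path are illustrative:

```shell
#!/bin/sh
# Hypothetical scaffold for a post-incident review document.
incident_doc() {
  cat <<EOF
# Incident Report: $1

## Timeline
## Root Cause
## Impact
## Resolution
## Prevention
EOF
}

incident_doc "2024-05-01-nucleus-outage" > /tmp/incident.md
```

Generating the skeleton immediately after resolution makes it more likely the timeline is recorded while details are fresh.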

Runbook Maintenance

Regular Updates:

  • Update procedures based on lessons learned
  • Test runbooks during maintenance windows
  • Keep contact information current
  • Review and update escalation procedures

For specific component troubleshooting, refer to the detailed troubleshooting guide and individual service documentation.