Operations Troubleshooting

Operational troubleshooting procedures for the RCIIS DevOps platform, focusing on deployment, infrastructure, and service issues.

Overview

This guide provides systematic troubleshooting approaches for operational issues, including deployment failures, service disruptions, and infrastructure problems.

Troubleshooting Methodology

Standard Operating Procedure

  1. Assess Impact: Determine severity and affected systems
  2. Gather Information: Collect logs, metrics, and status information
  3. Identify Root Cause: Systematic elimination of potential causes
  4. Implement Fix: Apply corrective measures
  5. Verify Resolution: Confirm issue is resolved
  6. Document: Record findings and preventive measures
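Step 1 (Assess Impact) can be codified as a small helper for consistent triage. The severity labels and thresholds below are illustrative assumptions, not an RCIIS standard:

```shell
#!/bin/sh
# Hypothetical severity classifier: maps the number of affected services and
# whether production is impacted to a severity level. Thresholds are illustrative.
severity() {
  affected=$1   # count of affected services
  prod=$2       # "yes" if production is impacted
  if [ "$prod" = "yes" ] && [ "$affected" -gt 3 ]; then
    echo "SEV1"
  elif [ "$prod" = "yes" ]; then
    echo "SEV2"
  elif [ "$affected" -gt 3 ]; then
    echo "SEV3"
  else
    echo "SEV4"
  fi
}

severity 5 yes   # → SEV1
severity 1 no    # → SEV4
```

A consistent classifier makes the later escalation decision mechanical rather than judgment-based under pressure.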

Escalation Process

  • Level 1: Automated alerts and monitoring
  • Level 2: On-call engineer response
  • Level 3: Team lead and subject matter experts
  • Level 4: Management and external vendors
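The mapping from alert severity to escalation level can be sketched as a lookup; the severity names (`info`, `warning`, `critical`, `disaster`) are assumptions for illustration:

```shell
#!/bin/sh
# Hypothetical mapping from alert severity to the escalation levels above.
escalation_target() {
  case "$1" in
    info)     echo "Level 1: automated alerts and monitoring" ;;
    warning)  echo "Level 2: on-call engineer" ;;
    critical) echo "Level 3: team lead and subject matter experts" ;;
    disaster) echo "Level 4: management and external vendors" ;;
    *)        echo "unknown severity: $1" ;;
  esac
}

escalation_target critical   # → Level 3: team lead and subject matter experts
```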

Deployment Troubleshooting

ArgoCD Sync Failures

Symptoms:

  • Applications stuck in "OutOfSync" state
  • Sync operations failing or timing out
  • Resource conflicts preventing deployment

Diagnosis:

# Check application status
argocd app list | grep -v Synced
argocd app get <app-name>

# View sync history
argocd app history <app-name>

# Check resource differences
argocd app diff <app-name>

# Check ArgoCD logs
kubectl logs -n argocd deployment/argocd-application-controller --tail=100

Common Solutions:

  1. Resource Conflicts:

    # Force sync with pruning
    argocd app sync <app-name> --force --prune
    
    # Delete conflicting resources
    kubectl delete <resource-type> <resource-name> -n <namespace> --wait=true
    
    # Refresh and retry
    argocd app refresh <app-name>
    argocd app sync <app-name>
    

  2. Permission Issues:

    # Check service account permissions
    kubectl describe clusterrolebinding argocd-application-controller
    
    # Verify namespace access
    kubectl auth can-i "*" "*" --as=system:serviceaccount:argocd:argocd-application-controller -n <namespace>
    

  3. Repository Access:

    # Check repository connection
    argocd repo list
    argocd repo get <repo-url>
    
    # Test SSH key
    ssh -T [email protected]
    
    # Update repository credentials
    kubectl patch secret <repo-secret> -n argocd -p '{"data":{"sshPrivateKey":"<base64-encoded-key>"}}'
    

Helm Deployment Issues

Chart Installation Failures:

# Check Helm release status
helm list -A --all
helm status <release-name> -n <namespace>

# View release history
helm history <release-name> -n <namespace>

# Check for conflicts (server-side dry run without installing)
helm install <release-name> <chart> --values values.yaml --dry-run

# Debug template rendering
helm template <release-name> <chart> --values values.yaml --debug

Resolution Steps:

# Rollback failed release
helm rollback <release-name> <revision> -n <namespace>

# Uninstall and reinstall
helm uninstall <release-name> -n <namespace>
helm install <release-name> <chart> -n <namespace> --values values.yaml

# Force upgrade
helm upgrade <release-name> <chart> -n <namespace> --values values.yaml --force

Kustomize Build Failures

SOPS/KSOPS Issues:

# Test Kustomize build
kustomize build --enable-alpha-plugins --enable-exec <path>

# Check the KSOPS plugin binary exists on the expected plugin path
which ksops
ls -l "${XDG_CONFIG_HOME:-$HOME/.config}/kustomize/plugin/viaduct.ai/v1/ksops/ksops"

# Verify Age key
echo $SOPS_AGE_KEY_FILE
sops --decrypt <secret-file>

# Test decryption manually
export SOPS_AGE_KEY_FILE=~/.age/key.txt
sops --decrypt apps/rciis/secrets/staging/nucleus/appsettings.yaml

Service Troubleshooting

Pod Startup Issues

CrashLoopBackOff:

# Check pod status and restarts
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

# Check logs from previous container
kubectl logs <pod-name> -n <namespace> --previous

# Check resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits

# Check liveness/readiness probes
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Liveness

Resolution Strategies:

# Increase resource limits
kubectl patch deployment <deployment> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"1Gi","cpu":"500m"}}}]}}}}'

# Adjust probe timings
kubectl patch deployment <deployment> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","livenessProbe":{"initialDelaySeconds":60}}]}}}}'

# Debug with an ephemeral container (leaves the app image untouched)
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container>

Database Connection Issues

SQL Server Connectivity:

# Check SQL Server pod status
kubectl get pods -l app=mssql -n database

# Test connectivity from application pod
kubectl exec deployment/nucleus -n nucleus -- telnet mssql-service 1433

# Check connection string
kubectl get secret nucleus-database -n nucleus -o jsonpath='{.data.connection-string}' | base64 -d

# Test database query
kubectl exec -it mssql-0 -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT 1"

Common Database Issues:

# Database not ready
kubectl logs mssql-0 -n database | grep "SQL Server is now ready"

# Connection pool exhaustion
kubectl exec deployment/nucleus -n nucleus -- netstat -an | grep 1433 | wc -l

# Lock issues
kubectl exec -it mssql-0 -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT * FROM sys.dm_tran_locks"
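The connection-count check above can be turned into a simple threshold alert. The pool limit of 100 is an assumption; match it to the `Max Pool Size` in your connection string:

```shell
#!/bin/sh
# Hypothetical pool-exhaustion check: warn when established connections to the
# database reach 80% of an assumed pool limit.
check_pool() {
  count=$1
  limit=${2:-100}   # assumed Max Pool Size
  if [ "$count" -ge $(( limit * 80 / 100 )) ]; then
    echo "WARN: $count/$limit connections in use"
  else
    echo "OK: $count/$limit connections in use"
  fi
}

# Feed it the netstat count from the command above, e.g.:
# check_pool "$(kubectl exec deployment/nucleus -n nucleus -- netstat -an | grep 1433 | wc -l)"
check_pool 85   # → WARN: 85/100 connections in use
```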

Message Queue Issues

Kafka Connectivity Problems:

# Check Kafka cluster status
kubectl get kafka kafka-cluster -n kafka

# Check broker pods
kubectl get pods -l strimzi.io/cluster=kafka-cluster -n kafka

# Test producer
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test-topic

# Check consumer lag
kubectl exec kafka-cluster-kafka-0 -n kafka -- bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>

Kafka Troubleshooting:

# Check topic status
kubectl get kafkatopic -n kafka
kubectl describe kafkatopic <topic-name> -n kafka

# Check user permissions
kubectl get kafkauser -n kafka
kubectl describe kafkauser <user-name> -n kafka

# View broker logs
kubectl logs kafka-cluster-kafka-0 -n kafka | tail -100

Network Troubleshooting

Ingress and Load Balancer Issues

Service Unavailable (503) Errors:

# Check ingress controller status
kubectl get pods -n ingress-nginx
kubectl logs deployment/ingress-nginx-controller -n ingress-nginx --tail=100

# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Check backend pod health
kubectl get pods -l app=<app-label> -n <namespace>
kubectl exec <pod-name> -n <namespace> -- curl localhost:8080/health

DNS Resolution Issues:

# Test DNS from pod
kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>.<namespace>.svc.cluster.local

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs deployment/coredns -n kube-system

# Test external DNS
kubectl exec -it <pod-name> -n <namespace> -- nslookup google.com

Network Policy Debugging

Connection Blocked by Policy:

# Check network policies
kubectl get networkpolicy -A
kubectl describe networkpolicy <policy-name> -n <namespace>

# Test connectivity
kubectl exec <source-pod> -n <source-namespace> -- nc -zv <target-service> <port>

# Monitor Cilium policy drops (if using Cilium)
cilium monitor --type policy-verdict

# Temporarily disable network policies (last resort: non-production only,
# and re-apply the policies immediately after testing)
kubectl delete networkpolicy --all -n <namespace>

Storage Troubleshooting

Persistent Volume Issues

PVC Pending State:

# Check PVC status
kubectl get pvc -A
kubectl describe pvc <pvc-name> -n <namespace>

# Check storage class
kubectl get storageclass
kubectl describe storageclass <storage-class>

# Check available PVs
kubectl get pv

Volume Mount Failures:

# Check pod events
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events

# Check volume permissions
kubectl exec <pod-name> -n <namespace> -- ls -la /mount/path

# Check node disk space
kubectl describe node <node-name> | grep -A 5 Capacity

Security Troubleshooting

Certificate Issues

TLS Certificate Problems:

# Check certificate status
kubectl get certificates -A
kubectl describe certificate <cert-name> -n <namespace>

# Check cert-manager logs
kubectl logs deployment/cert-manager -n cert-manager --tail=100

# Test certificate
openssl s_client -connect <domain>:443 -servername <domain> < /dev/null

RBAC Permission Denied:

# Check current permissions
kubectl auth can-i <verb> <resource> --as=<user> -n <namespace>

# Check role bindings
kubectl get rolebinding,clusterrolebinding -A | grep <user-or-group>

# Describe role
kubectl describe role <role-name> -n <namespace>
kubectl describe clusterrole <clusterrole-name>

Performance Troubleshooting

High Resource Usage

Memory Issues:

# Check memory usage
kubectl top pods -A --sort-by=memory
kubectl top nodes

# Check memory limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits

# Check for memory leaks
kubectl exec <pod-name> -n <namespace> -- ps aux --sort=-%mem

CPU Issues:

# Check CPU usage
kubectl top pods -A --sort-by=cpu

# Check for CPU throttling (cgroup v1 path; on cgroup v2 use /sys/fs/cgroup/cpu.stat)
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu/cpu.stat

# Monitor process CPU usage
kubectl exec <pod-name> -n <namespace> -- top -p <pid>
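The `nr_periods` and `nr_throttled` counters from `cpu.stat` can be combined into a throttling percentage; a minimal sketch:

```shell
#!/bin/sh
# Hypothetical throttling calculation: percentage of scheduler periods in which
# the container hit its CPU quota, from the cpu.stat counters above.
throttle_pct() {
  nr_periods=$1
  nr_throttled=$2
  if [ "$nr_periods" -eq 0 ]; then
    echo 0
  else
    echo $(( nr_throttled * 100 / nr_periods ))
  fi
}

throttle_pct 1000 250   # → 25
```

A sustained value above roughly 25% usually means the CPU limit is too low for the workload.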

Slow Response Times

Application Performance:

# Check application metrics
curl http://<pod-ip>:8080/metrics | grep http_request_duration

# Monitor database queries
kubectl exec -it mssql-0 -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "SELECT * FROM sys.dm_exec_query_stats"

# Check network latency
kubectl exec <pod-name> -n <namespace> -- ping <target-service>

Emergency Procedures

Service Outage Response

Immediate Actions:

# Check overall cluster health
kubectl get nodes
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'

# Scale up critical services
kubectl scale deployment <critical-service> --replicas=5 -n <namespace>

# Check ingress controller
kubectl get pods -n ingress-nginx
kubectl logs deployment/ingress-nginx-controller -n ingress-nginx --tail=50

Communication:

# Post status update
curl -X POST $SLACK_WEBHOOK -d '{"text":"🚨 SERVICE OUTAGE: Investigating connectivity issues"}'

# Update status page
# (Update external status page if available)

# Notify stakeholders
# Send email/SMS to key stakeholders

Data Recovery

Database Recovery:

# Stop application to prevent data corruption
kubectl scale deployment nucleus --replicas=0 -n nucleus

# Check backup status
kubectl get cronjob -n database
kubectl get job -l app=database-backup -n database

# Restore from latest backup
kubectl exec -it mssql-0 -n database -- /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P <password> -Q "RESTORE DATABASE [NucleusDB] FROM DISK = '/backup/latest.bak' WITH REPLACE"

# Restart application
kubectl scale deployment nucleus --replicas=2 -n nucleus

Monitoring and Alerting

Alert Triage

Critical Alert Response:

# Check alert details
kubectl get prometheusrule -A
kubectl describe prometheusrule <rule-name> -n <namespace>

# Check AlertManager
kubectl get pods -n monitoring | grep alertmanager
kubectl logs alertmanager-0 -n monitoring

# Silence alerts temporarily
amtool silence add alertname="<alert-name>" --duration=1h --comment="Investigating issue"

Log Analysis

Centralized Logging:

# Search application logs
kubectl logs -l app=nucleus -n nucleus --tail=1000 | grep ERROR

# Export logs for analysis
kubectl logs deployment/nucleus -n nucleus --since=1h > /tmp/nucleus-logs.txt

# Search across namespaces (kubectl logs is namespace-scoped, so iterate)
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  kubectl logs --all-containers=true -l app=nucleus -n "$ns"
done

Preventive Measures

Health Monitoring

Proactive Monitoring:

# Regular health checks
curl -f https://nucleus-staging.devops.africa/health
kubectl get --raw='/readyz?verbose'   # componentstatuses is deprecated

# Resource monitoring
kubectl top nodes
kubectl top pods -A

# Certificate expiry monitoring
kubectl get certificates -A -o custom-columns=NAME:.metadata.name,READY:.status.conditions[0].status,EXPIRY:.status.notAfter
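The `notAfter` timestamp from the command above can feed a days-remaining check; a minimal sketch (requires GNU `date` for the `-d` flag):

```shell
#!/bin/sh
# Hypothetical expiry helper: days remaining until a notAfter timestamp
# as reported by cert-manager.
days_left() {
  expiry=$(date -d "$1" +%s)
  now=$(date +%s)
  echo $(( (expiry - now) / 86400 ))
}

# e.g. flag certificates expiring within 30 days:
# [ "$(days_left "$NOT_AFTER")" -lt 30 ] && echo "renew soon"
```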

Maintenance Windows

Planned Maintenance:

# Drain nodes for maintenance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Update system components
helm upgrade <release> <chart> --values values.yaml

# Uncordon nodes
kubectl uncordon <node-name>

# Verify cluster health
kubectl get nodes
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'

Documentation and Runbooks

Incident Documentation

Post-Incident Review:

  1. Timeline: Detailed timeline of events
  2. Root Cause: Identified root cause analysis
  3. Impact: Assessment of impact and affected services
  4. Resolution: Steps taken to resolve the issue
  5. Prevention: Measures to prevent recurrence
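A review document covering those five sections can be scaffolded with a small helper; the function name and output path are illustrative:

```shell
#!/bin/sh
# Hypothetical scaffold for a post-incident review document.
incident_doc() {
  cat <<EOF
# Incident Report: $1

## Timeline
## Root Cause
## Impact
## Resolution
## Prevention
EOF
}

incident_doc "2024-05-01-nucleus-outage" > /tmp/incident.md
```

Generating the skeleton immediately after resolution makes it more likely the timeline is recorded while details are fresh.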

Runbook Maintenance

Regular Updates:

  • Update procedures based on lessons learned
  • Test runbooks during maintenance windows
  • Keep contact information current
  • Review and update escalation procedures

For specific component troubleshooting, refer to the detailed troubleshooting guide and individual service documentation.