# Backup and Restore

Comprehensive backup and restore strategy for the RCIIS DevOps platform, ensuring data protection and disaster recovery capabilities.

## Overview

The backup strategy covers all critical components, including databases, persistent volumes, configuration data, and secrets.
## Backup Components

### Data Sources

- **Application Databases**: SQL Server data and transaction logs
- **Persistent Volumes**: Application data and file storage
- **Configuration Data**: Kubernetes manifests and configurations
- **Secrets**: Encrypted secrets and certificates
- **Kafka Topics**: Message queue data and offsets

### Backup Types

- **Full Backups**: Complete data snapshots
- **Incremental Backups**: Changed data only
- **Transaction Log Backups**: Database transaction logs
- **Configuration Backups**: Git repository snapshots
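In practice these types are combined: a full backup on a fixed day of the week, incrementals on the others. A small helper like the following (a sketch, not part of the platform tooling; the Sunday-full policy is an assumption) can drive that decision inside a backup script:

```shell
#!/bin/bash
# Decide which backup type to run for a given weekday.
# The weekday is passed as an argument (1 = Monday ... 7 = Sunday,
# matching "date +%u") so the logic is easy to test; callers would
# normally pass "$(date +%u)".
backup_type_for_day() {
  local weekday="$1"
  if [ "$weekday" -eq 7 ]; then
    echo "full"          # weekly full backup on Sundays
  else
    echo "incremental"   # changed data only on other days
  fi
}

# Example: pick the type for today.
backup_type_for_day "$(date +%u)"
```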
## Database Backup

### SQL Server Backup Strategy

```yaml
# Database backup job
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mssql-backup
  namespace: database
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: mcr.microsoft.com/mssql-tools
              command:
                - /bin/bash
                - -c
                - |
                  sqlcmd -S mssql-service -U sa -P "$SA_PASSWORD" \
                    -Q "BACKUP DATABASE [NucleusDB] TO DISK = '/backup/nucleus_$(date +%Y%m%d_%H%M%S).bak'"
              env:
                - name: SA_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: mssql-secret
                      key: password
              volumeMounts:
                - name: backup-storage
                  mountPath: /backup
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-pvc
          restartPolicy: OnFailure
```
### Automated Backup Verification

```bash
#!/bin/bash
# Backup verification script
# Resolve today's newest backup file (the glob must be expanded here,
# not passed literally to sqlcmd).
BACKUP_FILE=$(ls /backup/nucleus_$(date +%Y%m%d)*.bak 2>/dev/null | sort | tail -n 1)
if [ -z "$BACKUP_FILE" ]; then
  echo "No backup file found for today"
  exit 1
fi
VERIFY_RESULT=$(sqlcmd -S mssql-service -U sa -P "$SA_PASSWORD" \
  -Q "RESTORE VERIFYONLY FROM DISK = '$BACKUP_FILE'" -h -1)
if [[ $VERIFY_RESULT == *"is valid"* ]]; then
  echo "Backup verification successful"
  # Send success notification
  curl -X POST "$SLACK_WEBHOOK" -d '{"text":"Database backup verified successfully"}'
else
  echo "Backup verification failed"
  # Send failure alert
  curl -X POST "$SLACK_WEBHOOK" -d '{"text":"⚠️ Database backup verification failed!"}'
  exit 1
fi
```
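Verified backups still need pruning to stay within retention limits. A minimal retention helper might look like the following (a sketch; the directory, file pattern, and retention window are assumptions, not platform defaults):

```shell
#!/bin/bash
# Delete backup files older than a retention window.
# Usage: prune_backups <directory> <days>
prune_backups() {
  local dir="$1"
  local days="$2"
  # -mtime +N matches files last modified more than N*24h ago;
  # -print logs each deletion before -delete removes it.
  find "$dir" -name '*.bak' -type f -mtime +"$days" -print -delete
}
```

For example, `prune_backups /backup 14` would remove `.bak` files older than two weeks. Run it after verification succeeds, so a failed backup run never triggers deletion of the last known-good copy.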
## Persistent Volume Backup

### Volume Snapshot Strategy

```yaml
# Volume snapshot class
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapclass
driver: pd.csi.storage.gke.io
deletionPolicy: Retain
---
# Volume snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: nucleus-data-snapshot
  namespace: nucleus
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: nucleus-data-pvc
```
### Automated Volume Backup

```yaml
# Volume backup cronjob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: volume-backup
  namespace: nucleus
spec:
  schedule: "0 1 * * *"  # Daily at 1 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: velero/velero:latest
              # The $(date ...) substitution requires a shell; plain
              # container args are never shell-expanded. Velero's own
              # Schedule resource is the usual alternative to a CronJob.
              command:
                - /bin/sh
                - -c
                - >
                  velero backup create nucleus-volumes-$(date +%Y%m%d)
                  --include-namespaces=nucleus
                  --storage-location=default
          restartPolicy: OnFailure
```
## Configuration Backup

### Git Repository Backup

```bash
#!/bin/bash
# Repository backup script
REPO_URL="[email protected]:MagnaBC/rciis-devops.git"
BACKUP_DIR="/backup/git"
DATE=$(date +%Y%m%d_%H%M%S)

# Clone repository as a bare mirror (all refs, no working tree)
git clone --mirror "$REPO_URL" "$BACKUP_DIR/rciis-devops-$DATE.git"

# Create archive
tar -czf "$BACKUP_DIR/rciis-devops-$DATE.tar.gz" -C "$BACKUP_DIR" "rciis-devops-$DATE.git"

# Upload to object storage
aws s3 cp "$BACKUP_DIR/rciis-devops-$DATE.tar.gz" s3://backup-bucket/git-backups/

# Cleanup local files
rm -rf "$BACKUP_DIR/rciis-devops-$DATE.git"
rm -f "$BACKUP_DIR/rciis-devops-$DATE.tar.gz"
```
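When restoring, the newest archive usually has to be located first. Because the archive names embed a sortable timestamp (`YYYYmmdd_HHMMSS`), lexical order matches chronological order, so a small helper can select the latest one (a sketch, assuming that naming scheme and a local copy of the archives):

```shell
#!/bin/bash
# Print the newest backup archive in a directory, relying on the
# sortable YYYYmmdd_HHMMSS suffix embedded in each filename.
# Usage: latest_backup <directory> <prefix>
latest_backup() {
  local dir="$1"
  local prefix="$2"
  # Lexical sort is chronological because of the timestamp format.
  ls "$dir"/"$prefix"-*.tar.gz 2>/dev/null | sort | tail -n 1
}
```

For example, `latest_backup /backup/git rciis-devops` would print the most recent mirror archive, ready for `tar -xzf` followed by a fresh `git clone` from the extracted mirror.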
### Kubernetes Configuration Backup

```bash
#!/bin/bash
# Export all Kubernetes resources
BACKUP_DIR="/backup/k8s"
DATE=$(date +%Y%m%d_%H%M%S)

# Create backup directory
mkdir -p "$BACKUP_DIR/$DATE"

# Export common resources in every namespace
for ns in $(kubectl get namespaces -o name | cut -d/ -f2); do
  mkdir -p "$BACKUP_DIR/$DATE/$ns"
  for resource in deployments services configmaps secrets ingresses; do
    kubectl get "$resource" -n "$ns" -o yaml > "$BACKUP_DIR/$DATE/$ns/$resource.yaml"
  done
done

# Export cluster-wide resources
kubectl get clusterroles -o yaml > "$BACKUP_DIR/$DATE/clusterroles.yaml"
kubectl get clusterrolebindings -o yaml > "$BACKUP_DIR/$DATE/clusterrolebindings.yaml"
kubectl get persistentvolumes -o yaml > "$BACKUP_DIR/$DATE/persistentvolumes.yaml"

# Create archive
tar -czf "$BACKUP_DIR/k8s-config-$DATE.tar.gz" -C "$BACKUP_DIR" "$DATE"

# Upload to object storage
aws s3 cp "$BACKUP_DIR/k8s-config-$DATE.tar.gz" s3://backup-bucket/k8s-backups/
```
## Secret Backup

### SOPS Secret Backup

```bash
#!/bin/bash
# Backup encrypted secrets
BACKUP_DIR="/backup/secrets"
DATE=$(date +%Y%m%d_%H%M%S)

# Create backup directory
mkdir -p "$BACKUP_DIR/$DATE"

# Copy all SOPS-encrypted files, preserving directory structure
find apps/rciis/secrets -name "*.yaml" -type f | while read -r file; do
  rel_path="${file#apps/rciis/secrets/}"
  dest_dir="$BACKUP_DIR/$DATE/$(dirname "$rel_path")"
  mkdir -p "$dest_dir"
  cp "$file" "$dest_dir/"
done

# Create archive
tar -czf "$BACKUP_DIR/secrets-$DATE.tar.gz" -C "$BACKUP_DIR" "$DATE"

# Add a second encryption layer before upload (gpg prompts for a
# passphrase; supply one non-interactively for unattended runs)
gpg --cipher-algo AES256 --compress-algo 1 --symmetric \
  --output "$BACKUP_DIR/secrets-$DATE.tar.gz.gpg" "$BACKUP_DIR/secrets-$DATE.tar.gz"

# Remove the plain archive and upload only the encrypted one
rm -f "$BACKUP_DIR/secrets-$DATE.tar.gz"
aws s3 cp "$BACKUP_DIR/secrets-$DATE.tar.gz.gpg" s3://backup-bucket/secret-backups/
```
## Kafka Backup

### Topic Data Backup

```bash
#!/bin/bash
# Kafka topic backup
KAFKA_CLUSTER="kafka-cluster-kafka-bootstrap:9092"
BACKUP_DIR="/backup/kafka"
DATE=$(date +%Y%m%d_%H%M%S)

# Create backup directory
mkdir -p "$BACKUP_DIR/$DATE"

# Get list of topics
kubectl exec kafka-cluster-kafka-0 -n kafka -- \
  bin/kafka-topics.sh --bootstrap-server "$KAFKA_CLUSTER" --list > "$BACKUP_DIR/$DATE/topics.txt"

# Dump each topic (the consumer exits after 30s without new messages)
while read -r topic; do
  kubectl exec kafka-cluster-kafka-0 -n kafka -- \
    bin/kafka-console-consumer.sh \
    --bootstrap-server "$KAFKA_CLUSTER" \
    --topic "$topic" \
    --from-beginning \
    --timeout-ms 30000 > "$BACKUP_DIR/$DATE/${topic}.json" 2>/dev/null
done < "$BACKUP_DIR/$DATE/topics.txt"

# Create archive
tar -czf "$BACKUP_DIR/kafka-topics-$DATE.tar.gz" -C "$BACKUP_DIR" "$DATE"
```
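Kafka's internal topics (names beginning with `__`, such as `__consumer_offsets`) should not be dumped this way; filtering the topic listing first keeps the loop to application data. A small filter could be (a sketch; adjust the exclusion pattern to your cluster's conventions):

```shell
#!/bin/bash
# Drop internal Kafka topics (names starting with "__") from a
# topic listing read on stdin.
# Usage: filter_internal_topics < topics.txt
filter_internal_topics() {
  # "|| true" keeps an all-internal listing from failing the
  # pipeline under "set -e" (grep exits 1 on zero matches).
  grep -v '^__' || true
}
```

Applied between the `--list` step and the dump loop, this would skip `__consumer_offsets` while leaving ordinary topics untouched.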
## Restore Procedures

### Database Restore

```bash
#!/bin/bash
# SQL Server database restore
BACKUP_FILE="$1"
DATABASE_NAME="NucleusDB"

if [ -z "$BACKUP_FILE" ]; then
  echo "Usage: $0 <backup_file>"
  exit 1
fi

# Restore database, overwriting the existing one
sqlcmd -S mssql-service -U sa -P "$SA_PASSWORD" -Q "
RESTORE DATABASE [$DATABASE_NAME]
FROM DISK = '$BACKUP_FILE'
WITH REPLACE,
MOVE '${DATABASE_NAME}_Data' TO '/var/opt/mssql/data/${DATABASE_NAME}.mdf',
MOVE '${DATABASE_NAME}_Log' TO '/var/opt/mssql/data/${DATABASE_NAME}.ldf'
"

# Verify restore
sqlcmd -S mssql-service -U sa -P "$SA_PASSWORD" -Q "
SELECT name, state_desc FROM sys.databases WHERE name = '$DATABASE_NAME'
"
```
### Volume Restore

```yaml
# Restore from volume snapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nucleus-data-restored
  namespace: nucleus
spec:
  storageClassName: standard
  dataSource:
    name: nucleus-data-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```
### Application Restore

```bash
#!/bin/bash
# Application restore procedure
BACKUP_DATE="$1"

if [ -z "$BACKUP_DATE" ]; then
  echo "Usage: $0 <backup_date>"
  exit 1
fi

# Stop application
kubectl scale deployment nucleus --replicas=0 -n nucleus

# Restore database
restore_database.sh "/backup/nucleus_${BACKUP_DATE}.bak"

# Restore persistent volumes
kubectl apply -f "restore-pvc-${BACKUP_DATE}.yaml"

# Point the application at the restored volume
kubectl patch deployment nucleus -n nucleus -p '{"spec":{"template":{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"nucleus-data-restored"}}]}}}}'

# Start application
kubectl scale deployment nucleus --replicas=2 -n nucleus

# Verify restoration
kubectl wait --for=condition=ready pod -l app=nucleus -n nucleus --timeout=300s
```
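Pods that report Ready may still be warming up, so the final verification benefits from retries. A generic retry helper can wrap whatever check the health script performs (a sketch; the commented health endpoint is an assumption, not a documented URL):

```shell
#!/bin/bash
# Retry a command up to N times with a delay between attempts.
# Returns 0 on the first success, 1 if all attempts fail.
# Usage: retry <attempts> <delay_seconds> <command...>
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: poll a hypothetical health endpoint after restore.
# retry 10 5 curl -fsS http://nucleus.nucleus.svc/healthz
```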
## Disaster Recovery

### Recovery Time Objective (RTO)

- **Critical Systems**: 15 minutes
- **Standard Systems**: 1 hour
- **Development Systems**: 4 hours

### Recovery Point Objective (RPO)

- **Database**: 15 minutes (transaction log backups)
- **Configuration**: 24 hours (daily backups)
- **File Storage**: 1 hour (incremental backups)
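The 15-minute database RPO implies transaction log backups on the same cadence. A CronJob along these lines would provide them (a sketch mirroring the nightly full-backup job; the job name and log-file path are assumptions):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mssql-log-backup
  namespace: database
spec:
  schedule: "*/15 * * * *"  # Every 15 minutes, matching the RPO
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: log-backup
              image: mcr.microsoft.com/mssql-tools
              command:
                - /bin/bash
                - -c
                - |
                  sqlcmd -S mssql-service -U sa -P "$SA_PASSWORD" \
                    -Q "BACKUP LOG [NucleusDB] TO DISK = '/backup/nucleus_log_$(date +%Y%m%d_%H%M%S).trn'"
              env:
                - name: SA_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: mssql-secret
                      key: password
              volumeMounts:
                - name: backup-storage
                  mountPath: /backup
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-pvc
          restartPolicy: OnFailure
```

Note that `BACKUP LOG` requires the database to run in the FULL recovery model; log backups also keep the transaction log from growing unbounded.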
### DR Testing

```bash
#!/bin/bash
# Disaster recovery test
echo "Starting DR test at $(date)"

# 1. Simulate failure
kubectl delete namespace nucleus --wait=true

# 2. Restore from backup
restore_application.sh "$(date -d "yesterday" +%Y%m%d)"

# 3. Verify functionality
test_application_health.sh

# 4. Document results
echo "DR test completed at $(date)" >> /var/log/dr-tests.log
```
## Monitoring and Alerting

### Backup Monitoring

```yaml
# Backup monitoring alerts
groups:
  - name: backup.rules
    rules:
      - alert: BackupFailed
        expr: kube_job_status_failed{job_name=~".*backup.*"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Backup job {{ $labels.job_name }} failed"
          description: "Backup job has been failing for more than 5 minutes"
      - alert: BackupNotRun
        expr: time() - kube_job_status_completion_time{job_name=~".*backup.*"} > 86400
        labels:
          severity: warning
        annotations:
          summary: "Backup job {{ $labels.job_name }} has not run in 24 hours"
          description: "Backup jobs are expected to run daily"
```
## Best Practices

### Backup Strategy

- **3-2-1 Rule**: 3 copies, 2 different media, 1 offsite
- **Regular Testing**: Monthly restore tests
- **Encryption**: Encrypt backups in transit and at rest
- **Retention**: Define clear retention policies

### Operational Procedures

- **Documentation**: Maintain current runbooks
- **Training**: Regular DR training exercises
- **Monitoring**: Automated backup verification
- **Compliance**: Meet regulatory requirements

For specific restore procedures, refer to the individual component documentation.