Infrastructure Monitoring¶
Monitoring infrastructure components and cluster health for the RCIIS DevOps platform.
Overview¶
Infrastructure monitoring provides visibility into cluster health, resource utilization, and system performance across all environments.
Monitoring Stack¶
Core Components¶
Prometheus Stack:

- Prometheus: Metrics collection and storage
- Grafana: Visualization and dashboards
- AlertManager: Alert routing and notification
- Node Exporter: System metrics collection
- kube-state-metrics: Kubernetes object metrics
Additional Tools:

- Cilium Hubble: Network observability
- Loki: Log aggregation
- Fluent-bit: Log shipping and processing
Deployment Configuration¶
Prometheus Operator:
# Prometheus operator installation
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus-operator
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc   # in-cluster destination (assumed)
    namespace: monitoring
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: "45.0.0"
    helm:
      values: |
        prometheus:
          prometheusSpec:
            retention: 15d
            storageSpec:
              volumeClaimTemplate:
                spec:
                  storageClassName: standard
                  accessModes: ["ReadWriteOnce"]
                  resources:
                    requests:
                      storage: 50Gi
        grafana:
          persistence:
            enabled: true
            storageClassName: standard
            size: 10Gi
          adminPassword: admin123
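The logging tools listed under Additional Tools can be deployed the same way. Below is a minimal sketch using the loki-stack chart from the grafana Helm repository, which bundles Loki with an optional Fluent-bit shipper; the chart version and values shown are illustrative, not the platform's actual configuration.

# Loki + Fluent-bit installation (illustrative sketch)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: loki-stack
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  source:
    repoURL: https://grafana.github.io/helm-charts
    chart: loki-stack
    targetRevision: "2.9.0"    # illustrative chart version
    helm:
      values: |
        loki:
          persistence:
            enabled: true
            size: 20Gi
        fluent-bit:
          enabled: true        # ship container logs to Loki
        promtail:
          enabled: false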
Cluster Metrics¶
Node Monitoring¶
Node Resource Metrics:

- CPU utilization and load average
- Memory usage and available memory
- Disk usage and I/O statistics
- Network traffic and error rates
- Temperature and hardware health
Key Metrics:
# Node CPU utilization
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Node memory utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Node disk utilization
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
# Node load average
node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"})
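When the same expressions back several dashboards and alerts, they can be precomputed as recording rules. A sketch in PrometheusRule form; the rule names are illustrative, the expressions are the ones shown above:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-recording-rules   # illustrative name
  namespace: monitoring
spec:
  groups:
    - name: node.recording
      rules:
        - record: instance:node_cpu_utilisation:percent
          expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
        - record: instance:node_memory_utilisation:percent
          expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100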
Kubernetes Metrics¶
Cluster Health Metrics:
# Pod restart rate
rate(kube_pod_container_status_restarts_total[5m])
# Pod memory usage
container_memory_usage_bytes{name!=""}
# Pod CPU usage
rate(container_cpu_usage_seconds_total{name!=""}[5m])
# Cluster node status
kube_node_status_condition{condition="Ready",status="true"}
Resource Utilization:
# Namespace CPU requests vs limits
sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
sum by (namespace) (kube_pod_container_resource_limits{resource="cpu"})
# Namespace memory requests vs limits
sum by (namespace) (kube_pod_container_resource_requests{resource="memory"})
sum by (namespace) (kube_pod_container_resource_limits{resource="memory"})
# Persistent volume usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes
Application Metrics¶
Service Monitoring¶
ServiceMonitor Configuration:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nucleus-metrics
  namespace: nucleus
  labels:
    app: nucleus
spec:
  selector:
    matchLabels:
      app: nucleus
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      honorLabels: true
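The ServiceMonitor selects Services by label and scrapes the endpoint port by name, so the backing Service needs the app: nucleus label and a port named metrics. A sketch; the Service name and port number are illustrative:

apiVersion: v1
kind: Service
metadata:
  name: nucleus-api            # illustrative name
  namespace: nucleus
  labels:
    app: nucleus               # matched by the ServiceMonitor selector
spec:
  selector:
    app: nucleus
  ports:
    - name: metrics            # must match the ServiceMonitor endpoint port
      port: 8080               # wherever the application serves /metrics
      targetPort: 8080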
Application Metrics:
# HTTP request rate
rate(http_requests_total[5m])
# HTTP request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# HTTP error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Database connection pool
database_connections_active / database_connections_max
Custom Metrics¶
.NET Application Metrics:
// Custom metrics in .NET applications (prometheus-net style API)
using Prometheus;

public static class CustomMetrics
{
    public static readonly Counter ProcessedDeclarations = Metrics
        .CreateCounter("nucleus_declarations_processed_total",
            "Total number of processed declarations");

    public static readonly Histogram ProcessingDuration = Metrics
        .CreateHistogram("nucleus_processing_duration_seconds",
            "Declaration processing duration");

    public static readonly Gauge ActiveConnections = Metrics
        .CreateGauge("nucleus_active_connections",
            "Number of active database connections");
}
Infrastructure Alerts¶
Critical Alerts¶
Node Health Alerts:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts
  namespace: monitoring
spec:
  groups:
    - name: node.rules
      rules:
        - alert: NodeDown
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.instance }} is down"
            description: "Node has been down for more than 5 minutes"
        - alert: NodeHighCPU
          expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on {{ $labels.instance }}"
            description: "CPU usage is {{ $value }}%"
        - alert: NodeHighMemory
          expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage on {{ $labels.instance }}"
            description: "Memory usage is {{ $value }}%"
        - alert: NodeDiskSpaceLow
          expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Low disk space on {{ $labels.instance }}"
            description: "Disk usage is {{ $value }}%"
Kubernetes Alerts (added as a further group in the same PrometheusRule):
- name: kubernetes.rules
  rules:
    - alert: PodCrashLooping
      expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
        description: "Pod has restarted {{ $value }} times in the last 15 minutes"
    - alert: PodNotReady
      expr: kube_pod_status_ready{condition="false"} == 1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} not ready"
        description: "Pod has been in a not-ready state for more than 10 minutes"
    - alert: DeploymentReplicasMismatch
      expr: kube_deployment_spec_replicas != kube_deployment_status_ready_replicas
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Deployment {{ $labels.deployment }} has mismatched replicas"
        description: "Desired and ready replica counts for {{ $labels.deployment }} have differed for more than 10 minutes"
Network Monitoring¶
Cilium Hubble¶
Network Observability:
# Enable Hubble UI
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true

# Access Hubble UI
kubectl port-forward -n kube-system svc/hubble-ui 12000:80
Network Flow Monitoring:
# Monitor network flows
hubble observe --follow
# Monitor specific namespace
hubble observe --namespace nucleus
# Monitor denied traffic
hubble observe --verdict DENIED
# Monitor by pod
hubble observe --from-pod nucleus/nucleus-api-12345
Network Metrics¶
Cilium Metrics:
# Network policy drops
rate(cilium_policy_verdict_total{verdict="denied"}[5m])
# Network latency
histogram_quantile(0.95, rate(cilium_network_latency_seconds_bucket[5m]))
# Connection rate
rate(cilium_connections_total[5m])
# Bandwidth utilization
rate(cilium_bytes_total[5m])
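These queries can also back alerts, in the same PrometheusRule group format used elsewhere in this page. A sketch built around the policy-drop expression above; the group name and threshold are illustrative:

- name: cilium.rules
  rules:
    - alert: NetworkPolicyDropsHigh
      # reuses the policy-drop rate shown above; threshold is illustrative
      expr: rate(cilium_policy_verdict_total{verdict="denied"}[5m]) > 1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Elevated network policy drops on {{ $labels.instance }}"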
Storage Monitoring¶
Persistent Volume Metrics¶
Storage Health:
# PV utilization
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100
# PV availability
kube_persistentvolume_status_phase{phase="Available"}
# PVC pending
kube_persistentvolumeclaim_status_phase{phase="Pending"}
# Storage I/O operations
rate(node_disk_io_time_seconds_total[5m])
Storage Alerts (another group in the same PrometheusRule):
- name: storage.rules
  rules:
    - alert: PersistentVolumeUsageHigh
      expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High PV usage on {{ $labels.persistentvolumeclaim }}"
        description: "PV usage is {{ $value }}%"
    - alert: PersistentVolumeClaimPending
      expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "PVC {{ $labels.persistentvolumeclaim }} is pending"
        description: "PVC has been pending for more than 10 minutes"
Grafana Dashboards¶
Infrastructure Dashboard¶
Node Overview Dashboard:
{
  "dashboard": {
    "title": "Infrastructure Overview",
    "panels": [
      {
        "title": "Node CPU Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ]
      },
      {
        "title": "Node Memory Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
          }
        ]
      },
      {
        "title": "Pod Status",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum by (phase) (kube_pod_status_phase)"
          }
        ]
      }
    ]
  }
}
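How dashboard JSON like this is loaded depends on the Grafana setup; with the kube-prometheus-stack Grafana sidecar it can be provisioned from a labeled ConfigMap. A sketch, with an illustrative name and the commonly used sidecar label:

apiVersion: v1
kind: ConfigMap
metadata:
  name: infrastructure-overview-dashboard   # illustrative name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"                  # label the Grafana sidecar watches for
data:
  infrastructure-overview.json: |
    { "title": "Infrastructure Overview", "panels": [] }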
Cluster Health Dashboard¶
Kubernetes Overview:

- Cluster resource utilization
- Pod distribution across nodes
- Namespace resource usage
- Failed pods and restarts
- Network policy violations
Alert Management¶
AlertManager Configuration¶
Alert Routing:
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: '[email protected]'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://webhook.site/#!/...'
  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts-critical'
        title: 'Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: 'warning-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts-warning'
        title: 'Warning Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
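Where this configuration lives depends on how Alertmanager is deployed; with the kube-prometheus-stack chart used above it is typically supplied through the chart's alertmanager.config values key. A trimmed sketch:

# kube-prometheus-stack Helm values (trimmed sketch)
alertmanager:
  config:
    route:
      group_by: ['alertname']
      receiver: 'web.hook'
      routes:
        - match:
            severity: critical
          receiver: 'critical-alerts'
    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://webhook.site/#!/...'
      - name: 'critical-alerts'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/...'
            channel: '#alerts-critical'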
Notification Channels¶
Slack Integration:
# Slack notification configuration
- name: 'slack-alerts'
  slack_configs:
    - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
      channel: '#monitoring'
      username: 'AlertManager'
      icon_emoji: ':warning:'
      title: 'Alert: {{ .GroupLabels.alertname }}'
      text: |
        {{ range .Alerts }}
        *Summary:* {{ .Annotations.summary }}
        *Description:* {{ .Annotations.description }}
        *Severity:* {{ .Labels.severity }}
        {{ end }}
Performance Monitoring¶
Resource Optimization¶
Resource Usage Analysis:
# Resource efficiency by namespace
(sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))) /
(sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"}))
# Memory efficiency by namespace
(sum by (namespace) (container_memory_usage_bytes)) /
(sum by (namespace) (kube_pod_container_resource_requests{resource="memory"}))
Capacity Planning:
# Cluster CPU capacity
sum(kube_node_status_capacity{resource="cpu"})
# Cluster memory capacity
sum(kube_node_status_capacity{resource="memory"})
# Utilization trends
predict_linear(node_memory_MemAvailable_bytes[1h], 24*3600)
Troubleshooting¶
Common Monitoring Issues¶
Metrics Not Appearing:
# Check ServiceMonitor
kubectl get servicemonitor -A
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Visit http://localhost:9090/targets
# Check service labels
kubectl get svc -l app=nucleus --show-labels
High Cardinality Issues:
# Count distinct metric names (rough cardinality check)
curl http://prometheus:9090/api/v1/label/__name__/values | jq '.data[]' | wc -l

# Identify high cardinality metrics
promtool query instant http://prometheus:9090 'topk(10, count by (__name__)({__name__=~".+"}))'
Diagnostic Commands¶
# Check Prometheus configuration
kubectl get prometheus -n monitoring -o yaml

# Check AlertManager status
kubectl get alertmanager -A

# Check Grafana pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana

# Test metric endpoint
kubectl exec -it <pod-name> -- curl http://localhost:8080/metrics
For advanced monitoring configurations and best practices, refer to the Prometheus and Grafana documentation.