Operations Runbook
Table of Contents
- Overview
- Cluster Bootstrap
- Day-to-Day Operations
- Application Management
- Secret Management
- Monitoring & Alerting
- Troubleshooting
- Disaster Recovery
- Maintenance Procedures
Overview
This runbook provides operational procedures for maintaining the Kubernetes cluster and managing applications. It's intended for platform engineers and operators with full cluster access.
Operator Prerequisites
- ✅ Full kubectl access to cluster
- ✅ Write access to all Git repositories
- ✅ ArgoCD UI access
- ✅ Slack notifications configured
- ✅ Understanding of Kubernetes concepts
Cluster Bootstrap
Initial Cluster Setup
Bootstrap a new cluster from scratch:
Prerequisites
- Kubernetes cluster running (UpCloud or any K8s cluster)
- kubectl configured with admin access
- Repositories cloned locally
# Verify cluster access
kubectl cluster-info
kubectl get nodes
Bootstrap Procedure
# 1. Clone config repository
git clone https://github.com/snothub/sturdy-adventure.git
cd sturdy-adventure
# 2. Set cluster name (optional)
export CLUSTER_NAME="prod-cluster-01"
# 3. Run bootstrap script
./bootstrap.sh
What Happens:
- ✅ Installs ArgoCD via Helm
- ✅ Configures ArgoCD with custom values
- ✅ Applies root App-of-Apps manifest
- ✅ ArgoCD automatically syncs all applications
- ✅ Infrastructure and apps deploy in waves
Verify Bootstrap
# Wait for ArgoCD to be ready
kubectl wait --for=condition=available --timeout=300s \
deployment/argocd-server -n argocd
# Check ArgoCD applications
kubectl get applications -n argocd
# Expected output: infrastructure-apps, enterprise-apps, and all child apps
Post-Bootstrap Steps
- Configure DNS for ingress domains:
  - argocd.127.0.0.1.nip.io (local dev)
  - *.forteapps.net (production)
- Verify Let's Encrypt certificates:
  kubectl get certificate --all-namespaces
  kubectl get clusterissuer
- Check Kyverno policies:
  kubectl get clusterpolicy
- Verify monitoring stack:
  kubectl get pods -n monitoring
- Test Slack notifications by triggering a sync
Day-to-Day Operations
Monitoring ArgoCD Sync Status
Via Slack
All applications send notifications to a shared Slack channel:
- ✅ on-sync-succeeded - Deployment succeeded
- ❌ on-sync-failed - Deployment failed
- ⚠️ on-degraded - Application unhealthy
Via CLI
# List all applications
kubectl get applications -n argocd
# Watch application status
kubectl get applications -n argocd -w
# Get detailed status
kubectl describe application myapp -n argocd
Via ArgoCD UI
# Port forward to UI
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Access: https://localhost:8080
# No login required (insecure mode for internal use)
Checking Application Health
# Quick health check for all apps
kubectl get applications -n argocd \
-o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status
# Expected output:
# NAME SYNC HEALTH
# infrastructure-apps Synced Healthy
# enterprise-apps Synced Healthy
# mcp10x Synced Healthy
# musicman Synced Healthy
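The table above can be filtered so that only problem applications are listed. The following is a minimal sketch, not part of the existing tooling: the function name unhealthy_apps is hypothetical, and it assumes the three-column NAME / SYNC / HEALTH output produced by the command above.

```shell
# Hypothetical helper: print the name of every application that is not
# both Synced and Healthy. Reads the NAME SYNC HEALTH table on stdin and
# skips the header row.
unhealthy_apps() {
  awk 'NR > 1 && ($2 != "Synced" || $3 != "Healthy") { print $1 }'
}

# Usage against a live cluster:
#   kubectl get applications -n argocd \
#     -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status \
#     | unhealthy_apps
```

An empty result means every listed application is Synced and Healthy.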
Manual Sync
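Force ArgoCD to re-evaluate and sync an application:
# Trigger a hard refresh (re-reads Git and bypasses the manifest cache;
# with auto-sync enabled, any detected changes are then synced)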
kubectl patch application myapp -n argocd \
--type merge \
-p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
# Or via ArgoCD CLI (if installed)
argocd app sync myapp
Pausing Auto-Sync
Temporarily disable automatic syncing:
# Edit application
kubectl edit application myapp -n argocd
# Set automated to null
spec:
  syncPolicy:
    automated: null # Disable auto-sync
# Re-enable later
spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
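The same change can be made without an interactive editor. This is a sketch under assumptions: the function names pause_patch and resume_patch are hypothetical, and the patch bodies simply mirror the two YAML fragments above as JSON merge patches (in a merge patch, null removes the key).

```shell
# Hypothetical helpers that print JSON merge-patch bodies equivalent to
# the two syncPolicy fragments shown above.
pause_patch() {
  # null removes spec.syncPolicy.automated, which disables auto-sync
  printf '%s' '{"spec":{"syncPolicy":{"automated":null}}}'
}

resume_patch() {
  # restores automated sync with prune and selfHeal enabled
  printf '%s' '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
}

# Usage against a live cluster:
#   kubectl patch application myapp -n argocd --type merge -p "$(pause_patch)"
#   kubectl patch application myapp -n argocd --type merge -p "$(resume_patch)"
```

If the parent App-of-Apps has selfHeal enabled, it may restore the original syncPolicy from Git, in which case the change is better made in apps/myapp.yaml.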
Application Management
Deploying a New Application
See Developer Guide for detailed steps.
Quick checklist:
- Create helm-values/myapp/values.yaml
- Create apps/myapp.yaml in config repo
- Create SealedSecret if needed
- Commit and push changes
- Verify sync in Slack/ArgoCD
- Configure DNS for domain
- Test application accessibility
Removing an Application
Safe Removal Procedure
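# 1. Delete ArgoCD Application (with cascade)
# Cascade only happens if the Application carries the
# resources-finalizer.argocd.argoproj.io finalizer.
kubectl delete application myapp -n argocd
# With the finalizer present, this will:
# - Remove application from ArgoCD
# - Delete all Kubernetes resources managed by the application (cascade)
# - Remove the namespace, if the application manages it
# Complete step 2 promptly: until apps/myapp.yaml is removed from Git,
# the root App-of-Apps may recreate the Application.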
# 2. Clean up Git repositories
cd ~/dev/k8s/launchpad
git rm apps/myapp.yaml
git commit -m "Remove myapp application"
git push
cd ~/dev/k8s/helm-prod-values
git rm -r myapp/
git commit -m "Remove myapp values"
git push
# 3. Remove sealed secrets (if any)
cd ~/dev/k8s/launchpad
git rm secrets/myapp-credentials-sealed.yaml
git commit -m "Remove myapp secrets"
git push
Removal Without Cascade
To remove from ArgoCD but keep resources running:
# Delete application with no cascade
kubectl patch application myapp -n argocd \
-p '{"metadata":{"finalizers":[]}}' --type merge
kubectl delete application myapp -n argocd
# Resources remain in cluster but are no longer managed
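Whether a delete cascades depends on the finalizers currently set on the Application, so it can help to check them before choosing a removal path. A minimal sketch, assuming the standard ArgoCD finalizer name resources-finalizer.argocd.argoproj.io; the function name will_cascade is hypothetical.

```shell
# Hypothetical helper: succeed if the given finalizer list (as printed by
# kubectl's jsonpath output) contains the ArgoCD resources finalizer,
# meaning that deleting the Application also deletes its resources.
will_cascade() {
  case "$1" in
    *resources-finalizer.argocd.argoproj.io*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage against a live cluster:
#   finalizers=$(kubectl get application myapp -n argocd \
#     -o jsonpath='{.metadata.finalizers}')
#   if will_cascade "$finalizers"; then
#     echo "delete will remove the application's resources"
#   else
#     echo "delete will leave resources running"
#   fi
```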
Scaling Applications
Manual Scaling
# Scale deployment directly
kubectl scale deployment myapp -n myapp --replicas=3
# Note: If selfHeal is enabled, this will be reverted
GitOps Scaling
Update helm-values/myapp/values.yaml:
app:
  replicaCount: 3 # Change from 1 to 3
Commit and push - ArgoCD will sync.
Auto-Scaling (HPA)
Enable Horizontal Pod Autoscaler:
# In helm-values/myapp/values.yaml
app:
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
Note: If using HPA, remove the /spec/replicas entry from the ArgoCD ignoreDifferences list:
# In apps/myapp.yaml
ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /spec/replicas # Remove this line
Rolling Back Deployments
Option 1: Git Revert
# Find the commit that introduced the bad change
cd ~/dev/k8s/helm-prod-values
git log --oneline myapp/values.yaml
# Revert that commit (adds a new commit that undoes it)
git revert <commit-hash>
git push
# ArgoCD will sync the rollback
Option 2: Manual Rollback
# Rollback to previous revision
kubectl rollout undo deployment myapp -n myapp
# Note: This will be reverted by ArgoCD selfHeal
# Make permanent by updating Git
Option 3: Change Image Tag
# Edit helm-values
cd ~/dev/k8s/helm-prod-values
vim myapp/values.yaml
# Change image tag to previous version
app:
  image:
    tag: v1.0.0 # Roll back from v1.0.1
# Commit and push
git add myapp/values.yaml
git commit -m "Rollback myapp to v1.0.0"
git push
Resource Updates
Update Resource Limits
# In helm-values/myapp/values.yaml
app:
  resources:
    requests:
      cpu: 200m # Increased from 100m
      memory: 512Mi # Increased from 256Mi
    limits:
      cpu: 1000m
      memory: 2Gi
Enable Database
# In helm-values/myapp/values.yaml
db:
  enabled: true
  persistence:
    size: 10Gi # Increase storage
Secret Management
Creating Secrets
Step 1: Get Public Certificate
# Fetch sealed-secrets public cert (one-time)
kubeseal --fetch-cert \
--controller-name=sealed-secrets-controller \
--controller-namespace=kube-system \
> pub-cert.pem
# Save this certificate for future use
Step 2: Create Plain Secret
# Method 1: From literal values
kubectl create secret generic myapp-credentials \
--from-literal=API_KEY=secret123 \
--from-literal=DB_PASSWORD=pass456 \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
# Method 2: From file
kubectl create secret generic myapp-credentials \
--from-file=.env \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
# Method 3: From multiple files
kubectl create secret generic myapp-credentials \
--from-file=api-key.txt \
--from-file=db-password.txt \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
Step 3: Seal Secret
kubeseal --format=yaml \
--cert=pub-cert.pem \
--namespace=myapp \
< private/myapp-credentials.yaml \
> secrets/myapp-credentials-sealed.yaml
Step 4: Commit Sealed Secret
git add secrets/myapp-credentials-sealed.yaml
git commit -m "Add myapp credentials"
git push
# Delete plain secret
rm private/myapp-credentials.yaml
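Because the plain and sealed files sit side by side until the last step, a small guard before committing can confirm that the file being added is the sealed one. This is a sketch only: the function name is_sealed is hypothetical, and it assumes the sealed file declares kind: SealedSecret at the top level, as kubeseal's YAML output does.

```shell
# Hypothetical guard: succeed only if the file declares a top-level
# "kind: SealedSecret" and does not declare a top-level plain "kind: Secret".
is_sealed() {
  grep -q '^kind: SealedSecret' "$1" &&
    ! grep -q '^kind: Secret[[:space:]]*$' "$1"
}

# Usage before Step 4:
#   if is_sealed secrets/myapp-credentials-sealed.yaml; then
#     git add secrets/myapp-credentials-sealed.yaml
#   else
#     echo "refusing to commit: file is not a SealedSecret" >&2
#   fi
```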
Updating Secrets
# 1. Create new version
kubectl create secret generic myapp-credentials \
--from-literal=API_KEY=new-secret-key \
--from-literal=DB_PASSWORD=new-password \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
# 2. Seal it
kubeseal --format=yaml \
--cert=pub-cert.pem \
--namespace=myapp \
< private/myapp-credentials.yaml \
> secrets/myapp-credentials-sealed.yaml
# 3. Commit
git add secrets/myapp-credentials-sealed.yaml
git commit -m "Update myapp credentials"
git push
# 4. Restart pods to pick up new secret
kubectl rollout restart deployment myapp -n myapp
# 5. Delete plain secret
rm private/myapp-credentials.yaml
Viewing Secrets (Unsealed)
# List secrets in namespace
kubectl get secrets -n myapp
# Describe secret (doesn't show values)
kubectl describe secret myapp-credentials -n myapp
# View secret values (base64 encoded)
kubectl get secret myapp-credentials -n myapp -o yaml
# Decode secret value
kubectl get secret myapp-credentials -n myapp \
-o jsonpath='{.data.API_KEY}' | base64 -d
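To decode every key in a secret rather than one at a time, the key/value pairs can be piped through a small decoder. A minimal sketch: the function name decode_pairs is hypothetical, and it assumes input lines of the form KEY=base64value, which the go-template in the usage comment is intended to produce.

```shell
# Hypothetical helper: read KEY=base64value lines on stdin and print
# KEY=decodedvalue. The line is split at the first "=" only, so base64
# padding characters at the end of the value are preserved.
decode_pairs() {
  while IFS= read -r line; do
    [ -n "$line" ] || continue
    key=${line%%=*}
    value=${line#*=}
    printf '%s=%s\n' "$key" "$(printf '%s' "$value" | base64 -d)"
  done
}

# Usage against a live cluster:
#   kubectl get secret myapp-credentials -n myapp \
#     -o go-template='{{range $k, $v := .data}}{{$k}}={{$v}}{{"\n"}}{{end}}' \
#     | decode_pairs
```

The output contains plaintext credentials, so avoid redirecting it to a file or pasting it into shared channels.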
Secret Cloning (Kyverno)
Secrets labeled allowedToBeCloned: "true" in the secrets namespace are automatically cloned to new namespaces.
# Example: secrets-namespace.yaml
apiVersion: v1
kind: Secret
metadata:
  name: shared-credentials
  namespace: secrets
  labels:
    allowedToBeCloned: "true"
type: Opaque
data:
  API_KEY: <base64-encoded-value>
When a new namespace is created, Kyverno automatically copies this secret.
Monitoring & Alerting
Prometheus Metrics
# Port forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Access: http://localhost:9090
Common Queries:
# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
# Memory usage per pod
sum(container_memory_usage_bytes) by (pod)
# Request rate per service
rate(http_requests_total[5m])
Grafana Dashboards
# Port forward to Grafana
kubectl port-forward -n monitoring svc/grafana 3000:80
# Access: http://localhost:3000
Loki Logs
# Port forward to Loki
kubectl port-forward -n monitoring svc/loki 3100:3100
# Query logs from the last hour (start is a Unix timestamp in nanoseconds)
curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
--data-urlencode 'query={namespace="myapp"}' \
--data-urlencode "start=$(( $(date +%s) - 3600 ))000000000" | jq
Fluent-Bit Log Shipping
Verify Fluent-Bit is shipping logs:
# Check Fluent-Bit pods
kubectl get pods -n monitoring | grep fluent-bit
# Check logs
kubectl logs -n monitoring daemonset/fluent-bit
# Verify Loki is receiving logs
kubectl logs -n monitoring deployment/loki | grep "POST /loki/api/v1/push"
Trivy Vulnerability Scanning
# Check Trivy scan results
kubectl get vulnerabilityreports --all-namespaces
# View report for specific pod
kubectl describe vulnerabilityreport -n myapp <report-name>
Slack Notifications
All applications have Slack notifications enabled:
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: ""
    notifications.argoproj.io/subscribe.on-sync-failed.slack: ""
    notifications.argoproj.io/subscribe.on-degraded.slack: ""
Test Notification:
# Trigger a sync to test
kubectl patch application myapp -n argocd \
--type merge \
-p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
Troubleshooting
Application Won't Sync
Check Application Status
kubectl describe application myapp -n argocd
Look for errors in:
- Status.Conditions
- Status.OperationState
Common Issues
Issue 1: Image Pull Error
# Error: ErrImagePull, ImagePullBackOff
# Check if image exists
docker pull ghcr.io/fortedigital/myapp:v1.0.0
# Check image pull secrets
kubectl get secrets -n myapp | grep regcred
# Check pod events
kubectl describe pod -n myapp <pod-name>
Issue 2: Invalid YAML
# Error: unable to decode manifest
# Validate YAML locally
kubectl apply --dry-run=client -f apps/myapp.yaml
# Check ArgoCD application controller logs
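# (the application controller normally runs as a StatefulSet; use deployment/ if your install differs)
kubectl logs -n argocd statefulset/argocd-application-controller | grep myapp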
Issue 3: Resource Quota Exceeded
# Error: exceeded quota
# Check namespace quotas
kubectl get resourcequota -n myapp
kubectl describe resourcequota -n myapp
# Increase quota or reduce resource requests
Pod Crashes
CrashLoopBackOff
# Check pod status
kubectl get pods -n myapp
# View logs
kubectl logs -n myapp <pod-name>
kubectl logs -n myapp <pod-name> --previous # Previous container
# Check events
kubectl describe pod -n myapp <pod-name>
Common Causes:
- Application error (check logs)
- Missing environment variables
- Wrong port configuration
- Missing secrets
- Insufficient memory/CPU
ImagePullBackOff
# Check image name
kubectl get deployment myapp -n myapp -o yaml | grep image
# Verify credentials
kubectl get secret -n myapp
Pending
# Check why pod is pending
kubectl describe pod -n myapp <pod-name>
# Common reasons:
# - Insufficient resources on nodes
# - PVC not bound
# - Node selector doesn't match
Ingress / TLS Issues
Application Not Accessible
# Check IngressRoute
kubectl get ingressroute -n myapp
kubectl describe ingressroute myapp -n myapp
# Check Traefik
kubectl get pods -n traefik
kubectl logs -n traefik deployment/traefik
# Test with port-forward
kubectl port-forward -n myapp service/myapp 8080:3000
curl http://localhost:8080
Certificate Issues
# Check certificates
kubectl get certificate -n myapp
kubectl describe certificate myapp-tls -n myapp
# Check cert-manager
kubectl get clusterissuer
kubectl logs -n cert-manager deployment/cert-manager
# Check Let's Encrypt challenges
kubectl get challenges --all-namespaces
Manual Certificate Renewal:
# Delete and recreate certificate
kubectl delete certificate myapp-tls -n myapp
# Certificate will be automatically recreated
Database Issues
PostgreSQL Won't Start
# Check StatefulSet
kubectl get statefulset -n myapp
kubectl describe statefulset postgres -n myapp
# Check PVC
kubectl get pvc -n myapp
kubectl describe pvc -n myapp
# Check logs
kubectl logs -n myapp postgres-0
Data Persistence
# Verify PVC is bound
kubectl get pvc -n myapp
# Check storage class
kubectl get storageclass
# Resize PVC (if supported)
kubectl edit pvc postgres-data-postgres-0 -n myapp
# Change: storage: 10Gi (from 5Gi)
Kyverno Policy Issues
Policy Violations
# List policies
kubectl get clusterpolicy
# Check policy reports
kubectl get policyreport --all-namespaces
# View specific policy
kubectl describe clusterpolicy secret-cloner
Secret Not Cloned
# Check if secret has label
kubectl get secret -n secrets --show-labels
# Check Kyverno logs
kubectl logs -n kyverno deployment/kyverno
# Manually trigger by recreating namespace
kubectl delete ns test-ns
kubectl create ns test-ns
ArgoCD Issues
ArgoCD UI Not Accessible
# Check ArgoCD pods
kubectl get pods -n argocd
# Restart ArgoCD server
kubectl rollout restart deployment argocd-server -n argocd
# Port forward
kubectl port-forward svc/argocd-server -n argocd 8080:443
Sync Takes Too Long
# Check application controller logs
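# (the application controller normally runs as a StatefulSet; use deployment/ if your install differs)
kubectl logs -n argocd statefulset/argocd-application-controller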
# Increase timeout (in apps/myapp.yaml)
spec:
  syncPolicy:
    retry:
      backoff:
        maxDuration: 5m # Increase from 3m
Disaster Recovery
Backup Strategy
Current State: No automated backups
What Needs Backup:
- ❌ Cluster state (not backed up - recreate via GitOps)
- ❌ Persistent volumes (currently not critical)
- ✅ Git repositories (GitHub provides backup)
- ⚠️ Secrets (sealed secrets in Git, unseal keys need safekeeping)
Cluster Rebuild
Scenario: Complete cluster failure
# 1. Provision new Kubernetes cluster
# 2. Configure kubectl
kubectl config use-context new-cluster
kubectl cluster-info
# 3. Bootstrap cluster
cd ~/dev/k8s/launchpad
./bootstrap.sh
# 4. Wait for ArgoCD to sync all applications
kubectl get applications -n argocd -w
# 5. Recreate any unsealed secrets (from password manager)
# 6. Configure DNS for new cluster IPs
# 7. Verify all applications are healthy
Time Estimate: 30-60 minutes
Data Loss:
- Ephemeral data: Lost
- Database data: Lost (no backups currently)
- Configuration: No loss (in Git)
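Step 4 of the rebuild can be scripted as a bounded polling loop instead of watching the output by hand. The following is a sketch under assumptions: list_apps and wait_for_apps are hypothetical names, and readiness is taken to mean that every application reports Synced and Healthy.

```shell
# Hypothetical helper: list applications as a NAME SYNC HEALTH table.
list_apps() {
  kubectl get applications -n argocd \
    -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status
}

# Poll until at least one application exists and all are Synced and Healthy.
# $1 = maximum number of polls, $2 = seconds to sleep between polls.
wait_for_apps() {
  max_polls=$1
  delay=$2
  poll=0
  while [ "$poll" -lt "$max_polls" ]; do
    # Count data rows (total) and rows that are not Synced + Healthy (bad).
    summary=$(list_apps | awk '
      NR > 1 { total++; if ($2 != "Synced" || $3 != "Healthy") bad++ }
      END { printf "%d:%d\n", total, bad }')
    total=${summary%%:*}
    pending=${summary##*:}
    if [ "$total" -gt 0 ] && [ "$pending" -eq 0 ]; then
      echo "all $total application(s) Synced and Healthy"
      return 0
    fi
    echo "$pending of $total application(s) not ready yet"
    poll=$((poll + 1))
    sleep "$delay"
  done
  return 1
}

# Usage: poll every 30 seconds for up to 60 minutes
#   wait_for_apps 120 30 || echo "cluster rebuild did not converge" >&2
```

An empty listing is treated as not ready, so a failed kubectl call does not count as success.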
Future Backup Plan
Recommended:
- Velero for cluster backups
  helm install velero vmware-tanzu/velero \
    --namespace velero \
    --create-namespace \
    --set configuration.provider=aws \
    --set configuration.backupStorageLocation[0].bucket=cluster-backups
- PostgreSQL backups via CronJob
  # pg-backup-cronjob.yaml
  kind: CronJob
  spec:
    schedule: "0 2 * * *" # Daily at 2am
    jobTemplate:
      spec:
        template:
          spec:
            containers:
              - name: pg-dump
                image: postgres:16-alpine
                command:
                  - /bin/sh
                  - -c
                  - pg_dump -U $DB_USER -d $DB_NAME > /backup/dump-$(date +%Y%m%d).sql
- Sealed Secrets private key backup
  # Backup sealed-secrets controller private key(s)
  # (the label selector matches the keys whatever their generated name suffix)
  kubectl get secret -n kube-system \
    -l sealedsecrets.bitnami.com/sealed-secrets-key \
    -o yaml > sealed-secrets-key-backup.yaml
  # Store in secure location (password manager, vault)
Maintenance Procedures
Upgrading ArgoCD
# Check current version
kubectl get deployment argocd-server -n argocd \
-o jsonpath='{.spec.template.spec.containers[0].image}'
# Update version in values
vim infra/values/argocd-values.yaml
# Or upgrade via Helm directly
helm upgrade argocd argo-cd \
--repo https://argoproj.github.io/argo-helm \
--namespace argocd \
--values infra/values/argocd-values.yaml \
--version 6.0.0 # New version
# Verify
kubectl get pods -n argocd
Upgrading Kubernetes Version
# UpCloud: Upgrade via control panel or CLI
# After upgrade, verify cluster
kubectl version
kubectl get nodes
# Check for deprecated APIs
kubectl api-resources
# Update any deprecated resources in Git
Rotating TLS Certificates
Let's Encrypt certificates auto-renew, but if manual rotation is needed:
# Delete certificate to force renewal
kubectl delete certificate myapp-tls -n myapp
# Cert-manager will automatically recreate
kubectl get certificate -n myapp -w
Cleaning Up Old Resources
# List all namespaces
kubectl get namespaces
# Remove unused namespaces
kubectl delete namespace old-app
# Clean up ArgoCD applications
kubectl get applications -n argocd
kubectl delete application old-app -n argocd
# Clean up old Docker images (on nodes)
# SSH to nodes and run:
docker image prune -a --filter "until=720h" # 30 days
DNS Management
Adding New Subdomain:
- Add DNS A record pointing to Traefik LoadBalancer IP
  # Get LoadBalancer IP
  kubectl get svc -n traefik traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
- Add to DNS provider:
  myapp.forteapps.net A <LoadBalancer-IP>
- Verify DNS propagation:
  nslookup myapp.forteapps.net
  dig myapp.forteapps.net
Monitoring Resource Usage
# Node resource usage
kubectl top nodes
# Pod resource usage
kubectl top pods --all-namespaces
# Identify resource hogs
kubectl top pods --all-namespaces --sort-by=memory
kubectl top pods --all-namespaces --sort-by=cpu
Advanced Operations
Adding a New Infrastructure Component
Example: Adding Redis
# 1. Create application manifest
cat > infra/redis-application.yaml <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: redis
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  project: default
  source:
    repoURL: https://charts.bitnami.com/bitnami
    chart: redis
    targetRevision: 18.0.0
    helm:
      values: |
        auth:
          enabled: true
          password: changeme
  destination:
    server: https://kubernetes.default.svc
    namespace: redis
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
EOF
# 2. Commit and push
git add infra/redis-application.yaml
git commit -m "Add Redis infrastructure component"
git push
# 3. ArgoCD will auto-sync within 60 seconds
Multi-Cluster Setup (Future)
For multi-cluster deployments:
# Different destinations per environment
# dev-cluster
destination:
  server: https://dev.k8s.example.com
  namespace: myapp
# prod-cluster
destination:
  server: https://prod.k8s.example.com
  namespace: myapp
Blue-Green Deployments
# Deploy blue version
helm install myapp-blue forteapp \
--set app.image.tag=v1.0.0
# Deploy green version
helm install myapp-green forteapp \
--set app.image.tag=v2.0.0
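# Switch traffic via IngressRoute
# (a JSON patch changes only the service name; a merge patch would
# replace the whole routes list and drop its other fields)
kubectl patch ingressroute myapp -n myapp --type json \
-p '[{"op":"replace","path":"/spec/routes/0/services/0/name","value":"myapp-green"}]'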
# Remove blue deployment after validation
helm uninstall myapp-blue
Emergency Procedures
Emergency Rollback
# Immediate rollback
kubectl rollout undo deployment myapp -n myapp
# Update Git to make permanent
cd ~/dev/k8s/helm-prod-values
git revert HEAD
git push
Emergency Scale Down
# Scale to zero (maintenance mode)
kubectl scale deployment myapp -n myapp --replicas=0
# Update Git
vim helm-values/myapp/values.yaml
# Set replicaCount: 0
git commit -am "Scale down myapp for maintenance"
git push
Emergency Application Removal
# Remove application but keep data
kubectl patch application myapp -n argocd \
-p '{"metadata":{"finalizers":[]}}' --type merge
kubectl delete application myapp -n argocd
# Resources remain in cluster
Useful Scripts
Sync All Applications
#!/bin/bash
# sync-all.sh
# Hard-refresh every application; auto-sync then applies any changes
for app in $(kubectl get applications -n argocd -o name); do
  kubectl patch "$app" -n argocd \
    --type merge \
    -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
done
Check All Applications Health
#!/bin/bash
# health-check.sh
kubectl get applications -n argocd \
-o custom-columns=\
NAME:.metadata.name,\
SYNC:.status.sync.status,\
HEALTH:.status.health.status,\
MESSAGE:.status.health.message
Seal Secret Helper
#!/bin/bash
# seal-secret.sh
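# Arguments: namespace, plain secret file, sealed output file
NAMESPACE=${1:-default}
SECRET_FILE=${2:-private/secret.yaml}
OUTPUT_FILE=${3:-secrets/secret-sealed.yaml}
# Quote the variables so paths containing spaces are handled correctly
kubeseal --format=yaml \
--cert=pub-cert.pem \
--namespace="$NAMESPACE" \
< "$SECRET_FILE" \
> "$OUTPUT_FILE"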
echo "Sealed secret created: $OUTPUT_FILE"
echo "Remember to delete: $SECRET_FILE"
Checklist Templates
New Application Deployment Checklist
- Application code repository created
- Dockerfile created and tested
- GitHub Actions workflow configured
- Helm values created in helm-prod-values/
- ArgoCD application manifest created in apps/
- Secrets created and sealed
- DNS record added for domain
- Application synced successfully
- Health check passed
- Slack notification received
- Application accessible via domain
- Monitoring configured
- Documentation updated
Incident Response Checklist
- Incident identified (Slack alert, monitoring)
- Severity assessed
- Incident channel created
- Initial investigation (logs, metrics, events)
- Root cause identified
- Mitigation applied
- Verification of fix
- Post-mortem scheduled
- Documentation updated
Last Updated: 2026-03-16
Maintained By: Platform Team
Emergency Contact: #platform-support on Slack