Forte/launchpad

Danijel Simeunovic d02da33700 docs

2026-03-16 11:00:42 +01:00

Operations Runbook

Overview
Cluster Bootstrap
Day-to-Day Operations
Application Management
Secret Management
Monitoring & Alerting
Troubleshooting
Disaster Recovery
Maintenance Procedures

Overview

This runbook provides operational procedures for maintaining the Kubernetes cluster and managing applications. It's intended for platform engineers and operators with full cluster access.

Operator Prerequisites

✅ Full kubectl access to cluster
✅ Write access to all Git repositories
✅ ArgoCD UI access
✅ Slack notifications configured
✅ Understanding of Kubernetes concepts

Cluster Bootstrap

Initial Cluster Setup

Bootstrap a new cluster from scratch:

Prerequisites

Kubernetes cluster running (UpCloud or any K8s cluster)
kubectl configured with admin access
Repositories cloned locally

# Verify cluster access
kubectl cluster-info
kubectl get nodes

Bootstrap Procedure

# 1. Clone config repository
git clone https://github.com/snothub/sturdy-adventure.git
cd sturdy-adventure

# 2. Set cluster name (optional)
export CLUSTER_NAME="prod-cluster-01"

# 3. Run bootstrap script
./bootstrap.sh

What Happens:

✅ Installs ArgoCD via Helm
✅ Configures ArgoCD with custom values
✅ Applies root App-of-Apps manifest
✅ ArgoCD automatically syncs all applications
✅ Infrastructure and apps deploy in waves

Verify Bootstrap

# Wait for ArgoCD to be ready
kubectl wait --for=condition=available --timeout=300s \
  deployment/argocd-server -n argocd

# Check ArgoCD applications
kubectl get applications -n argocd

# Expected output: infrastructure-apps, enterprise-apps, and all child apps
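The expected-output check above can be scripted. The following is a minimal sketch, assuming the application names listed in this runbook; missing_apps is a hypothetical helper, not part of bootstrap.sh.

```shell
# missing_apps NAME...: read `kubectl get applications -n argocd -o name`
# output on stdin and print every expected NAME that is absent.
missing_apps() {
  local present name
  # Strip the "application.argoproj.io/" resource prefix from each line.
  present=$(sed 's|^.*/||')
  for name in "$@"; do
    # -x matches whole lines only, -F treats the name as a literal string.
    if ! printf '%s\n' "$present" | grep -qxF -- "$name"; then
      printf '%s\n' "$name"
    fi
  done
}

# Usage (prints nothing when all expected apps exist):
#   kubectl get applications -n argocd -o name \
#     | missing_apps infrastructure-apps enterprise-apps
```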

Post-Bootstrap Steps

Configure DNS for ingress domains:
- argocd.127.0.0.1.nip.io (local dev)
- *.forteapps.net (production)

Verify Let's Encrypt certificates:

kubectl get certificate --all-namespaces
kubectl get clusterissuer

Check Kyverno policies:
```
kubectl get clusterpolicy
```
Verify monitoring stack:
```
kubectl get pods -n monitoring
```
Test Slack notifications by triggering a sync

Day-to-Day Operations

Monitoring ArgoCD Sync Status

Via Slack

All applications send notifications to a shared Slack channel:

✅ on-sync-succeeded - Deployment succeeded
❌ on-sync-failed - Deployment failed
⚠️ on-degraded - Application unhealthy

Via CLI

# List all applications
kubectl get applications -n argocd

# Watch application status
kubectl get applications -n argocd -w

# Get detailed status
kubectl describe application myapp -n argocd

Via ArgoCD UI

# Port forward to UI
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Access: https://localhost:8080
# (use http:// instead if argocd-server runs with --insecure, i.e. without its own TLS)
# No login required (insecure mode for internal use)

Checking Application Health

# Quick health check for all apps
kubectl get applications -n argocd \
  -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status

# Expected output:
# NAME                  SYNC        HEALTH
# infrastructure-apps   Synced      Healthy
# enterprise-apps       Synced      Healthy
# mcp10x                Synced      Healthy
# musicman              Synced      Healthy
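To surface only the rows that need attention, the table can be filtered. This is a sketch that assumes the three-column NAME, SYNC, HEALTH layout produced by the command above; unhealthy_apps is a hypothetical helper name.

```shell
# unhealthy_apps: read the NAME/SYNC/HEALTH table on stdin and print every
# application that is not both Synced and Healthy.
unhealthy_apps() {
  # NR > 1 skips the header row.
  awk 'NR > 1 && ($2 != "Synced" || $3 != "Healthy") {
    print $1 ": sync=" $2 ", health=" $3
  }'
}

# Usage:
#   kubectl get applications -n argocd \
#     -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status \
#     | unhealthy_apps
```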

Manual Sync

Force sync an application:

# Request a hard refresh (with auto-sync enabled, ArgoCD then syncs any detected drift)
kubectl patch application myapp -n argocd \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'

# Or via ArgoCD CLI (if installed)
argocd app sync myapp

Pausing Auto-Sync

Temporarily disable automatic syncing:

# Edit application
kubectl edit application myapp -n argocd

# Set automated to null
spec:
  syncPolicy:
    automated: null  # Disable auto-sync

# Re-enable later
spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
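The same change can be made non-interactively with a JSON merge patch, where a null value removes the field. This is a sketch; autosync_patch is a hypothetical helper that only prints the patch body. If the parent App-of-Apps self-heals child Application manifests, it may revert the change, so pausing via Git is the durable option.

```shell
# autosync_patch on|off: print a JSON merge patch that enables or disables
# automated sync on an ArgoCD Application.
autosync_patch() {
  case "$1" in
    off) printf '%s\n' '{"spec":{"syncPolicy":{"automated":null}}}' ;;
    on)  printf '%s\n' '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}' ;;
    *)   echo "usage: autosync_patch on|off" >&2; return 1 ;;
  esac
}

# Usage:
#   kubectl patch application myapp -n argocd --type merge -p "$(autosync_patch off)"
#   kubectl patch application myapp -n argocd --type merge -p "$(autosync_patch on)"
```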

Application Management

Deploying a New Application

See Developer Guide for detailed steps.

Quick checklist:

Create helm-values/myapp/values.yaml
Create apps/myapp.yaml in config repo
Create SealedSecret if needed
Commit and push changes
Verify sync in Slack/ArgoCD
Configure DNS for domain
Test application accessibility

Removing an Application

Safe Removal Procedure

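# 1. Delete ArgoCD Application (with cascade)
kubectl delete application myapp -n argocd

# This will:
# - Remove application from ArgoCD
# - Delete all managed Kubernetes resources (cascade; requires the
#   resources-finalizer.argocd.argoproj.io finalizer on the Application)
# - Leave the namespace in place unless it is itself a managed resource
#
# Note: with auto-sync on the parent App-of-Apps, the Application is
# recreated until apps/myapp.yaml is removed from Git (step 2).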

# 2. Clean up Git repositories
cd ~/dev/k8s/launchpad
git rm apps/myapp.yaml
git commit -m "Remove myapp application"
git push

cd ~/dev/k8s/helm-prod-values
git rm -r myapp/
git commit -m "Remove myapp values"
git push

# 3. Remove sealed secrets (if any)
cd ~/dev/k8s/launchpad
git rm secrets/myapp-credentials-sealed.yaml
git commit -m "Remove myapp secrets"
git push

Removal Without Cascade

To remove from ArgoCD but keep resources running:

# Delete application with no cascade
kubectl patch application myapp -n argocd \
  -p '{"metadata":{"finalizers":[]}}' --type merge
kubectl delete application myapp -n argocd

# Resources remain in cluster but are no longer managed

Scaling Applications

Manual Scaling

# Scale deployment directly
kubectl scale deployment myapp -n myapp --replicas=3

# Note: If selfHeal is enabled, this will be reverted

GitOps Scaling

Update helm-values/myapp/values.yaml:

app:
  replicaCount: 3  # Change from 1 to 3

Commit and push - ArgoCD will sync.

Auto-Scaling (HPA)

Enable Horizontal Pod Autoscaler:

# In helm-values/myapp/values.yaml
app:
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

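Note: With HPA enabled, the autoscaler owns the replica count. Keep /spec/replicas in the ArgoCD ignore list so selfHeal does not revert the HPA's scaling decisions:

# In apps/myapp.yaml
ignoreDifferences:
- group: apps
  kind: Deployment
  jsonPointers:
  - /spec/replicas  # Keep this entry when HPA is enabled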
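
To reason about what the autoscaler will do with the values above, the documented HPA rule is desired = ceil(current * currentUtilization / targetUtilization), clamped to minReplicas and maxReplicas. The sketch below reproduces only that arithmetic; it ignores the controller's tolerance band and stabilization windows, and hpa_desired is a hypothetical helper name.

```shell
# hpa_desired CURRENT USAGE_PCT TARGET_PCT MIN MAX: replica count the HPA
# formula yields for the given average CPU utilization.
hpa_desired() {
  local current usage target min max desired
  current=$1; usage=$2; target=$3; min=$4; max=$5
  # Integer ceiling of current * usage / target.
  desired=$(( (current * usage + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

# Example: 2 replicas at 90% average CPU with a 70% target scale to 3.
#   hpa_desired 2 90 70 2 10
```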

Rolling Back Deployments

Option 1: Git Revert

# Find the commit before the bad change
cd ~/dev/k8s/helm-prod-values
git log --oneline myapp/values.yaml

# Revert to previous version
git revert <commit-hash>
git push

# ArgoCD will sync the rollback

Option 2: Manual Rollback

# Rollback to previous revision
kubectl rollout undo deployment myapp -n myapp

# Note: This will be reverted by ArgoCD selfHeal
# Make permanent by updating Git

Option 3: Change Image Tag

# Edit helm-values
cd ~/dev/k8s/helm-prod-values
vim myapp/values.yaml

# Change image tag to previous version
app:
  image:
    tag: v1.0.0  # Roll back from v1.0.1

# Commit and push
git add myapp/values.yaml
git commit -m "Rollback myapp to v1.0.0"
git push

Resource Updates

Update Resource Limits

# In helm-values/myapp/values.yaml
app:
  resources:
    requests:
      cpu: 200m      # Increased from 100m
      memory: 512Mi  # Increased from 256Mi
    limits:
      cpu: 1000m
      memory: 2Gi

Enable Database

# In helm-values/myapp/values.yaml
db:
  enabled: true
  persistence:
    size: 10Gi  # Increase storage

Secret Management

Creating Secrets

Step 1: Get Public Certificate

# Fetch sealed-secrets public cert (one-time)
kubeseal --fetch-cert \
  --controller-name=sealed-secrets-controller \
  --controller-namespace=kube-system \
  > pub-cert.pem

# Save this certificate for future use

Step 2: Create Plain Secret

# Method 1: From literal values
kubectl create secret generic myapp-credentials \
  --from-literal=API_KEY=secret123 \
  --from-literal=DB_PASSWORD=pass456 \
  --namespace=myapp \
  --dry-run=client -o yaml > private/myapp-credentials.yaml

# Method 2: From file
kubectl create secret generic myapp-credentials \
  --from-file=.env \
  --namespace=myapp \
  --dry-run=client -o yaml > private/myapp-credentials.yaml

# Method 3: From multiple files
kubectl create secret generic myapp-credentials \
  --from-file=api-key.txt \
  --from-file=db-password.txt \
  --namespace=myapp \
  --dry-run=client -o yaml > private/myapp-credentials.yaml

Step 3: Seal Secret

kubeseal --format=yaml \
  --cert=pub-cert.pem \
  --namespace=myapp \
  < private/myapp-credentials.yaml \
  > secrets/myapp-credentials-sealed.yaml

Step 4: Commit Sealed Secret

git add secrets/myapp-credentials-sealed.yaml
git commit -m "Add myapp credentials"
git push

# Delete plain secret
rm private/myapp-credentials.yaml
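
A guard before committing can catch a plain Secret that was saved into secrets/ by mistake. This is a sketch, assuming each manifest declares its kind on a top-level line; check_sealed is a hypothetical helper.

```shell
# check_sealed FILE...: return non-zero if any file declares a plain
# Kubernetes Secret (kind: Secret) instead of a SealedSecret.
check_sealed() {
  local f status
  status=0
  for f in "$@"; do
    # Only an unindented "kind: Secret" line counts; "kind: SealedSecret" does not match.
    if grep -qE '^kind:[[:space:]]*Secret[[:space:]]*$' "$f"; then
      echo "UNSEALED: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   check_sealed secrets/*.yaml && git commit -m "Add myapp credentials"
```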

Updating Secrets

# 1. Create new version
kubectl create secret generic myapp-credentials \
  --from-literal=API_KEY=new-secret-key \
  --from-literal=DB_PASSWORD=new-password \
  --namespace=myapp \
  --dry-run=client -o yaml > private/myapp-credentials.yaml

# 2. Seal it
kubeseal --format=yaml \
  --cert=pub-cert.pem \
  --namespace=myapp \
  < private/myapp-credentials.yaml \
  > secrets/myapp-credentials-sealed.yaml

# 3. Commit
git add secrets/myapp-credentials-sealed.yaml
git commit -m "Update myapp credentials"
git push

# 4. Restart pods to pick up new secret
kubectl rollout restart deployment myapp -n myapp

# 5. Delete plain secret
rm private/myapp-credentials.yaml

Viewing Secrets (Unsealed)

# List secrets in namespace
kubectl get secrets -n myapp

# Describe secret (doesn't show values)
kubectl describe secret myapp-credentials -n myapp

# View secret values (base64 encoded)
kubectl get secret myapp-credentials -n myapp -o yaml

# Decode secret value
kubectl get secret myapp-credentials -n myapp \
  -o jsonpath='{.data.API_KEY}' | base64 -d

Secret Cloning (Kyverno)

Secrets labeled allowedToBeCloned: "true" in the secrets namespace are automatically cloned to new namespaces.

# Example: secrets-namespace.yaml
apiVersion: v1
kind: Secret
metadata:
  name: shared-credentials
  namespace: secrets
  labels:
    allowedToBeCloned: "true"
type: Opaque
data:
  API_KEY: <base64-encoded-value>

When a new namespace is created, Kyverno automatically copies this secret.

Monitoring & Alerting

Prometheus Metrics

# Port forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-server 9090:80

# Access: http://localhost:9090

Common Queries:

# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Memory usage per pod
sum(container_memory_usage_bytes) by (pod)

# Request rate per service
rate(http_requests_total[5m])

Grafana Dashboards

# Port forward to Grafana
kubectl port-forward -n monitoring svc/grafana 3000:80

# Access: http://localhost:3000

Loki Logs

# Port forward to Loki
kubectl port-forward -n monitoring svc/loki 3100:3100

# Query logs from the last hour
curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="myapp"}' \
  --data-urlencode 'since=1h' | jq
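
When an explicit time window is needed, the query_range endpoint accepts start and end as Unix epoch timestamps in nanoseconds. The sketch below computes such a value; loki_start_ns is a hypothetical helper, and the optional second argument exists only to make the result reproducible.

```shell
# loki_start_ns SECONDS_AGO [NOW_EPOCH]: print the nanosecond Unix timestamp
# that lies SECONDS_AGO before now (or before NOW_EPOCH, if given).
loki_start_ns() {
  local ago now
  ago=$1
  now=${2:-$(date +%s)}
  echo $(( (now - ago) * 1000000000 ))
}

# Usage:
#   curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
#     --data-urlencode 'query={namespace="myapp"}' \
#     --data-urlencode "start=$(loki_start_ns 3600)" | jq
```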

Fluent-Bit Log Shipping

Verify Fluent-Bit is shipping logs:

# Check Fluent-Bit pods
kubectl get pods -n monitoring | grep fluent-bit

# Check logs
kubectl logs -n monitoring daemonset/fluent-bit

# Verify Loki is receiving logs
kubectl logs -n monitoring deployment/loki | grep "POST /loki/api/v1/push"

Trivy Vulnerability Scanning

# Check Trivy scan results
kubectl get vulnerabilityreports --all-namespaces

# View report for specific pod
kubectl describe vulnerabilityreport -n myapp <report-name>

Slack Notifications

All applications have Slack notifications enabled:

metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: ""
    notifications.argoproj.io/subscribe.on-sync-failed.slack: ""
    notifications.argoproj.io/subscribe.on-degraded.slack: ""

Test Notification:

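# Request a hard refresh to test; a notification is sent only if this leads to
# a sync operation (otherwise run: argocd app sync myapp)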
kubectl patch application myapp -n argocd \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'

Troubleshooting

Application Won't Sync

Check Application Status

kubectl describe application myapp -n argocd

Look for errors in:

Status.Conditions
Status.OperationState

Common Issues

Issue 1: Image Pull Error

# Error: ErrImagePull, ImagePullBackOff

# Check if image exists
docker pull ghcr.io/fortedigital/myapp:v1.0.0

# Check image pull secrets
kubectl get secrets -n myapp | grep regcred

# Check pod events
kubectl describe pod -n myapp <pod-name>

Issue 2: Invalid YAML

# Error: unable to decode manifest

# Validate YAML locally
kubectl apply --dry-run=client -f apps/myapp.yaml

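# Check ArgoCD application controller logs
# (the controller runs as a StatefulSet in current ArgoCD releases;
# use deployment/... on installs where it is a Deployment)
kubectl logs -n argocd statefulset/argocd-application-controller | grep myapp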

Issue 3: Resource Quota Exceeded

# Error: exceeded quota

# Check namespace quotas
kubectl get resourcequota -n myapp
kubectl describe resourcequota -n myapp

# Increase quota or reduce resource requests

Pod Crashes

CrashLoopBackOff

# Check pod status
kubectl get pods -n myapp

# View logs
kubectl logs -n myapp <pod-name>
kubectl logs -n myapp <pod-name> --previous  # Previous container

# Check events
kubectl describe pod -n myapp <pod-name>

Common Causes:

Application error (check logs)
Missing environment variables
Wrong port configuration
Missing secrets
Insufficient memory/CPU

ImagePullBackOff

# Check image name
kubectl get deployment myapp -n myapp -o yaml | grep image

# Verify credentials
kubectl get secret -n myapp

Pending

# Check why pod is pending
kubectl describe pod -n myapp <pod-name>

# Common reasons:
# - Insufficient resources on nodes
# - PVC not bound
# - Node selector doesn't match

Ingress / TLS Issues

Application Not Accessible

# Check IngressRoute
kubectl get ingressroute -n myapp
kubectl describe ingressroute myapp -n myapp

# Check Traefik
kubectl get pods -n traefik
kubectl logs -n traefik deployment/traefik

# Test with port-forward
kubectl port-forward -n myapp service/myapp 8080:3000
curl http://localhost:8080

Certificate Issues

# Check certificates
kubectl get certificate -n myapp
kubectl describe certificate myapp-tls -n myapp

# Check cert-manager
kubectl get clusterissuer
kubectl logs -n cert-manager deployment/cert-manager

# Check Let's Encrypt challenges
kubectl get challenges --all-namespaces

Manual Certificate Renewal:

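# Trigger re-issuance (requires the cmctl CLI)
cmctl renew myapp-tls -n myapp

# Without cmctl: delete the TLS secret so cert-manager issues a new one
# (secret name assumed to be myapp-tls; check spec.secretName on the Certificate).
# Deleting only the Certificate usually leaves the existing secret in place,
# so no new certificate is issued.
kubectl delete secret myapp-tls -n myapp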

Database Issues

PostgreSQL Won't Start

# Check StatefulSet
kubectl get statefulset -n myapp
kubectl describe statefulset postgres -n myapp

# Check PVC
kubectl get pvc -n myapp
kubectl describe pvc -n myapp

# Check logs
kubectl logs -n myapp postgres-0

Data Persistence

# Verify PVC is bound
kubectl get pvc -n myapp

# Check storage class
kubectl get storageclass

# Resize PVC (if supported)
kubectl edit pvc postgres-data-postgres-0 -n myapp
# Change: storage: 10Gi (from 5Gi)

Kyverno Policy Issues

Policy Violations

# List policies
kubectl get clusterpolicy

# Check policy reports
kubectl get policyreport --all-namespaces

# View specific policy
kubectl describe clusterpolicy secret-cloner

Secret Not Cloned

# Check if secret has label
kubectl get secret -n secrets --show-labels

# Check Kyverno logs
kubectl logs -n kyverno deployment/kyverno

# Manually trigger by recreating namespace
kubectl delete ns test-ns
kubectl create ns test-ns

ArgoCD Issues

ArgoCD UI Not Accessible

# Check ArgoCD pods
kubectl get pods -n argocd

# Restart ArgoCD server
kubectl rollout restart deployment argocd-server -n argocd

# Port forward
kubectl port-forward svc/argocd-server -n argocd 8080:443

Sync Takes Too Long

# Check application controller logs
# (a StatefulSet in current ArgoCD releases)
kubectl logs -n argocd statefulset/argocd-application-controller

# Allow a longer retry backoff (in apps/myapp.yaml)
spec:
  syncPolicy:
    retry:
      backoff:
        maxDuration: 5m  # Cap on the delay between retries; increase from 3m

Disaster Recovery

Backup Strategy

Current State: No automated backups

What Needs Backup:

❌ Cluster state (not backed up - recreate via GitOps)
❌ Persistent volumes (currently not critical)
✅ Git repositories (GitHub provides backup)
⚠️ Secrets (sealed secrets in Git, unseal keys need safekeeping)

Cluster Rebuild

Scenario: Complete cluster failure

# 1. Provision new Kubernetes cluster

# 2. Configure kubectl
kubectl config use-context new-cluster
kubectl cluster-info

# 3. Bootstrap cluster
cd ~/dev/k8s/launchpad
./bootstrap.sh

# 4. Wait for ArgoCD to sync all applications
kubectl get applications -n argocd -w

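# 5. Restore the sealed-secrets private key backup (otherwise the SealedSecrets
#    in Git cannot be decrypted), then recreate any secrets kept outside Git
#    (from password manager)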
# 6. Configure DNS for new cluster IPs
# 7. Verify all applications are healthy

Time Estimate: 30-60 minutes

Data Loss:

Ephemeral data: Lost
Database data: Lost (no backups currently)
Configuration: No loss (in Git)

Future Backup Plan

Recommended:

Velero for cluster backups

helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --set configuration.provider=aws \
  --set configuration.backupStorageLocation[0].bucket=cluster-backups

PostgreSQL backups via CronJob

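# pg-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup     # Placeholder name
  namespace: myapp
spec:
  schedule: "0 2 * * *"  # Daily at 2am
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure  # Job pods must set OnFailure or Never
          containers:
          - name: pg-dump
            image: postgres:16-alpine
            command:
            - /bin/sh
            - -c
            # DB_HOST, DB_USER, DB_NAME and PGPASSWORD must be provided via env/secret,
            # and /backup must be a mounted volume (both omitted here)
            - pg_dump -h $DB_HOST -U $DB_USER -d $DB_NAME > /backup/dump-$(date +%Y%m%d).sql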

Sealed Secrets private key backup

# Backup sealed-secrets controller private key(s)
# (key secrets have generated names, so select them by label)
kubectl get secret -n kube-system \
  -l sealedsecrets.bitnami.com/sealed-secrets-key \
  -o yaml > sealed-secrets-key-backup.yaml

# Store in secure location (password manager, vault)

Maintenance Procedures

Upgrading ArgoCD

# Check current version
kubectl get deployment argocd-server -n argocd \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Update version in values
vim infra/values/argocd-values.yaml

# Or upgrade via Helm directly
helm upgrade argocd argo-cd \
  --repo https://argoproj.github.io/argo-helm \
  --namespace argocd \
  --values infra/values/argocd-values.yaml \
  --version 6.0.0  # New version

# Verify
kubectl get pods -n argocd
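
To compare versions before and after the upgrade, the tag can be extracted from the image string returned by the jsonpath query above. This is a sketch; image_tag is a hypothetical helper that handles an optional digest suffix and registries with a port.

```shell
# image_tag IMAGE: print the tag portion of a container image reference.
image_tag() {
  local image last
  image=${1%%@*}      # drop any @sha256:... digest suffix
  last=${image##*/}   # final path component, e.g. argocd:v2.10.0
  case "$last" in
    *:*) echo "${last##*:}" ;;
    *)   echo "latest" ;;   # no explicit tag means the implicit "latest"
  esac
}

# Usage:
#   image_tag "$(kubectl get deployment argocd-server -n argocd \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')"
```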

Upgrading Kubernetes Version

# UpCloud: Upgrade via control panel or CLI

# After upgrade, verify cluster
kubectl version
kubectl get nodes

# List the API resources the upgraded cluster still serves; manifests in Git
# that use a removed apiVersion will fail to sync
kubectl api-resources

# Update any deprecated resources in Git

Rotating TLS Certificates

Let's Encrypt certificates auto-renew, but if manual rotation is needed:

# Force renewal (requires the cmctl CLI; deleting only the Certificate usually
# leaves the existing secret in place, so nothing is re-issued)
cmctl renew myapp-tls -n myapp

# Watch the certificate become Ready again
kubectl get certificate -n myapp -w

Cleaning Up Old Resources

# List all namespaces
kubectl get namespaces

# Remove unused namespaces
kubectl delete namespace old-app

# Clean up ArgoCD applications
kubectl get applications -n argocd
kubectl delete application old-app -n argocd

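# Clean up old container images (on nodes)
# The kubelet garbage-collects unused images on its own; prune manually only if needed.
# SSH to nodes and run:
#   containerd nodes: crictl rmi --prune
#   Docker nodes:
docker image prune -a --filter "until=720h"  # 30 days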

DNS Management

Adding New Subdomain:

Add DNS A record pointing to Traefik LoadBalancer IP

# Get LoadBalancer IP
kubectl get svc -n traefik traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

Add to DNS provider:

myapp.forteapps.net  A  <LoadBalancer-IP>

Verify DNS propagation:

nslookup myapp.forteapps.net
dig myapp.forteapps.net
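
The propagation check can be automated by comparing the resolver's answer with the LoadBalancer IP. This is a sketch, assuming the LoadBalancer exposes an IP rather than a hostname; dns_points_to is a hypothetical helper.

```shell
# dns_points_to EXPECTED_IP: read `dig +short` output on stdin and succeed
# if any returned line is exactly EXPECTED_IP.
dns_points_to() {
  # -x matches whole lines, -F keeps the dots in the address literal.
  grep -qxF -- "$1"
}

# Usage:
#   lb_ip=$(kubectl get svc -n traefik traefik \
#     -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
#   dig +short myapp.forteapps.net | dns_points_to "$lb_ip" && echo "DNS OK"
```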

Monitoring Resource Usage

# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods --all-namespaces

# Identify resource hogs
kubectl top pods --all-namespaces --sort-by=memory
kubectl top pods --all-namespaces --sort-by=cpu
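
Per-namespace totals are often more useful than per-pod rows when looking for resource hogs. This is a sketch, assuming the NAMESPACE, NAME, CPU, MEMORY column layout of kubectl top pods --all-namespaces with memory reported in Mi; mem_by_namespace is a hypothetical helper.

```shell
# mem_by_namespace: read `kubectl top pods --all-namespaces` output on stdin
# and print the summed memory usage per namespace.
mem_by_namespace() {
  awk '
    $4 ~ /Mi$/ {                 # the header row has no Mi suffix and is skipped
      sub(/Mi$/, "", $4)         # strip the unit so the value is numeric
      total[$1] += $4
    }
    END { for (ns in total) print ns, total[ns] "Mi" }
  ' | sort
}

# Usage:
#   kubectl top pods --all-namespaces | mem_by_namespace
```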

Advanced Operations

Adding a New Infrastructure Component

Example: Adding Redis

# 1. Create application manifest
cat > infra/redis-application.yaml <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: redis
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  project: default
  source:
    repoURL: https://charts.bitnami.com/bitnami
    chart: redis
    targetRevision: 18.0.0
    helm:
      values: |
        auth:
          enabled: true
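          password: changeme  # Placeholder only; never commit real credentials (prefer auth.existingSecret backed by a SealedSecret)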
  destination:
    server: https://kubernetes.default.svc
    namespace: redis
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
EOF

# 2. Commit and push
git add infra/redis-application.yaml
git commit -m "Add Redis infrastructure component"
git push

# 3. ArgoCD will auto-sync on its next repository poll
#    (every 3 minutes by default, or immediately if a Git webhook is configured)

Multi-Cluster Setup (Future)

For multi-cluster deployments:

# Different destinations per environment
# dev-cluster
destination:
  server: https://dev.k8s.example.com
  namespace: myapp

# prod-cluster
destination:
  server: https://prod.k8s.example.com
  namespace: myapp

Blue-Green Deployments

# Deploy blue version
helm install myapp-blue forteapp \
  --set app.image.tag=v1.0.0

# Deploy green version
helm install myapp-green forteapp \
  --set app.image.tag=v2.0.0

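# Switch traffic via IngressRoute
# (a JSON patch changes only the service name; a merge patch would replace
# the whole routes list and drop its match rules)
kubectl patch ingressroute myapp -n myapp --type json \
  -p '[{"op":"replace","path":"/spec/routes/0/services/0/name","value":"myapp-green"}]'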

# Remove blue deployment after validation
helm uninstall myapp-blue

Emergency Procedures

Emergency Rollback

# Immediate rollback
kubectl rollout undo deployment myapp -n myapp

# Update Git to make permanent
cd ~/dev/k8s/helm-prod-values
git revert HEAD
git push

Emergency Scale Down

# Scale to zero (maintenance mode)
kubectl scale deployment myapp -n myapp --replicas=0

# Update Git
vim helm-values/myapp/values.yaml
# Set replicaCount: 0
git commit -am "Scale down myapp for maintenance"
git push

Emergency Application Removal

# Remove application but keep data
kubectl patch application myapp -n argocd \
  -p '{"metadata":{"finalizers":[]}}' --type merge
kubectl delete application myapp -n argocd

# Resources remain in cluster

Useful Scripts

Sync All Applications

#!/bin/bash
# sync-all.sh
# Requests a hard refresh of every application; auto-sync then applies any drift.
for app in $(kubectl get applications -n argocd -o name); do
  kubectl patch "$app" -n argocd \
    --type merge \
    -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
done

Check All Applications Health

#!/bin/bash
# health-check.sh
kubectl get applications -n argocd \
  -o custom-columns=\
NAME:.metadata.name,\
SYNC:.status.sync.status,\
HEALTH:.status.health.status,\
MESSAGE:.status.health.message

Seal Secret Helper

#!/bin/bash
# seal-secret.sh
set -euo pipefail  # Stop on errors so a failed seal never reports success

NAMESPACE=${1:-default}
SECRET_FILE=${2:-private/secret.yaml}
OUTPUT_FILE=${3:-secrets/secret-sealed.yaml}

# Quote all variables so paths with spaces are handled correctly
kubeseal --format=yaml \
  --cert=pub-cert.pem \
  --namespace="$NAMESPACE" \
  < "$SECRET_FILE" \
  > "$OUTPUT_FILE"

echo "Sealed secret created: $OUTPUT_FILE"
echo "Remember to delete: $SECRET_FILE"

Checklist Templates

New Application Deployment Checklist

Application code repository created
Dockerfile created and tested
GitHub Actions workflow configured
Helm values created in helm-prod-values/
ArgoCD application manifest created in apps/
Secrets created and sealed
DNS record added for domain
Application synced successfully
Health check passed
Slack notification received
Application accessible via domain
Monitoring configured
Documentation updated

Incident Response Checklist

Incident identified (Slack alert, monitoring)
Severity assessed
Incident channel created
Initial investigation (logs, metrics, events)
Root cause identified
Mitigation applied
Verification of fix
Post-mortem scheduled
Documentation updated

Last Updated: 2026-03-16
Maintained By: Platform Team
Emergency Contact: #platform-support on Slack
