launchpad/docs/OPERATIONS-RUNBOOK.md
2026-04-13 15:54:14 +02:00

Operations Runbook

Overview

This runbook provides operational procedures for maintaining the Kubernetes cluster and managing applications. It's intended for platform engineers and operators with full cluster access.

Operator Prerequisites

  • Full kubectl access to cluster
  • Write access to all Git repositories
  • ArgoCD UI access
  • Slack notifications configured
  • Understanding of Kubernetes concepts

Cluster Bootstrap

Initial Cluster Setup

Bootstrap a new cluster from scratch:

Prerequisites

  1. Kubernetes cluster running (UpCloud or any K8s cluster)
  2. kubectl configured with admin access
  3. Repositories cloned locally
# Verify cluster access
kubectl cluster-info
kubectl get nodes

Bootstrap Procedure

# 1. Clone config repository
git clone https://github.com/fortedigital/sturdy-adventure.git
cd sturdy-adventure

# 2. Set cluster name (optional)
export CLUSTER_NAME="prod-cluster-01"

# 3. Run bootstrap script
./bootstrap.sh

What Happens:

  1. Installs ArgoCD via Helm
  2. Configures ArgoCD with custom values
  3. Applies root App-of-Apps manifest
  4. ArgoCD automatically syncs all applications
  5. Infrastructure and apps deploy in waves

Verify Bootstrap

# Wait for ArgoCD to be ready
kubectl wait --for=condition=available --timeout=300s \
  deployment/argocd-server -n argocd

# Check ArgoCD applications
kubectl get applications -n argocd

# Expected output: infrastructure-apps, enterprise-apps, and all child apps

Post-Bootstrap Steps

  1. Configure DNS for ingress domains:

    • argocd.127.0.0.1.nip.io (local dev)
    • *.forteapps.net (production)
  2. Verify Let's Encrypt certificates:

    kubectl get certificate --all-namespaces
    kubectl get clusterissuer
    
  3. Check Kyverno policies:

    kubectl get clusterpolicy
    
  4. Verify monitoring stack:

    kubectl get pods -n monitoring
    
  5. Test Slack notifications by triggering a sync

ArgoCD Repository Access Setup

ArgoCD needs SSH access to private Git repositories to pull manifests and Helm values. This section covers setting up deploy keys for GitHub repositories.

Why Deploy Keys?

  • Read-only access: Deploy keys provide secure, read-only access to repositories
  • No user credentials: No need to share personal SSH keys or tokens
  • Repository-specific: Each repository gets its own key for better security
  • Revocable: Easy to revoke access without affecting other repositories

Prerequisites

  • kubectl access to the cluster
  • Write access to the GitHub repository
  • ArgoCD installed and running

Setup Procedure

Step 1: Generate SSH Key Pair

Generate a dedicated SSH key for ArgoCD without a passphrase (required for automated access):

# Generate ED25519 key (recommended - smaller and more secure)
ssh-keygen -t ed25519 -C "argocd-deploy-key-sturdy-adventure" -f argocd-deploy-key -N ""

# Or RSA key if ED25519 is not supported
ssh-keygen -t rsa -b 4096 -C "argocd-deploy-key-sturdy-adventure" -f argocd-deploy-key -N ""

This creates two files:

  • argocd-deploy-key - Private key (keep secret)
  • argocd-deploy-key.pub - Public key (add to GitHub)

Step 2: Add Public Key to GitHub

  1. Copy the public key:

    cat argocd-deploy-key.pub
    
  2. Go to GitHub repository settings:

    • Navigate to: https://github.com/fortedigital/sturdy-adventure/settings/keys
    • Or: Repository → Settings → Deploy keys
  3. Click "Add deploy key"

    • Title: ArgoCD Production Cluster
    • Key: Paste the public key content
    • ☐ Allow write access (leave unchecked - read-only is sufficient)
    • Click "Add key"
  4. Repeat for the helm-values repository if it's private:

    # Generate separate key for helm-values repo
    ssh-keygen -t ed25519 -C "argocd-deploy-key-helm-values" -f argocd-helm-values-key -N ""
    
    # Add to: https://github.com/fortedigital/helm-values/settings/keys
    

Step 3: Create Kubernetes Secret

Add the private key to ArgoCD as a repository secret:

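Save the following manifest as secret.yaml in the gitignored private/ folder. The example registers the forte-helm repository; adjust name and url for each repository you add: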

  apiVersion: v1
  kind: Secret
  metadata:
    name: forte-helm-repo
    namespace: argocd
    labels:
      argocd.argoproj.io/secret-type: repository
  stringData:
    type: git
    url: ssh://git@git.forteapps.net:2222/Forte/forte-helm.git
    sshPrivateKey: |
      <paste your private key here>
    project: default

Seal the secret with kubeseal:

kubeseal --format=yaml \
  --namespace=argocd \
  < private/secret.yaml \
  > secrets/forte-helm-repo-secret-sealed.yaml

Step 4: Register Repository in ArgoCD

Commit secrets/forte-helm-repo-secret-sealed.yaml and push. ArgoCD syncs the SealedSecret, and the Sealed Secrets controller unseals it into the repository Secret.

Step 5: Verify Repository Access

# Check if repository is connected
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository

# Verify connection in ArgoCD UI
# Settings → Repositories → Should show "Successful" status

# Test by creating an application
kubectl apply -f _app-of-apps.yaml

# Check application sync status
kubectl get applications -n argocd

Testing Repository Access

Create a test application to verify SSH access:

cat > /tmp/test-repo-access.yaml <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: test-repo-access
  namespace: argocd
spec:
  project: default
  source:
    repoURL: git@github.com:fortedigital/sturdy-adventure.git
    targetRevision: main
    path: cluster-resources
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated: null  # Manual sync for testing
EOF

kubectl apply -f /tmp/test-repo-access.yaml

# Check if ArgoCD can access the repository
kubectl describe application test-repo-access -n argocd

# Print the sync status: Synced or OutOfSync means the repository was read; Unknown points to an access error
kubectl get application test-repo-access -n argocd -o jsonpath='{.status.sync.status}'

# Clean up test application
kubectl delete application test-repo-access -n argocd
rm /tmp/test-repo-access.yaml

Security Best Practices

  1. Secure Private Keys

    # Store private key securely and delete local copy
    # Option 1: Store in password manager (recommended)
    # Option 2: Backup to encrypted storage
    
    # Delete local private key after adding to Kubernetes
    shred -u argocd-deploy-key
    
    # Or on Windows
    # Remove-Item -Path argocd-deploy-key -Force
    
  2. Rotate Keys Regularly

    # Generate new key
    ssh-keygen -t ed25519 -C "argocd-deploy-key-$(date +%Y%m)" -f argocd-new-key -N ""
    
    # Add new public key to GitHub (keep old key for now)
    
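    # Update Kubernetes secret (ArgoCD needs type, url, and the repository label)
    kubectl create secret generic repo-sturdy-adventure \
      --from-file=sshPrivateKey=argocd-new-key \
      --from-literal=type=git \
      --from-literal=url=git@github.com:fortedigital/sturdy-adventure.git \
      --namespace=argocd \
      --dry-run=client -o yaml | \
      kubectl label --local -f - argocd.argoproj.io/secret-type=repository --dry-run=client -o yaml | \
      kubectl apply -f -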
    
    # Test access, then remove old deploy key from GitHub
    
    # Clean up
    shred -u argocd-new-key
    
  3. Audit Repository Access

    # List all repository secrets
    kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository
    
    # Review deploy keys in GitHub
    # Visit: https://github.com/fortedigital/sturdy-adventure/settings/keys
    
  4. Use Different Keys per Repository

    • Don't reuse the same deploy key across repositories
    • If one key is compromised, only one repository is affected
    • Easier to track and audit access

Troubleshooting Repository Access

Issue: "permission denied (publickey)"

# Check if secret exists
kubectl get secret repo-sturdy-adventure -n argocd

# Verify secret has correct label
kubectl get secret repo-sturdy-adventure -n argocd -o yaml | grep argocd.argoproj.io/secret-type

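# Check ArgoCD logs (Git fetches run in the repo-server; the application
# controller is a StatefulSet in current ArgoCD releases)
kubectl logs -n argocd deployment/argocd-repo-server | grep -i "permission denied"
kubectl logs -n argocd statefulset/argocd-application-controller | grep -i "permission denied"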

# Verify deploy key is added to GitHub
# Visit: https://github.com/fortedigital/sturdy-adventure/settings/keys

Issue: "Host key verification failed"

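# Add the host key to ArgoCD's known hosts (stored in the
# argocd-ssh-known-hosts-cm ConfigMap; requires the argocd CLI and a login).
# Redirecting kubectl exec output to ~/.ssh/known_hosts would only change
# the operator's local file, not the pod.
ssh-keyscan github.com | argocd cert add-ssh --batch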

# Or disable strict host key checking (less secure)
kubectl patch secret repo-sturdy-adventure -n argocd \
  --type merge \
  -p '{"stringData":{"insecure":"true"}}'

Issue: Repository shows as "Unknown" status

# Check repository server logs
kubectl logs -n argocd deployment/argocd-repo-server

# Refresh repository connection
kubectl delete secret repo-sturdy-adventure -n argocd
# Recreate secret (see Step 3 above)

# Restart ArgoCD components
kubectl rollout restart deployment argocd-repo-server -n argocd
kubectl rollout restart statefulset argocd-application-controller -n argocd

Multiple Repository Setup

For the three-repository pattern (sturdy-adventure, forte-helm, helm-values):

# 1. sturdy-adventure (main config repo)
ssh-keygen -t ed25519 -C "argocd-sturdy-adventure" -f key-sturdy -N ""
# Add key-sturdy.pub to: https://github.com/fortedigital/sturdy-adventure/settings/keys

# 2. helm-values (private values repo)
ssh-keygen -t ed25519 -C "argocd-helm-values" -f key-helm-values -N ""
# Add key-helm-values.pub to: https://github.com/fortedigital/helm-values/settings/keys

# 3. forte-helm (private helm charts repo)

# Create secrets (ArgoCD needs type and url alongside the private key)
kubectl create secret generic repo-sturdy-adventure \
  --from-file=sshPrivateKey=key-sturdy \
  --from-literal=type=git \
  --from-literal=url=git@github.com:fortedigital/sturdy-adventure.git \
  --namespace=argocd --dry-run=client -o yaml | \
  kubectl label --local -f - argocd.argoproj.io/secret-type=repository --dry-run=client -o yaml | \
  kubectl apply -f -

kubectl create secret generic repo-helm-values \
  --from-file=sshPrivateKey=key-helm-values \
  --from-literal=type=git \
  --from-literal=url=git@github.com:fortedigital/helm-values.git \
  --namespace=argocd --dry-run=client -o yaml | \
  kubectl label --local -f - argocd.argoproj.io/secret-type=repository --dry-run=client -o yaml | \
  kubectl apply -f -

# Clean up keys
shred -u key-sturdy key-helm-values

Converting HTTPS to SSH

If you're currently using HTTPS and want to switch to SSH:

# 1. Generate and add deploy key (see steps above)

# 2. Update all Application manifests
# Change from:
#   repoURL: https://github.com/fortedigital/sturdy-adventure.git
# To:
#   repoURL: git@github.com:fortedigital/sturdy-adventure.git

# 3. Update and commit
find . -name "*.yaml" -type f -exec sed -i 's|https://github.com/fortedigital/|git@github.com:fortedigital/|g' {} +

git add .
git commit -m "Switch from HTTPS to SSH for repository access"
git push

# 4. ArgoCD will automatically re-sync with new SSH URLs
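Because the `sed -i` in step 3 rewrites files in place, it can help to preview the change first. The following is a minimal sketch, not part of the existing tooling: the helper names `to_ssh_url` and `preview_ssh_switch` are hypothetical, and the substitution simply mirrors the one used in step 3.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite GitHub HTTPS URLs for the fortedigital
# organisation to SSH form. Reads stdin, writes stdout, modifies nothing.
to_ssh_url() {
  sed 's|https://github.com/fortedigital/|git@github.com:fortedigital/|g'
}

# Hypothetical helper: show a diff for every YAML file under a directory
# (default: current directory) that the substitution would change.
preview_ssh_switch() {
  local dir="${1:-.}" file
  find "$dir" -name '*.yaml' -type f | while IFS= read -r file; do
    if grep -q 'https://github.com/fortedigital/' "$file"; then
      echo "== $file"
      # diff exits 1 when the inputs differ, which is expected here.
      to_ssh_url < "$file" | diff "$file" - || true
    fi
  done
}
```

Running `preview_ssh_switch .` from the repository root lists the affected files without touching them. URLs outside the fortedigital organisation, such as public Helm chart repositories, are left unchanged.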

Day-to-Day Operations

Monitoring ArgoCD Sync Status

Via Slack

All applications send notifications to a shared Slack channel:

  • on-sync-succeeded - Deployment succeeded
  • on-sync-failed - Deployment failed
  • ⚠️ on-degraded - Application unhealthy

Via CLI

# List all applications
kubectl get applications -n argocd

# Watch application status
kubectl get applications -n argocd -w

# Get detailed status
kubectl describe application myapp -n argocd

Via ArgoCD UI

# Port forward to UI
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Access: https://localhost:8080
# No login required (insecure mode for internal use)

Checking Application Health

# Quick health check for all apps
kubectl get applications -n argocd \
  -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status

# Expected output:
# NAME                  SYNC        HEALTH
# infrastructure-apps   Synced      Healthy
# enterprise-apps       Synced      Healthy
# mcp10x                Synced      Healthy
# musicman              Synced      Healthy

Manual Sync

Force sync an application:

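# Trigger a hard refresh (re-reads Git and bypasses the manifest cache);
# with auto-sync enabled, an OutOfSync application then syncs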
kubectl patch application myapp -n argocd \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'

# Or via ArgoCD CLI (if installed)
argocd app sync myapp

Pausing Auto-Sync

Temporarily disable automatic syncing:

# Edit application
kubectl edit application myapp -n argocd

# Set automated to null
spec:
  syncPolicy:
    automated: null  # Disable auto-sync

# Re-enable later
spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
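The same change can be made without an interactive editor. A minimal sketch, assuming the Application lives in the argocd namespace as elsewhere in this runbook; the function names `pause_autosync` and `resume_autosync` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers wrapping `kubectl patch` for an ArgoCD Application.

# Disable auto-sync: in a JSON merge patch, null removes the "automated" key.
pause_autosync() {
  kubectl patch application "$1" -n argocd --type merge \
    -p '{"spec":{"syncPolicy":{"automated":null}}}'
}

# Re-enable auto-sync with the prune/selfHeal settings shown above.
resume_autosync() {
  kubectl patch application "$1" -n argocd --type merge \
    -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
}
```

Usage: `pause_autosync myapp`, then `resume_autosync myapp` once the work is done. If a parent App-of-Apps with selfHeal manages the Application, the parent may restore the policy from Git, in which case the pause is better made in Git.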

Application Management

Deploying a New Application

See Developer Guide for detailed steps.

Quick checklist:

  • Create helm-values/myapp/values.yaml
  • Create apps/myapp.yaml in config repo
  • Create SealedSecret if needed
  • Commit and push changes
  • Verify sync in Slack/ArgoCD
  • Configure DNS for domain
  • Test application accessibility

Removing an Application

Safe Removal Procedure

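# 1. Delete ArgoCD Application (with cascade)
kubectl delete application myapp -n argocd

# This will:
# - Remove application from ArgoCD
# - Delete all Kubernetes resources (cascade), provided the Application has
#   the resources-finalizer.argocd.argoproj.io finalizer
# - Remove the namespace only if it is one of the managed resources
#   (namespaces created via CreateNamespace=true are left in place)
# If a parent App-of-Apps manages this Application, it is recreated until
# step 2 removes it from Git.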

# 2. Clean up Git repositories
cd ~/dev/k8s/launchpad
git rm apps/myapp.yaml
git commit -m "Remove myapp application"
git push

cd ~/dev/k8s/helm-prod-values
git rm -r myapp/
git commit -m "Remove myapp values"
git push

# 3. Remove sealed secrets (if any)
cd ~/dev/k8s/launchpad
git rm secrets/myapp-credentials-sealed.yaml
git commit -m "Remove myapp secrets"
git push

Removal Without Cascade

To remove from ArgoCD but keep resources running:

# Delete application with no cascade
kubectl patch application myapp -n argocd \
  -p '{"metadata":{"finalizers":[]}}' --type merge
kubectl delete application myapp -n argocd

# Resources remain in cluster but are no longer managed

Scaling Applications

Manual Scaling

# Scale deployment directly
kubectl scale deployment myapp -n myapp --replicas=3

# Note: If selfHeal is enabled, this will be reverted

GitOps Scaling

Update helm-values/myapp/values.yaml:

app:
  replicaCount: 3  # Change from 1 to 3

Commit and push - ArgoCD will sync.

Auto-Scaling (HPA)

Enable Horizontal Pod Autoscaler:

# In helm-values/myapp/values.yaml
app:
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

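Note: With an HPA, the autoscaler owns the replica count. Keep /spec/replicas in the ArgoCD ignore list so that selfHeal does not revert the HPA's scaling decisions:

# In apps/myapp.yaml
ignoreDifferences:
- group: apps
  kind: Deployment
  jsonPointers:
  - /spec/replicas  # Keep this entry while HPA is enabled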

Rolling Back Deployments

Option 1: Git Revert

# Find the commit before the bad change
cd ~/dev/k8s/helm-prod-values
git log --oneline myapp/values.yaml

# Revert to previous version
git revert <commit-hash>
git push

# ArgoCD will sync the rollback

Option 2: Manual Rollback

# Rollback to previous revision
kubectl rollout undo deployment myapp -n myapp

# Note: This will be reverted by ArgoCD selfHeal
# Make permanent by updating Git

Option 3: Change Image Tag

# Edit helm-values
cd ~/dev/k8s/helm-prod-values
vim myapp/values.yaml

# Change image tag to previous version
app:
  image:
    tag: v1.0.0  # Roll back from v1.0.1

# Commit and push
git add myapp/values.yaml
git commit -m "Rollback myapp to v1.0.0"
git push

Resource Updates

Update Resource Limits

# In helm-values/myapp/values.yaml
app:
  resources:
    requests:
      cpu: 200m      # Increased from 100m
      memory: 512Mi  # Increased from 256Mi
    limits:
      cpu: 1000m
      memory: 2Gi

Enable Database

# In helm-values/myapp/values.yaml
db:
  enabled: true
  persistence:
    size: 10Gi  # Increase storage

Secret Management

Creating Secrets

Step 1: Get Public Certificate

# Fetch sealed-secrets public cert (one-time)
kubeseal --fetch-cert \
  --controller-name=sealed-secrets-controller \
  --controller-namespace=kube-system \
  > pub-cert.pem

# Save this certificate for future use

Step 2: Create Plain Secret

# Method 1: From literal values
kubectl create secret generic myapp-credentials \
  --from-literal=API_KEY=secret123 \
  --from-literal=DB_PASSWORD=pass456 \
  --namespace=myapp \
  --dry-run=client -o yaml > private/myapp-credentials.yaml

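# Method 2: From file (stores the whole file under the single key ".env";
# use --from-env-file=.env instead to create one key per KEY=VALUE line)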
kubectl create secret generic myapp-credentials \
  --from-file=.env \
  --namespace=myapp \
  --dry-run=client -o yaml > private/myapp-credentials.yaml

# Method 3: From multiple files
kubectl create secret generic myapp-credentials \
  --from-file=api-key.txt \
  --from-file=db-password.txt \
  --namespace=myapp \
  --dry-run=client -o yaml > private/myapp-credentials.yaml

Step 3: Seal Secret

kubeseal --format=yaml \
  --cert=pub-cert.pem \
  --namespace=myapp \
  < private/myapp-credentials.yaml \
  > secrets/myapp-credentials-sealed.yaml

Step 4: Commit Sealed Secret

git add secrets/myapp-credentials-sealed.yaml
git commit -m "Add myapp credentials"
git push

# Delete plain secret
rm private/myapp-credentials.yaml
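Before the commit in Step 4, a quick guard can confirm that only sealed manifests are being added. This is a minimal sketch that checks the top-level `kind:` line of each file; `assert_sealed` is a hypothetical helper and is not a substitute for reviewing the diff.

```shell
#!/usr/bin/env bash
# Hypothetical pre-commit guard: succeed only if every given file declares
# kind: SealedSecret and none declares a plain kind: Secret at top level.
assert_sealed() {
  local file
  for file in "$@"; do
    if grep -Eq '^kind:[[:space:]]*Secret[[:space:]]*$' "$file"; then
      echo "plain Secret found: $file" >&2
      return 1
    fi
    if ! grep -Eq '^kind:[[:space:]]*SealedSecret[[:space:]]*$' "$file"; then
      echo "no SealedSecret found: $file" >&2
      return 1
    fi
  done
}
```

Usage: `assert_sealed secrets/myapp-credentials-sealed.yaml && git add secrets/myapp-credentials-sealed.yaml`.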

Updating Secrets

# 1. Create new version
kubectl create secret generic myapp-credentials \
  --from-literal=API_KEY=new-secret-key \
  --from-literal=DB_PASSWORD=new-password \
  --namespace=myapp \
  --dry-run=client -o yaml > private/myapp-credentials.yaml

# 2. Seal it
kubeseal --format=yaml \
  --cert=pub-cert.pem \
  --namespace=myapp \
  < private/myapp-credentials.yaml \
  > secrets/myapp-credentials-sealed.yaml

# 3. Commit
git add secrets/myapp-credentials-sealed.yaml
git commit -m "Update myapp credentials"
git push

# 4. Restart pods to pick up new secret
kubectl rollout restart deployment myapp -n myapp

# 5. Delete plain secret
rm private/myapp-credentials.yaml

Viewing Secrets (Unsealed)

# List secrets in namespace
kubectl get secrets -n myapp

# Describe secret (doesn't show values)
kubectl describe secret myapp-credentials -n myapp

# View secret values (base64 encoded)
kubectl get secret myapp-credentials -n myapp -o yaml

# Decode secret value
kubectl get secret myapp-credentials -n myapp \
  -o jsonpath='{.data.API_KEY}' | base64 -d

Secret Cloning (Kyverno)

Secrets labeled allowedToBeCloned: "true" in the secrets namespace are automatically cloned to new namespaces.

# Example: secrets-namespace.yaml
apiVersion: v1
kind: Secret
metadata:
  name: shared-credentials
  namespace: secrets
  labels:
    allowedToBeCloned: "true"
type: Opaque
data:
  API_KEY: <base64-encoded-value>

When a new namespace is created, Kyverno automatically copies this secret.

Authentication Secrets

Applications using the authentication sidecar require specific secrets depending on the auth mode.

Token Mode Secrets

Token-based auth uses an auth-tokens Secret:

# Method 1: From Helm values (automatic)
# Tokens specified in values.yaml are automatically created

# Method 2: Manual creation
kubectl create secret generic auth-tokens \
  --from-literal=tokens="token1
token2
token3" \
  --namespace=myapp

# Method 3: From file
echo "d4f88f6d9292c10cc3e21c4aad56d2be485db532b54fe961d738e1137d247823" > tokens.txt
echo "8803f621acc3898df1d7a8f514bc3602551a0681a8f747bd4e43c3c5849d57a7" >> tokens.txt
kubectl create secret generic auth-tokens \
  --from-file=tokens=tokens.txt \
  --namespace=myapp
rm tokens.txt

OIDC Mode Secrets

OIDC auth requires an auth-oidc Secret with two keys:

# Generate secrets
CLIENT_SECRET="your-oidc-client-secret-from-provider"
COOKIE_SECRET=$(openssl rand -hex 32)

# Create plain secret
kubectl create secret generic auth-oidc \
  --from-literal=client-secret="$CLIENT_SECRET" \
  --from-literal=cookie-secret="$COOKIE_SECRET" \
  --namespace=myapp \
  --dry-run=client -o yaml > private/myapp-auth-oidc.yaml

# Seal it
kubeseal --format=yaml \
  --cert=pub-cert.pem \
  --namespace=myapp \
  < private/myapp-auth-oidc.yaml \
  > secrets/myapp-auth-oidc-sealed.yaml

# Apply sealed secret
kubectl apply -f secrets/myapp-auth-oidc-sealed.yaml

# Commit to Git
git add secrets/myapp-auth-oidc-sealed.yaml
git commit -m "Add OIDC secrets for myapp"
git push

# Clean up
rm private/myapp-auth-oidc.yaml

Rotating Authentication Secrets

Token Rotation:

# Generate new token
NEW_TOKEN=$(openssl rand -hex 32)

# Get current tokens
kubectl get secret auth-tokens -n myapp -o yaml > /tmp/tokens.yaml

# Edit tokens (add new, optionally remove old)
# Then re-seal and apply

# Restart pods to use new tokens
kubectl rollout restart deployment myapp -n myapp
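The "edit tokens" step above could be sketched as follows, assuming the `tokens` key holds one token per line as in the manual creation methods. The helper name `merge_tokens` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the current token list (one per line) on stdin,
# append the new token given as $1, and drop blank lines and duplicates
# while keeping the original order.
merge_tokens() {
  { cat; printf '\n%s\n' "$1"; } | awk 'NF && !seen[$0]++'
}

# Usage against a live cluster (not run here):
#   NEW_TOKEN=$(openssl rand -hex 32)
#   kubectl get secret auth-tokens -n myapp -o jsonpath='{.data.tokens}' \
#     | base64 -d | merge_tokens "$NEW_TOKEN" > tokens.txt
#   # then recreate the Secret from tokens.txt (Method 3), seal, and commit
```

Keeping the old tokens in the list during rotation lets existing clients continue to work until they have switched to the new token.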

OIDC Secret Rotation:

# Rotate cookie secret (safe - invalidates existing sessions)
NEW_COOKIE_SECRET=$(openssl rand -hex 32)

# Recreate secret
kubectl create secret generic auth-oidc \
  --from-literal=client-secret="$CLIENT_SECRET" \
  --from-literal=cookie-secret="$NEW_COOKIE_SECRET" \
  --namespace=myapp \
  --dry-run=client -o yaml | \
  kubeseal --format=yaml --cert=pub-cert.pem --namespace=myapp | \
  kubectl apply -f -

# Restart to pick up new secret
kubectl rollout restart deployment myapp -n myapp

Viewing Authentication Secrets

# List auth-related secrets
kubectl get secrets -n myapp | grep auth

# View token secret (stored base64-encoded in the Secret; decoded here)
kubectl get secret auth-tokens -n myapp -o jsonpath='{.data.tokens}' | base64 -d

# View OIDC secret keys (values are base64 encoded)
kubectl get secret auth-oidc -n myapp -o jsonpath='{.data.client-secret}' | base64 -d
kubectl get secret auth-oidc -n myapp -o jsonpath='{.data.cookie-secret}' | base64 -d

See: Developer Guide - Enabling Authentication for complete authentication setup guide.


Monitoring & Alerting

Prometheus Metrics

# Port forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-server 9090:80

# Access: http://localhost:9090

Common Queries:

# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Memory usage per pod
sum(container_memory_usage_bytes) by (pod)

# Request rate per service
rate(http_requests_total[5m])

Grafana Dashboards

# Port forward to Grafana
kubectl port-forward -n monitoring svc/grafana 3000:80

# Access: http://localhost:3000

Loki Logs

# Port forward to Loki
kubectl port-forward -n monitoring svc/loki 3100:3100

# Query logs
curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="myapp"}' \
  --data-urlencode 'since=1h' | jq
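When an explicit time window is needed, Loki's `start` and `end` parameters accept Unix timestamps in nanoseconds. A minimal sketch using only `date +%s` and shell arithmetic; `ns_ago` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a Unix timestamp in nanoseconds for
# "N seconds ago" (default 3600).
ns_ago() {
  local seconds="${1:-3600}"
  echo "$(( ($(date +%s) - seconds) * 1000000000 ))"
}

# Usage (not run here):
#   curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
#     --data-urlencode 'query={namespace="myapp"}' \
#     --data-urlencode "start=$(ns_ago 3600)" \
#     --data-urlencode "end=$(ns_ago 0)" | jq
```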

Tempo Traces

# Port forward to Tempo query API
kubectl port-forward -n monitoring svc/tempo 3200:3200

# Access: http://localhost:3200

Query traces via Grafana:

  1. Open Grafana → Explore
  2. Select Tempo datasource
  3. Use TraceQL or search by service name

Verify Traefik is sending traces:

# Check Traefik logs for OTLP export errors
kubectl logs -n traefik-system -l app.kubernetes.io/name=traefik | grep -i "traces export"

# Check Tempo is receiving data
kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep "receiver"

Trace-to-log correlation:

  • Click a trace span in Grafana → linked Loki logs appear (by namespace, pod, container)
  • Trace-to-metrics links to Prometheus by service name

Fluent-Bit Log Shipping

Verify Fluent-Bit is shipping logs:

# Check Fluent-Bit pods
kubectl get pods -n monitoring | grep fluent-bit

# Check logs
kubectl logs -n monitoring daemonset/fluent-bit

# Verify Loki is receiving logs
kubectl logs -n monitoring deployment/loki | grep "POST /loki/api/v1/push"

Trivy Vulnerability Scanning

# Check Trivy scan results
kubectl get vulnerabilityreports --all-namespaces

# View report for specific pod
kubectl describe vulnerabilityreport -n myapp <report-name>

Slack Notifications

All applications have Slack notifications enabled:

metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: ""
    notifications.argoproj.io/subscribe.on-sync-failed.slack: ""
    notifications.argoproj.io/subscribe.on-degraded.slack: ""

Test Notification:

# Trigger a hard refresh to test (a notification is sent only if the refresh leads to a sync)
kubectl patch application myapp -n argocd \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'

Troubleshooting

Application Won't Sync

Check Application Status

kubectl describe application myapp -n argocd

Look for errors in:

  • Status.Conditions
  • Status.OperationState

Common Issues

Issue 1: Image Pull Error

# Error: ErrImagePull, ImagePullBackOff

# Check if image exists
docker pull ghcr.io/fortedigital/myapp:v1.0.0

# Check image pull secrets
kubectl get secrets -n myapp | grep regcred

# Check pod events
kubectl describe pod -n myapp <pod-name>

Issue 2: Invalid YAML

# Error: unable to decode manifest

# Validate YAML locally
kubectl apply --dry-run=client -f apps/myapp.yaml

# Check ArgoCD application controller logs (a StatefulSet in current ArgoCD releases)
kubectl logs -n argocd statefulset/argocd-application-controller | grep myapp

Issue 3: Resource Quota Exceeded

# Error: exceeded quota

# Check namespace quotas
kubectl get resourcequota -n myapp
kubectl describe resourcequota -n myapp

# Increase quota or reduce resource requests

Pod Crashes

CrashLoopBackOff

# Check pod status
kubectl get pods -n myapp

# View logs
kubectl logs -n myapp <pod-name>
kubectl logs -n myapp <pod-name> --previous  # Previous container

# Check events
kubectl describe pod -n myapp <pod-name>

Common Causes:

  • Application error (check logs)
  • Missing environment variables
  • Wrong port configuration
  • Missing secrets
  • Insufficient memory/CPU

ImagePullBackOff

# Check image name
kubectl get deployment myapp -n myapp -o yaml | grep image

# Verify credentials
kubectl get secret -n myapp

Pending

# Check why pod is pending
kubectl describe pod -n myapp <pod-name>

# Common reasons:
# - Insufficient resources on nodes
# - PVC not bound
# - Node selector doesn't match

Ingress / TLS Issues

Application Not Accessible

# Check IngressRoute
kubectl get ingressroute -n myapp
kubectl describe ingressroute myapp -n myapp

# Check Traefik
kubectl get pods -n traefik
kubectl logs -n traefik deployment/traefik

# Test with port-forward
kubectl port-forward -n myapp service/myapp 8080:3000
curl http://localhost:8080

Certificate Issues

# Check certificates
kubectl get certificate -n myapp
kubectl describe certificate myapp-tls -n myapp

# Check cert-manager
kubectl get clusterissuer
kubectl logs -n cert-manager deployment/cert-manager

# Check Let's Encrypt challenges
kubectl get challenges --all-namespaces

Manual Certificate Renewal:

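# Delete the Certificate; it is recreated if it is defined in Git (ArgoCD selfHeal)
kubectl delete certificate myapp-tls -n myapp

# cert-manager reissues only if the existing TLS Secret is missing or invalid;
# to force reissuance without deleting anything: cmctl renew myapp-tls -n myapp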

Database Issues

PostgreSQL Won't Start

# Check StatefulSet
kubectl get statefulset -n myapp
kubectl describe statefulset postgres -n myapp

# Check PVC
kubectl get pvc -n myapp
kubectl describe pvc -n myapp

# Check logs
kubectl logs -n myapp postgres-0

Data Persistence

# Verify PVC is bound
kubectl get pvc -n myapp

# Check storage class
kubectl get storageclass

# Resize PVC (if supported)
kubectl edit pvc postgres-data-postgres-0 -n myapp
# Change: storage: 10Gi (from 5Gi)

Kyverno Policy Issues

Policy Violations

# List policies
kubectl get clusterpolicy

# Check policy reports
kubectl get policyreport --all-namespaces

# View specific policy
kubectl describe clusterpolicy secret-cloner

Secret Not Cloned

# Check if secret has label
kubectl get secret -n secrets --show-labels

# Check Kyverno logs
kubectl logs -n kyverno deployment/kyverno

# Manually trigger by recreating namespace
kubectl delete ns test-ns
kubectl create ns test-ns

ArgoCD Issues

ArgoCD UI Not Accessible

# Check ArgoCD pods
kubectl get pods -n argocd

# Restart ArgoCD server
kubectl rollout restart deployment argocd-server -n argocd

# Port forward
kubectl port-forward svc/argocd-server -n argocd 8080:443

Sync Takes Too Long

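# Check application controller logs (a StatefulSet in current ArgoCD releases)
kubectl logs -n argocd statefulset/argocd-application-controller

# Allow longer retry backoff (in apps/myapp.yaml); this caps the wait
# between sync retries, it is not a timeout for a single sync
spec:
  syncPolicy:
    retry:
      backoff:
        maxDuration: 5m  # Increase from 3m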

Disaster Recovery

Backup Strategy

Current State: No automated backups

What Needs Backup:

  • Cluster state (not backed up - recreate via GitOps)
  • Persistent volumes (currently not critical)
  • Git repositories (GitHub provides backup)
  • ⚠️ Secrets (sealed secrets in Git, unseal keys need safekeeping)

Cluster Rebuild

Scenario: Complete cluster failure

# 1. Provision new Kubernetes cluster

# 2. Configure kubectl
kubectl config use-context new-cluster
kubectl cluster-info

# 3. Bootstrap cluster
cd ~/dev/k8s/launchpad
./bootstrap.sh

# 4. Wait for ArgoCD to sync all applications
kubectl get applications -n argocd -w

# 5. Recreate any unsealed secrets (from password manager)
# 6. Configure DNS for new cluster IPs
# 7. Verify all applications are healthy

Time Estimate: 30-60 minutes

Data Loss:

  • Ephemeral data: Lost
  • Database data: Lost (no backups currently)
  • Configuration: No loss (in Git)

Future Backup Plan

Recommended:

  1. Velero for cluster backups

    helm install velero vmware-tanzu/velero \
      --namespace velero \
      --create-namespace \
      --set configuration.provider=aws \
      --set configuration.backupStorageLocation[0].bucket=cluster-backups
    
  2. PostgreSQL backups via CronJob

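    # pg-backup-cronjob.yaml
    # Sketch only: name and namespace are placeholders, and the env vars,
    # credentials, and /backup volume still need to be defined.
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: pg-backup
      namespace: myapp
    spec:
      schedule: "0 2 * * *"  # Daily at 2am
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: OnFailure  # Required for Job pods
              containers:
              - name: pg-dump
                image: postgres:16-alpine
                command:
                - /bin/sh
                - -c
                - pg_dump -U $DB_USER -d $DB_NAME > /backup/dump-$(date +%Y%m%d).sql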
    
  3. Sealed Secrets private key backup

    # Backup sealed-secrets controller private key(s); the Secret names are
    # generated, so select them by label
    kubectl get secret -n kube-system \
      -l sealedsecrets.bitnami.com/sealed-secrets-key \
      -o yaml > sealed-secrets-key-backup.yaml
    
    # Store in secure location (password manager, vault)
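Restoring that backup during a cluster rebuild could look like the sketch below. It assumes the controller Deployment is named sealed-secrets-controller in kube-system, matching the `kubeseal --fetch-cert` command in this runbook; `restore_sealing_key` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restore a backed-up Sealed Secrets key into a fresh
# cluster, then restart the controller so it loads the key.
restore_sealing_key() {
  local backup="${1:-sealed-secrets-key-backup.yaml}"
  [ -f "$backup" ] || { echo "backup not found: $backup" >&2; return 1; }
  kubectl apply -f "$backup" &&
    kubectl rollout restart deployment sealed-secrets-controller -n kube-system
}
```

Without the original key, the new controller generates a fresh key pair and cannot decrypt the SealedSecrets already committed to Git, so every secret would have to be re-sealed.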
    

Maintenance Procedures

Upgrading ArgoCD

# Check current version
kubectl get deployment argocd-server -n argocd \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Update version in values
vim infra/values/argocd-values.yaml

# Or upgrade via Helm directly
helm upgrade argocd argo-cd \
  --repo https://argoproj.github.io/argo-helm \
  --namespace argocd \
  --values infra/values/argocd-values.yaml \
  --version 6.0.0  # New version

# Verify
kubectl get pods -n argocd

Upgrading Kubernetes Version

# UpCloud: Upgrade via control panel or CLI

# After upgrade, verify cluster
kubectl version
kubectl get nodes

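# List served resources and their API versions (this does not flag
# deprecations; compare against the apiVersion fields used in Git)
kubectl api-resources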

# Update any deprecated resources in Git
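To find manifests in Git that still use removed API versions, a simple scan can help before the upgrade. A minimal sketch: the version list covers some groups removed up to Kubernetes 1.26 and is illustrative, not exhaustive; `find_removed_apis` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print file:line matches for YAML manifests under a
# directory (default: current directory) that use API versions already
# removed from Kubernetes. The list is illustrative, not exhaustive.
find_removed_apis() {
  local dir="${1:-.}"
  find "$dir" -type f \( -name '*.yaml' -o -name '*.yml' \) -exec \
    grep -nHE '^[[:space:]]*apiVersion:[[:space:]]*(extensions/v1beta1|batch/v1beta1|policy/v1beta1|autoscaling/v2beta1|autoscaling/v2beta2)[[:space:]]*$' {} +
}
```

Usage: `find_removed_apis ~/dev/k8s/launchpad`. Helm templates rendered at sync time are not covered by a scan of the values repository, so chart versions need a separate check.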

Rotating TLS Certificates

Let's Encrypt certificates auto-renew, but if manual rotation is needed:

# Delete certificate to force renewal
kubectl delete certificate myapp-tls -n myapp

# The Certificate is recreated if it is defined in Git (ArgoCD selfHeal);
# cert-manager then issues it again
kubectl get certificate -n myapp -w

Cleaning Up Old Resources

# List all namespaces
kubectl get namespaces

# Remove unused namespaces
kubectl delete namespace old-app

# Clean up ArgoCD applications
kubectl get applications -n argocd
kubectl delete application old-app -n argocd

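# Clean up old container images (on nodes)
# SSH to nodes and run (Docker runtime only):
docker image prune -a --filter "until=720h"  # 30 days
# On containerd nodes (dockershim was removed in Kubernetes 1.24), use instead:
# crictl rmi --prune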

DNS Management

Adding New Subdomain:

  1. Add DNS A record pointing to Traefik LoadBalancer IP

    # Get LoadBalancer IP
    kubectl get svc -n traefik traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
    
  2. Add to DNS provider:

    myapp.forteapps.net  A  <LoadBalancer-IP>
    
  3. Verify DNS propagation:

    nslookup myapp.forteapps.net
    dig myapp.forteapps.net
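Step 3 can be turned into a pass/fail check by comparing the resolved address with the LoadBalancer IP from step 1. A minimal sketch, assuming `dig` is installed and the LoadBalancer exposes an IP rather than a hostname; `dns_points_to` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if any A record returned for the host equals
# the expected IP exactly.
dns_points_to() {
  local host="$1" expected="$2"
  dig +short A "$host" | grep -Fxq "$expected"
}

# Usage against a live cluster (not run here):
#   LB_IP=$(kubectl get svc -n traefik traefik \
#     -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
#   dns_points_to myapp.forteapps.net "$LB_IP" && echo "DNS OK"
```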
    

Monitoring Resource Usage

# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods --all-namespaces

# Identify resource hogs
kubectl top pods --all-namespaces --sort-by=memory
kubectl top pods --all-namespaces --sort-by=cpu

Advanced Operations

Adding a New Infrastructure Component

Example: Adding Redis

# 1. Create application manifest
cat > infra/redis-application.yaml <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: redis
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  project: default
  source:
    repoURL: https://charts.bitnami.com/bitnami
    chart: redis
    targetRevision: 18.0.0
    helm:
      values: |
        auth:
          enabled: true
          password: changeme
  destination:
    server: https://kubernetes.default.svc
    namespace: redis
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
EOF

# 2. Commit and push
git add infra/redis-application.yaml
git commit -m "Add Redis infrastructure component"
git push

# 3. ArgoCD will auto-sync within 60 seconds

Multi-Cluster Setup (Future)

For multi-cluster deployments:

# Different destinations per environment
# dev-cluster
destination:
  server: https://dev.k8s.example.com
  namespace: myapp

# prod-cluster
destination:
  server: https://prod.k8s.example.com
  namespace: myapp

Blue-Green Deployments

# Deploy blue version
helm install myapp-blue forteapp \
  --set app.image.tag=v1.0.0

# Deploy green version
helm install myapp-green forteapp \
  --set app.image.tag=v2.0.0

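# Switch traffic via IngressRoute (a JSON patch changes only the service
# name; a merge patch would replace the whole routes list and drop match/kind)
kubectl patch ingressroute myapp -n myapp --type json \
  -p '[{"op":"replace","path":"/spec/routes/0/services/0/name","value":"myapp-green"}]'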

# Remove blue deployment after validation
helm uninstall myapp-blue

Emergency Procedures

Emergency Rollback

# Immediate rollback
kubectl rollout undo deployment myapp -n myapp

# Update Git to make permanent
cd ~/dev/k8s/helm-prod-values
git revert HEAD
git push

Emergency Scale Down

# Scale to zero (maintenance mode)
kubectl scale deployment myapp -n myapp --replicas=0

# Update Git
vim helm-values/myapp/values.yaml
# Set replicaCount: 0
git commit -am "Scale down myapp for maintenance"
git push

Emergency Application Removal

# Remove application but keep data
kubectl patch application myapp -n argocd \
  -p '{"metadata":{"finalizers":[]}}' --type merge
kubectl delete application myapp -n argocd

# Resources remain in cluster

Useful Scripts

Sync All Applications

#!/bin/bash
# sync-all.sh
# Hard-refreshes every application; auto-sync then applies any drift found.
for app in $(kubectl get applications -n argocd -o name); do
  kubectl patch "$app" -n argocd \
    --type merge \
    -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
done

Check All Applications Health

#!/bin/bash
# health-check.sh
kubectl get applications -n argocd \
  -o custom-columns=\
NAME:.metadata.name,\
SYNC:.status.sync.status,\
HEALTH:.status.health.status,\
MESSAGE:.status.health.message

Seal Secret Helper

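#!/bin/bash
# seal-secret.sh
# Usage: seal-secret.sh [namespace] [plain-secret-file] [sealed-output-file]
set -euo pipefail

NAMESPACE=${1:-default}
SECRET_FILE=${2:-private/secret.yaml}
OUTPUT_FILE=${3:-secrets/secret-sealed.yaml}

# Quote variables so paths with spaces do not break the redirections
kubeseal --format=yaml \
  --cert=pub-cert.pem \
  --namespace="$NAMESPACE" \
  < "$SECRET_FILE" \
  > "$OUTPUT_FILE"

echo "Sealed secret created: $OUTPUT_FILE"
echo "Remember to delete: $SECRET_FILE"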

Checklist Templates

New Application Deployment Checklist

  • Application code repository created
  • Dockerfile created and tested
  • GitHub Actions workflow configured
  • Helm values created in helm-prod-values/
  • ArgoCD application manifest created in apps/
  • Secrets created and sealed
  • DNS record added for domain
  • Application synced successfully
  • Health check passed
  • Slack notification received
  • Application accessible via domain
  • Monitoring configured
  • Documentation updated

Incident Response Checklist

  • Incident identified (Slack alert, monitoring)
  • Severity assessed
  • Incident channel created
  • Initial investigation (logs, metrics, events)
  • Root cause identified
  • Mitigation applied
  • Verification of fix
  • Post-mortem scheduled
  • Documentation updated

Last Updated: 2026-03-16
Maintained By: Platform Team
Emergency Contact: #platform-support on Slack