Operations Runbook
Table of Contents
- Overview
- Cluster Bootstrap
- Day-to-Day Operations
- Application Management
- Secret Management
- Monitoring & Alerting
- Troubleshooting
- Disaster Recovery
- Maintenance Procedures
Overview
This runbook provides operational procedures for maintaining the Kubernetes cluster and managing applications. It's intended for platform engineers and operators with full cluster access.
Operator Prerequisites
- ✅ Full kubectl access to cluster
- ✅ Write access to all Git repositories
- ✅ ArgoCD UI access
- ✅ Slack notifications configured
- ✅ Understanding of Kubernetes concepts
Cluster Bootstrap
Initial Cluster Setup
Bootstrap a new cluster from scratch:
Prerequisites
- Kubernetes cluster running (UpCloud or any K8s cluster)
- kubectl configured with admin access
- Repositories cloned locally
# Verify cluster access
kubectl cluster-info
kubectl get nodes
Bootstrap Procedure
# 1. Clone config repository
git clone https://github.com/fortedigital/sturdy-adventure.git
cd sturdy-adventure
# 2. Set cluster name (optional)
export CLUSTER_NAME="prod-cluster-01"
# 3. Run bootstrap script
./bootstrap.sh
What Happens:
- ✅ Installs ArgoCD via Helm
- ✅ Configures ArgoCD with custom values
- ✅ Applies root App-of-Apps manifest
- ✅ ArgoCD automatically syncs all applications
- ✅ Infrastructure and apps deploy in waves
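The contents of bootstrap.sh are not reproduced in this runbook. As a rough sketch of the steps listed above, and not the actual script, the sequence might look like the following. The function names (run, bootstrap_sketch) and the DRY_RUN switch are hypothetical; the Helm repo URL, the values path infra/values/argocd-values.yaml and the _app-of-apps.yaml manifest are taken from other sections of this runbook and are assumed to apply here.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the bootstrap steps; NOT the real bootstrap.sh.
# With DRY_RUN=1 (the default here) each command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

bootstrap_sketch() {
  # Install ArgoCD via Helm with the custom values file (path is an assumption)
  run helm upgrade --install argocd argo-cd \
    --repo https://argoproj.github.io/argo-helm \
    --namespace argocd --create-namespace \
    --values infra/values/argocd-values.yaml
  # Apply the root App-of-Apps manifest; ArgoCD then syncs the child apps in waves
  run kubectl apply -f _app-of-apps.yaml
}
```

Running `DRY_RUN=1 bootstrap_sketch` prints the two commands without touching the cluster, which is a cheap way to review what a bootstrap would do before running the real script.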
Verify Bootstrap
# Wait for ArgoCD to be ready
kubectl wait --for=condition=available --timeout=300s \
deployment/argocd-server -n argocd
# Check ArgoCD applications
kubectl get applications -n argocd
# Expected output: infrastructure-apps, enterprise-apps, and all child apps
Post-Bootstrap Steps
- Configure DNS for ingress domains:
  - argocd.127.0.0.1.nip.io (local dev)
  - *.forteapps.net (production)
- Verify Let's Encrypt certificates:
  kubectl get certificate --all-namespaces
  kubectl get clusterissuer
- Check Kyverno policies:
  kubectl get clusterpolicy
- Verify monitoring stack:
  kubectl get pods -n monitoring
- Test Slack notifications by triggering a sync
ArgoCD Repository Access Setup
ArgoCD needs SSH access to private Git repositories to pull manifests and Helm values. This section covers setting up deploy keys for GitHub repositories.
Why Deploy Keys?
- Read-only access: Deploy keys provide secure, read-only access to repositories
- No user credentials: No need to share personal SSH keys or tokens
- Repository-specific: Each repository gets its own key for better security
- Revocable: Easy to revoke access without affecting other repositories
Prerequisites
- kubectl access to the cluster
- Write access to the GitHub repository
- ArgoCD installed and running
Setup Procedure
Step 1: Generate SSH Key Pair
Generate a dedicated SSH key for ArgoCD without a passphrase (required for automated access):
# Generate ED25519 key (recommended - smaller and more secure)
ssh-keygen -t ed25519 -C "argocd-deploy-key-sturdy-adventure" -f argocd-deploy-key -N ""
# Or RSA key if ED25519 is not supported
ssh-keygen -t rsa -b 4096 -C "argocd-deploy-key-sturdy-adventure" -f argocd-deploy-key -N ""
This creates two files:
- argocd-deploy-key - Private key (keep secret)
- argocd-deploy-key.pub - Public key (add to GitHub)
Step 2: Add Public Key to GitHub
- Copy the public key:
  cat argocd-deploy-key.pub
- Go to GitHub repository settings:
  - Navigate to: https://github.com/fortedigital/sturdy-adventure/settings/keys
  - Or: Repository → Settings → Deploy keys
- Click "Add deploy key"
  - Title: ArgoCD Production Cluster
  - Key: Paste the public key content
  - ☐ Allow write access (leave unchecked - read-only is sufficient)
  - Click "Add key"
- Repeat for the helm-values repository if it's private:
  # Generate separate key for helm-values repo
  ssh-keygen -t ed25519 -C "argocd-deploy-key-helm-values" -f argocd-helm-values-key -N ""
  # Add to: https://github.com/fortedigital/helm-values/settings/keys
Step 3: Create Kubernetes Secret
Add the private key to ArgoCD as a repository secret:
Save the following manifest as secret.yaml in the gitignored private/ folder:
apiVersion: v1
kind: Secret
metadata:
  name: forte-helm-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: git@github.com:fortedigital/forte-helm.git
  sshPrivateKey: |
    <paste your private key here>
  project: default
Seal the secret with the kubeseal command:
kubeseal --format=yaml \
--namespace=argocd \
< private/secret.yaml \
> secrets/forte-helm-repo-secret-sealed.yaml
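Pasting a multi-line private key into YAML by hand is easy to get wrong, because every key line must be indented under the `sshPrivateKey: |` block. A minimal sketch that renders the manifest from a key file is shown below; the function name render_repo_secret is hypothetical, and the field layout simply mirrors the manifest in this step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an ArgoCD repository Secret manifest.
# Usage: render_repo_secret NAME REPO_URL PRIVATE_KEY_FILE
render_repo_secret() {
  local name="$1" url="$2" key_file="$3"
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: ${name}
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: ${url}
  project: default
  sshPrivateKey: |
$(sed 's/^/    /' "$key_file")
EOF
}
```

For example, `render_repo_secret forte-helm-repo git@github.com:fortedigital/forte-helm.git argocd-deploy-key > private/secret.yaml` produces a file that can be passed to the kubeseal command above.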
Step 4: Register Repository in ArgoCD
Commit and push secrets/forte-helm-repo-secret-sealed.yaml. ArgoCD syncs the SealedSecret, and the sealed-secrets controller unseals it into the repository Secret.
Step 5: Verify Repository Access
# Check if repository is connected
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository
# Verify connection in ArgoCD UI
# Settings → Repositories → Should show "Successful" status
# Test by creating an application
kubectl apply -f _app-of-apps.yaml
# Check application sync status
kubectl get applications -n argocd
Testing Repository Access
Create a test application to verify SSH access:
cat > /tmp/test-repo-access.yaml <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: test-repo-access
  namespace: argocd
spec:
  project: default
  source:
    repoURL: git@github.com:fortedigital/sturdy-adventure.git
    targetRevision: main
    path: cluster-resources
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated: null # Manual sync for testing
EOF
kubectl apply -f /tmp/test-repo-access.yaml
# Check if ArgoCD can access the repository
kubectl describe application test-repo-access -n argocd
# Look for sync status - should show repository contents
kubectl get application test-repo-access -n argocd -o jsonpath='{.status.sync.status}'
# Clean up test application
kubectl delete application test-repo-access -n argocd
rm /tmp/test-repo-access.yaml
Security Best Practices
- Secure Private Keys
  # Store private key securely and delete local copy
  # Option 1: Store in password manager (recommended)
  # Option 2: Backup to encrypted storage
  # Delete local private key after adding to Kubernetes
  shred -u argocd-deploy-key
  # Or on Windows
  # Remove-Item -Path argocd-deploy-key -Force
- Rotate Keys Regularly
  # Generate new key
  ssh-keygen -t ed25519 -C "argocd-deploy-key-$(date +%Y%m)" -f argocd-new-key -N ""
  # Add new public key to GitHub (keep old key for now)
  # Update Kubernetes secret
  kubectl create secret generic repo-sturdy-adventure \
    --from-file=sshPrivateKey=argocd-new-key \
    --namespace=argocd \
    --dry-run=client -o yaml | kubectl apply -f -
  # Test access, then remove old deploy key from GitHub
  # Clean up
  shred -u argocd-new-key
- Audit Repository Access
  # List all repository secrets
  kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository
  # Review deploy keys in GitHub
  # Visit: https://github.com/fortedigital/sturdy-adventure/settings/keys
- Use Different Keys per Repository
  - Don't reuse the same deploy key across repositories
  - If one key is compromised, only one repository is affected
  - Easier to track and audit access
Troubleshooting Repository Access
Issue: "permission denied (publickey)"
# Check if secret exists
kubectl get secret repo-sturdy-adventure -n argocd
# Verify secret has correct label
kubectl get secret repo-sturdy-adventure -n argocd -o yaml | grep argocd.argoproj.io/secret-type
# Check ArgoCD application controller logs
kubectl logs -n argocd deployment/argocd-application-controller | grep -i "permission denied"
# Verify deploy key is added to GitHub
# Visit: https://github.com/fortedigital/sturdy-adventure/settings/keys
Issue: "Host key verification failed"
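# ArgoCD reads SSH known hosts from the argocd-ssh-known-hosts-cm ConfigMap.
# Scan GitHub's host keys, verify them against GitHub's published fingerprints,
# then add them under the ssh_known_hosts key of that ConfigMap
ssh-keyscan github.com
kubectl edit configmap argocd-ssh-known-hosts-cm -n argocd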
# Or disable strict host key checking (less secure)
kubectl patch secret repo-sturdy-adventure -n argocd \
--type merge \
-p '{"stringData":{"insecure":"true"}}'
Issue: Repository shows as "Unknown" status
# Check repository server logs
kubectl logs -n argocd deployment/argocd-repo-server
# Refresh repository connection
kubectl delete secret repo-sturdy-adventure -n argocd
# Recreate secret (see Step 3 above)
# Restart ArgoCD components
kubectl rollout restart deployment argocd-repo-server -n argocd
kubectl rollout restart deployment argocd-application-controller -n argocd
Multiple Repository Setup
For the three-repository pattern (sturdy-adventure, forte-helm, helm-values):
# 1. sturdy-adventure (main config repo)
ssh-keygen -t ed25519 -C "argocd-sturdy-adventure" -f key-sturdy -N ""
# Add key-sturdy.pub to: https://github.com/fortedigital/sturdy-adventure/settings/keys
# 2. helm-values (private values repo)
ssh-keygen -t ed25519 -C "argocd-helm-values" -f key-helm-values -N ""
# Add key-helm-values.pub to: https://github.com/fortedigital/helm-values/settings/keys
# 3. forte-helm (private helm charts repo)
# Create secrets (ArgoCD matches a repository secret to a repo by its url field)
kubectl create secret generic repo-sturdy-adventure \
  --from-literal=type=git \
  --from-literal=url=git@github.com:fortedigital/sturdy-adventure.git \
  --from-file=sshPrivateKey=key-sturdy \
  --namespace=argocd --dry-run=client -o yaml | \
  kubectl label --local -f - argocd.argoproj.io/secret-type=repository --dry-run=client -o yaml | \
  kubectl apply -f -
kubectl create secret generic repo-helm-values \
  --from-literal=type=git \
  --from-literal=url=git@github.com:fortedigital/helm-values.git \
  --from-file=sshPrivateKey=key-helm-values \
  --namespace=argocd --dry-run=client -o yaml | \
  kubectl label --local -f - argocd.argoproj.io/secret-type=repository --dry-run=client -o yaml | \
  kubectl apply -f -
# Clean up keys
shred -u key-sturdy key-helm-values
Converting HTTPS to SSH
If you're currently using HTTPS and want to switch to SSH:
# 1. Generate and add deploy key (see steps above)
# 2. Update all Application manifests
# Change from:
# repoURL: https://github.com/fortedigital/sturdy-adventure.git
# To:
# repoURL: git@github.com:fortedigital/sturdy-adventure.git
# 3. Update and commit
find . -name "*.yaml" -type f -exec sed -i 's|https://github.com/fortedigital/|git@github.com:fortedigital/|g' {} +
git add .
git commit -m "Switch from HTTPS to SSH for repository access"
git push
# 4. ArgoCD will automatically re-sync with new SSH URLs
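The find/sed command in step 3 rewrites every matching URL in place, including any that appear in comments or documentation links. A small sketch for previewing the change first is given below; the function names https_to_ssh and list_https_files are hypothetical, and the sed expression is the same one used in step 3.

```shell
#!/usr/bin/env bash
# Rewrite fortedigital HTTPS clone URLs read from stdin to their SSH form.
https_to_ssh() {
  sed 's|https://github.com/fortedigital/|git@github.com:fortedigital/|g'
}

# List YAML files under DIR (default: current directory) that still contain
# an HTTPS URL, without modifying anything.
list_https_files() {
  grep -rl --include='*.yaml' 'https://github.com/fortedigital/' "${1:-.}"
}
```

Running `list_https_files .` before the in-place rewrite shows which files will change, and `https_to_ssh < apps/myapp.yaml | diff apps/myapp.yaml -` previews the edit for a single file.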
Day-to-Day Operations
Monitoring ArgoCD Sync Status
Via Slack
All applications send notifications to a shared Slack channel:
- ✅ on-sync-succeeded - Deployment succeeded
- ❌ on-sync-failed - Deployment failed
- ⚠️ on-degraded - Application unhealthy
Via CLI
# List all applications
kubectl get applications -n argocd
# Watch application status
kubectl get applications -n argocd -w
# Get detailed status
kubectl describe application myapp -n argocd
Via ArgoCD UI
# Port forward to UI
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Access: https://localhost:8080
# No login required (insecure mode for internal use)
Checking Application Health
# Quick health check for all apps
kubectl get applications -n argocd \
-o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status
# Expected output:
# NAME SYNC HEALTH
# infrastructure-apps Synced Healthy
# enterprise-apps Synced Healthy
# mcp10x Synced Healthy
# musicman Synced Healthy
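With many applications, scanning the table by eye gets tedious. As a minimal sketch, the table can be piped through a filter that prints only the applications needing attention; the function name unhealthy_apps is hypothetical, and it assumes the three-column NAME/SYNC/HEALTH layout shown above.

```shell
#!/usr/bin/env bash
# Read the NAME/SYNC/HEALTH table on stdin and print the name of every
# application that is not both Synced and Healthy. The header row is skipped.
unhealthy_apps() {
  awk 'NR > 1 && ($2 != "Synced" || $3 != "Healthy") { print $1 }'
}

# Example (requires cluster access):
# kubectl get applications -n argocd \
#   -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status \
#   | unhealthy_apps
```

An empty result means every application is Synced and Healthy.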
Manual Sync
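Force sync an application:
# Trigger a hard refresh (re-reads Git and bypasses the manifest cache).
# With auto-sync enabled, any drift it finds is then synced automatically.
kubectl patch application myapp -n argocd \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
# Or run an explicit sync via the ArgoCD CLI (if installed)
argocd app sync myapp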
Pausing Auto-Sync
Temporarily disable automatic syncing:
# Edit application
kubectl edit application myapp -n argocd
# Set automated to null
spec:
  syncPolicy:
    automated: null # Disable auto-sync
# Re-enable later
spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
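The same change can be made non-interactively with a JSON merge patch, where a null value removes the field. The sketch below only builds the patch body; the function name autosync_patch is hypothetical. Note that if the Application is itself managed by a parent app with selfHeal enabled, the parent may revert the change, so a Git commit may still be needed.

```shell
#!/usr/bin/env bash
# Print a JSON merge patch that turns auto-sync off or on for an Application.
# "on" restores the prune/selfHeal settings shown above.
autosync_patch() {
  case "$1" in
    off) printf '%s' '{"spec":{"syncPolicy":{"automated":null}}}' ;;
    on)  printf '%s' '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}' ;;
    *)   echo "usage: autosync_patch on|off" >&2; return 1 ;;
  esac
}

# Example (requires cluster access):
# kubectl patch application myapp -n argocd --type merge -p "$(autosync_patch off)"
```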
Application Management
Deploying a New Application
See Developer Guide for detailed steps.
Quick checklist:
- Create helm-values/myapp/values.yaml
- Create apps/myapp.yaml in config repo
- Create SealedSecret if needed
- Commit and push changes
- Verify sync in Slack/ArgoCD
- Configure DNS for domain
- Test application accessibility
Removing an Application
Safe Removal Procedure
# 1. Delete ArgoCD Application (with cascade)
kubectl delete application myapp -n argocd
# If the Application carries the resources-finalizer.argocd.argoproj.io finalizer, this will:
# - Remove application from ArgoCD
# - Delete all Kubernetes resources (cascade)
# - Remove namespace
# 2. Clean up Git repositories
cd ~/dev/k8s/launchpad
git rm apps/myapp.yaml
git commit -m "Remove myapp application"
git push
cd ~/dev/k8s/helm-prod-values
git rm -r myapp/
git commit -m "Remove myapp values"
git push
# 3. Remove sealed secrets (if any)
cd ~/dev/k8s/launchpad
git rm secrets/myapp-credentials-sealed.yaml
git commit -m "Remove myapp secrets"
git push
Removal Without Cascade
To remove from ArgoCD but keep resources running:
# Delete application with no cascade
kubectl patch application myapp -n argocd \
-p '{"metadata":{"finalizers":[]}}' --type merge
kubectl delete application myapp -n argocd
# Resources remain in cluster but are no longer managed
Scaling Applications
Manual Scaling
# Scale deployment directly
kubectl scale deployment myapp -n myapp --replicas=3
# Note: If selfHeal is enabled, this will be reverted
GitOps Scaling
Update helm-values/myapp/values.yaml:
app:
  replicaCount: 3 # Change from 1 to 3
Commit and push - ArgoCD will sync.
Auto-Scaling (HPA)
Enable Horizontal Pod Autoscaler:
# In helm-values/myapp/values.yaml
app:
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
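To sanity-check these numbers, recall that the HPA computes the desired replica count as ceil(currentReplicas × currentUtilization / targetUtilization), clamped to the min/max bounds. The sketch below reproduces that arithmetic with integer math; the function name hpa_desired is hypothetical, and the HPA's tolerance band and stabilization windows are deliberately ignored.

```shell
#!/usr/bin/env bash
# Usage: hpa_desired CURRENT_REPLICAS CURRENT_CPU_PERCENT TARGET_CPU_PERCENT MIN MAX
# Prints ceil(current * usage / target), clamped to [MIN, MAX].
hpa_desired() {
  local current="$1" usage="$2" target="$3" min="$4" max="$5"
  # Integer ceiling division: (a + b - 1) / b
  local desired=$(( (current * usage + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}
```

With the values above, `hpa_desired 2 140 70 2 10` prints 4: two pods running at 140% of their CPU request against a 70% target need four pods to get back to the target.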
Note: Remove replicaCount from ArgoCD ignore list if using HPA:
# In apps/myapp.yaml
ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /spec/replicas # Remove this line
Rolling Back Deployments
Option 1: Git Revert
# Find the commit before the bad change
cd ~/dev/k8s/helm-prod-values
git log --oneline myapp/values.yaml
# Revert to previous version
git revert <commit-hash>
git push
# ArgoCD will sync the rollback
Option 2: Manual Rollback
# Rollback to previous revision
kubectl rollout undo deployment myapp -n myapp
# Note: This will be reverted by ArgoCD selfHeal
# Make permanent by updating Git
Option 3: Change Image Tag
# Edit helm-values
cd ~/dev/k8s/helm-prod-values
vim myapp/values.yaml
# Change image tag to previous version
app:
  image:
    tag: v1.0.0 # Roll back from v1.0.1
# Commit and push
git add myapp/values.yaml
git commit -m "Rollback myapp to v1.0.0"
git push
Resource Updates
Update Resource Limits
# In helm-values/myapp/values.yaml
app:
  resources:
    requests:
      cpu: 200m # Increased from 100m
      memory: 512Mi # Increased from 256Mi
    limits:
      cpu: 1000m
      memory: 2Gi
Enable Database
# In helm-values/myapp/values.yaml
db:
  enabled: true
  persistence:
    size: 10Gi # Increase storage
Secret Management
Creating Secrets
Step 1: Get Public Certificate
# Fetch sealed-secrets public cert (one-time)
kubeseal --fetch-cert \
--controller-name=sealed-secrets-controller \
--controller-namespace=kube-system \
> pub-cert.pem
# Save this certificate for future use
Step 2: Create Plain Secret
# Method 1: From literal values
kubectl create secret generic myapp-credentials \
--from-literal=API_KEY=secret123 \
--from-literal=DB_PASSWORD=pass456 \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
# Method 2: From file
kubectl create secret generic myapp-credentials \
--from-file=.env \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
# Method 3: From multiple files
kubectl create secret generic myapp-credentials \
--from-file=api-key.txt \
--from-file=db-password.txt \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
Step 3: Seal Secret
kubeseal --format=yaml \
--cert=pub-cert.pem \
--namespace=myapp \
< private/myapp-credentials.yaml \
> secrets/myapp-credentials-sealed.yaml
Step 4: Commit Sealed Secret
git add secrets/myapp-credentials-sealed.yaml
git commit -m "Add myapp credentials"
git push
# Delete plain secret
rm private/myapp-credentials.yaml
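The main risk in this workflow is committing the plain Secret instead of the sealed one. A minimal guard, suitable for a pre-commit hook, is sketched below. The function name is_sealed is hypothetical, and it assumes the manifests carry a top-level `kind:` line in the form that kubectl and kubeseal emit.

```shell
#!/usr/bin/env bash
# Succeed only if FILE declares kind: SealedSecret and contains no plain
# kind: Secret document.
is_sealed() {
  local file="$1"
  grep -q '^kind: SealedSecret$' "$file" && ! grep -q '^kind: Secret$' "$file"
}

# Example:
# is_sealed secrets/myapp-credentials-sealed.yaml || echo "refusing to commit a plain Secret" >&2
```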
Updating Secrets
# 1. Create new version
kubectl create secret generic myapp-credentials \
--from-literal=API_KEY=new-secret-key \
--from-literal=DB_PASSWORD=new-password \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
# 2. Seal it
kubeseal --format=yaml \
--cert=pub-cert.pem \
--namespace=myapp \
< private/myapp-credentials.yaml \
> secrets/myapp-credentials-sealed.yaml
# 3. Commit
git add secrets/myapp-credentials-sealed.yaml
git commit -m "Update myapp credentials"
git push
# 4. Restart pods to pick up new secret
kubectl rollout restart deployment myapp -n myapp
# 5. Delete plain secret
rm private/myapp-credentials.yaml
Viewing Secrets (Unsealed)
# List secrets in namespace
kubectl get secrets -n myapp
# Describe secret (doesn't show values)
kubectl describe secret myapp-credentials -n myapp
# View secret values (base64 encoded)
kubectl get secret myapp-credentials -n myapp -o yaml
# Decode secret value
kubectl get secret myapp-credentials -n myapp \
-o jsonpath='{.data.API_KEY}' | base64 -d
Secret Cloning (Kyverno)
Secrets labeled allowedToBeCloned: "true" in the secrets namespace are automatically cloned to new namespaces.
# Example: secrets-namespace.yaml
apiVersion: v1
kind: Secret
metadata:
  name: shared-credentials
  namespace: secrets
  labels:
    allowedToBeCloned: "true"
type: Opaque
data:
  API_KEY: <base64-encoded-value>
When a new namespace is created, Kyverno automatically copies this secret.
Authentication Secrets
Applications using the authentication sidecar require specific secrets depending on the auth mode.
Token Mode Secrets
Token-based auth uses an auth-tokens Secret:
# Method 1: From Helm values (automatic)
# Tokens specified in values.yaml are automatically created
# Method 2: Manual creation
kubectl create secret generic auth-tokens \
--from-literal=tokens="token1
token2
token3" \
--namespace=myapp
# Method 3: From file
echo "d4f88f6d9292c10cc3e21c4aad56d2be485db532b54fe961d738e1137d247823" > tokens.txt
echo "8803f621acc3898df1d7a8f514bc3602551a0681a8f747bd4e43c3c5849d57a7" >> tokens.txt
kubectl create secret generic auth-tokens \
--from-file=tokens=tokens.txt \
--namespace=myapp
rm tokens.txt
OIDC Mode Secrets
OIDC auth requires an auth-oidc Secret with two keys:
# Generate secrets
CLIENT_SECRET="your-oidc-client-secret-from-provider"
COOKIE_SECRET=$(openssl rand -hex 32)
# Create plain secret
kubectl create secret generic auth-oidc \
--from-literal=client-secret="$CLIENT_SECRET" \
--from-literal=cookie-secret="$COOKIE_SECRET" \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-auth-oidc.yaml
# Seal it
kubeseal --format=yaml \
--cert=pub-cert.pem \
--namespace=myapp \
< private/myapp-auth-oidc.yaml \
> secrets/myapp-auth-oidc-sealed.yaml
# Apply sealed secret
kubectl apply -f secrets/myapp-auth-oidc-sealed.yaml
# Commit to Git
git add secrets/myapp-auth-oidc-sealed.yaml
git commit -m "Add OIDC secrets for myapp"
git push
# Clean up
rm private/myapp-auth-oidc.yaml
Rotating Authentication Secrets
Token Rotation:
# Generate new token
NEW_TOKEN=$(openssl rand -hex 32)
# Get current tokens
kubectl get secret auth-tokens -n myapp -o yaml > /tmp/tokens.yaml
# Edit tokens (add new, optionally remove old)
# Then re-seal and apply
# Restart pods to use new tokens
kubectl rollout restart deployment myapp -n myapp
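The "edit tokens" step can be scripted so that the new token is appended and only the most recent few are kept, which lets old and new tokens overlap during a rotation. This is a sketch under the assumption that the tokens key holds one token per line, as in the creation examples above; the function name add_token is hypothetical.

```shell
#!/usr/bin/env bash
# Usage: add_token NEW_TOKEN [KEEP]
# Reads the current token list (one per line) on stdin, drops blank lines,
# appends NEW_TOKEN, and prints the result. If KEEP is given and greater
# than zero, only the newest KEEP tokens are printed.
add_token() {
  local new="$1" keep="${2:-0}"
  if [ "$keep" -gt 0 ]; then
    awk -v new="$new" 'NF { print } END { print new }' | tail -n "$keep"
  else
    awk -v new="$new" 'NF { print } END { print new }'
  fi
}

# Example (requires cluster access):
# kubectl get secret auth-tokens -n myapp -o jsonpath='{.data.tokens}' | base64 -d \
#   | add_token "$NEW_TOKEN" 2 > tokens.txt
```

The resulting tokens.txt can then be turned into a Secret with --from-file=tokens=tokens.txt, sealed, and committed as described earlier.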
OIDC Secret Rotation:
# Rotate cookie secret (safe - invalidates existing sessions)
NEW_COOKIE_SECRET=$(openssl rand -hex 32)
# Recreate secret
kubectl create secret generic auth-oidc \
--from-literal=client-secret="$CLIENT_SECRET" \
--from-literal=cookie-secret="$NEW_COOKIE_SECRET" \
--namespace=myapp \
--dry-run=client -o yaml | \
kubeseal --format=yaml --cert=pub-cert.pem --namespace=myapp | \
kubectl apply -f -
# Restart to pick up new secret
kubectl rollout restart deployment myapp -n myapp
Viewing Authentication Secrets
# List auth-related secrets
kubectl get secrets -n myapp | grep auth
# View token secret (stored base64-encoded in the Secret; decode to read the tokens)
kubectl get secret auth-tokens -n myapp -o jsonpath='{.data.tokens}' | base64 -d
# View OIDC secret keys (values are base64 encoded)
kubectl get secret auth-oidc -n myapp -o jsonpath='{.data.client-secret}' | base64 -d
kubectl get secret auth-oidc -n myapp -o jsonpath='{.data.cookie-secret}' | base64 -d
See: Developer Guide - Enabling Authentication for complete authentication setup guide.
Monitoring & Alerting
Prometheus Metrics
# Port forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Access: http://localhost:9090
Common Queries:
# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
# Memory usage per pod
sum(container_memory_usage_bytes) by (pod)
# Request rate per service
rate(http_requests_total[5m])
Grafana Dashboards
# Port forward to Grafana
kubectl port-forward -n monitoring svc/grafana 3000:80
# Access: http://localhost:3000
Loki Logs
# Port forward to Loki
kubectl port-forward -n monitoring svc/loki 3100:3100
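# Query logs from the last hour (start is a nanosecond Unix epoch; GNU date syntax)
curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="myapp"}' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" | jq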
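Loki's query_range endpoint accepts start and end as nanosecond Unix epoch timestamps, so a relative window has to be converted first. A small portable sketch of that conversion follows; the function name ns_since is hypothetical, and the optional second argument exists only to make the arithmetic reproducible.

```shell
#!/usr/bin/env bash
# Usage: ns_since SECONDS_AGO [NOW_EPOCH_SECONDS]
# Prints the nanosecond Unix epoch for SECONDS_AGO seconds before NOW
# (NOW defaults to the current time).
ns_since() {
  local seconds_ago="$1" now="${2:-$(date +%s)}"
  echo "$(( (now - seconds_ago) * 1000000000 ))"
}

# Example: logs from the last 15 minutes (requires the port-forward above)
# curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
#   --data-urlencode 'query={namespace="myapp"}' \
#   --data-urlencode "start=$(ns_since 900)" | jq
```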
Tempo Traces
# Port forward to Tempo query API
kubectl port-forward -n monitoring svc/tempo 3200:3200
# Access: http://localhost:3200
Query traces via Grafana:
- Open Grafana → Explore
- Select Tempo datasource
- Use TraceQL or search by service name
Verify Traefik is sending traces:
# Check Traefik logs for OTLP export errors
kubectl logs -n traefik-system -l app.kubernetes.io/name=traefik | grep -i "traces export"
# Check Tempo is receiving data
kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep "receiver"
Trace-to-log correlation:
- Click a trace span in Grafana → linked Loki logs appear (by namespace, pod, container)
- Trace-to-metrics links to Prometheus by service name
Fluent-Bit Log Shipping
Verify Fluent-Bit is shipping logs:
# Check Fluent-Bit pods
kubectl get pods -n monitoring | grep fluent-bit
# Check logs
kubectl logs -n monitoring daemonset/fluent-bit
# Verify Loki is receiving logs
kubectl logs -n monitoring deployment/loki | grep "POST /loki/api/v1/push"
Trivy Vulnerability Scanning
# Check Trivy scan results
kubectl get vulnerabilityreports --all-namespaces
# View report for specific pod
kubectl describe vulnerabilityreport -n myapp <report-name>
Slack Notifications
All applications have Slack notifications enabled:
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: ""
    notifications.argoproj.io/subscribe.on-sync-failed.slack: ""
    notifications.argoproj.io/subscribe.on-degraded.slack: ""
Test Notification:
# Trigger a sync to test
kubectl patch application myapp -n argocd \
--type merge \
-p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
Troubleshooting
Application Won't Sync
Check Application Status
kubectl describe application myapp -n argocd
Look for errors in:
- Status.Conditions
- Status.OperationState
Common Issues
Issue 1: Image Pull Error
# Error: ErrImagePull, ImagePullBackOff
# Check if image exists
docker pull ghcr.io/fortedigital/myapp:v1.0.0
# Check image pull secrets
kubectl get secrets -n myapp | grep regcred
# Check pod events
kubectl describe pod -n myapp <pod-name>
Issue 2: Invalid YAML
# Error: unable to decode manifest
# Validate YAML locally
kubectl apply --dry-run=client -f apps/myapp.yaml
# Check ArgoCD application controller logs
kubectl logs -n argocd deployment/argocd-application-controller | grep myapp
Issue 3: Resource Quota Exceeded
# Error: exceeded quota
# Check namespace quotas
kubectl get resourcequota -n myapp
kubectl describe resourcequota -n myapp
# Increase quota or reduce resource requests
Pod Crashes
CrashLoopBackOff
# Check pod status
kubectl get pods -n myapp
# View logs
kubectl logs -n myapp <pod-name>
kubectl logs -n myapp <pod-name> --previous # Previous container
# Check events
kubectl describe pod -n myapp <pod-name>
Common Causes:
- Application error (check logs)
- Missing environment variables
- Wrong port configuration
- Missing secrets
- Insufficient memory/CPU
ImagePullBackOff
# Check image name
kubectl get deployment myapp -n myapp -o yaml | grep image
# Verify credentials
kubectl get secret -n myapp
Pending
# Check why pod is pending
kubectl describe pod -n myapp <pod-name>
# Common reasons:
# - Insufficient resources on nodes
# - PVC not bound
# - Node selector doesn't match
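To find pods in any of the states above without reading the whole listing, the output of kubectl get pods can be filtered on its STATUS column. This is a minimal sketch; the function name failing_pods is hypothetical, and it assumes the default column order (NAME, READY, STATUS, ...).

```shell
#!/usr/bin/env bash
# Read `kubectl get pods` output on stdin and print "name status" for every
# pod whose STATUS is neither Running nor Completed. The header row is skipped.
failing_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example (requires cluster access):
# kubectl get pods -n myapp | failing_pods
```

A pod can be Running while its containers are not ready, so an empty result here does not replace a look at the READY column.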
Ingress / TLS Issues
Application Not Accessible
# Check IngressRoute
kubectl get ingressroute -n myapp
kubectl describe ingressroute myapp -n myapp
# Check Traefik
kubectl get pods -n traefik
kubectl logs -n traefik deployment/traefik
# Test with port-forward
kubectl port-forward -n myapp service/myapp 8080:3000
curl http://localhost:8080
Certificate Issues
# Check certificates
kubectl get certificate -n myapp
kubectl describe certificate myapp-tls -n myapp
# Check cert-manager
kubectl get clusterissuer
kubectl logs -n cert-manager deployment/cert-manager
# Check Let's Encrypt challenges
kubectl get challenges --all-namespaces
Manual Certificate Renewal:
# Delete and recreate certificate
kubectl delete certificate myapp-tls -n myapp
# Certificate will be automatically recreated
Database Issues
PostgreSQL Won't Start
# Check StatefulSet
kubectl get statefulset -n myapp
kubectl describe statefulset postgres -n myapp
# Check PVC
kubectl get pvc -n myapp
kubectl describe pvc -n myapp
# Check logs
kubectl logs -n myapp postgres-0
Data Persistence
# Verify PVC is bound
kubectl get pvc -n myapp
# Check storage class
kubectl get storageclass
# Resize PVC (if supported)
kubectl edit pvc postgres-data-postgres-0 -n myapp
# Change: storage: 10Gi (from 5Gi)
Kyverno Policy Issues
Policy Violations
# List policies
kubectl get clusterpolicy
# Check policy reports
kubectl get policyreport --all-namespaces
# View specific policy
kubectl describe clusterpolicy secret-cloner
Secret Not Cloned
# Check if secret has label
kubectl get secret -n secrets --show-labels
# Check Kyverno logs
kubectl logs -n kyverno deployment/kyverno
# Manually trigger by recreating namespace
kubectl delete ns test-ns
kubectl create ns test-ns
ArgoCD Issues
ArgoCD UI Not Accessible
# Check ArgoCD pods
kubectl get pods -n argocd
# Restart ArgoCD server
kubectl rollout restart deployment argocd-server -n argocd
# Port forward
kubectl port-forward svc/argocd-server -n argocd 8080:443
Sync Takes Too Long
# Check application controller logs
kubectl logs -n argocd deployment/argocd-application-controller
# Increase timeout (in apps/myapp.yaml)
spec:
  syncPolicy:
    retry:
      backoff:
        maxDuration: 5m # Increase from 3m
Disaster Recovery
Backup Strategy
Current State: No automated backups
What Needs Backup:
- ❌ Cluster state (not backed up - recreate via GitOps)
- ❌ Persistent volumes (currently not critical)
- ✅ Git repositories (GitHub provides backup)
- ⚠️ Secrets (sealed secrets in Git, unseal keys need safekeeping)
Cluster Rebuild
Scenario: Complete cluster failure
# 1. Provision new Kubernetes cluster
# 2. Configure kubectl
kubectl config use-context new-cluster
kubectl cluster-info
# 3. Bootstrap cluster
cd ~/dev/k8s/launchpad
./bootstrap.sh
# 4. Wait for ArgoCD to sync all applications
kubectl get applications -n argocd -w
# 5. Recreate any unsealed secrets (from password manager)
# 6. Configure DNS for new cluster IPs
# 7. Verify all applications are healthy
Time Estimate: 30-60 minutes
Data Loss:
- Ephemeral data: Lost
- Database data: Lost (no backups currently)
- Configuration: No loss (in Git)
Future Backup Plan
Recommended:
- Velero for cluster backups
  helm install velero vmware-tanzu/velero \
    --namespace velero \
    --create-namespace \
    --set configuration.provider=aws \
    --set 'configuration.backupStorageLocation[0].bucket=cluster-backups'
- PostgreSQL backups via CronJob
  # pg-backup-cronjob.yaml
  kind: CronJob
  spec:
    schedule: "0 2 * * *" # Daily at 2am
    jobTemplate:
      spec:
        template:
          spec:
            containers:
              - name: pg-dump
                image: postgres:16-alpine
                command:
                  - /bin/sh
                  - -c
                  - pg_dump -U $DB_USER -d $DB_NAME > /backup/dump-$(date +%Y%m%d).sql
- Sealed Secrets private key backup
  # Backup sealed-secrets controller private key(s), selected by label
  kubectl get secret -n kube-system -l sealedsecrets.bitnami.com/sealed-secrets-key \
    -o yaml > sealed-secrets-key-backup.yaml
  # Store in secure location (password manager, vault)
Maintenance Procedures
Upgrading ArgoCD
# Check current version
kubectl get deployment argocd-server -n argocd \
-o jsonpath='{.spec.template.spec.containers[0].image}'
# Update version in values
vim infra/values/argocd-values.yaml
# Or upgrade via Helm directly
helm upgrade argocd argo-cd \
--repo https://argoproj.github.io/argo-helm \
--namespace argocd \
--values infra/values/argocd-values.yaml \
--version 6.0.0 # New version
# Verify
kubectl get pods -n argocd
Upgrading Kubernetes Version
# UpCloud: Upgrade via control panel or CLI
# After upgrade, verify cluster
kubectl version
kubectl get nodes
# Check for deprecated APIs
kubectl api-resources
# Update any deprecated resources in Git
Rotating TLS Certificates
Let's Encrypt certificates auto-renew, but if manual rotation is needed:
# Delete certificate to force renewal
kubectl delete certificate myapp-tls -n myapp
# Cert-manager will automatically recreate
kubectl get certificate -n myapp -w
Cleaning Up Old Resources
# List all namespaces
kubectl get namespaces
# Remove unused namespaces
kubectl delete namespace old-app
# Clean up ArgoCD applications
kubectl get applications -n argocd
kubectl delete application old-app -n argocd
# Clean up old Docker images (on nodes)
# SSH to nodes and run:
docker image prune -a --filter "until=720h" # 30 days
DNS Management
Adding New Subdomain:
- Add DNS A record pointing to Traefik LoadBalancer IP
  # Get LoadBalancer IP
  kubectl get svc -n traefik traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
- Add to DNS provider:
  myapp.forteapps.net A <LoadBalancer-IP>
- Verify DNS propagation:
  nslookup myapp.forteapps.net
  dig myapp.forteapps.net
Monitoring Resource Usage
# Node resource usage
kubectl top nodes
# Pod resource usage
kubectl top pods --all-namespaces
# Identify resource hogs
kubectl top pods --all-namespaces --sort-by=memory
kubectl top pods --all-namespaces --sort-by=cpu
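Sorting shows the biggest consumers, but a threshold is often more useful for a quick check. The sketch below filters kubectl top output for pods above a memory limit. The function name pods_over_memory is hypothetical, and it assumes the --all-namespaces column layout (NAMESPACE, NAME, CPU, MEMORY) with memory reported in Mi; rows in other units are skipped.

```shell
#!/usr/bin/env bash
# Usage: pods_over_memory LIMIT_MI
# Reads `kubectl top pods --all-namespaces` output on stdin and prints
# "namespace/pod <memory>Mi" for every pod using more than LIMIT_MI MiB.
pods_over_memory() {
  local limit_mi="$1"
  awk -v limit="$limit_mi" 'NR > 1 && $4 ~ /Mi$/ {
    mem = $4
    sub(/Mi$/, "", mem)
    if (mem + 0 > limit) print $1 "/" $2, mem "Mi"
  }'
}

# Example (requires cluster access and metrics-server):
# kubectl top pods --all-namespaces | pods_over_memory 512
```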
Advanced Operations
Adding a New Infrastructure Component
Example: Adding Redis
# 1. Create application manifest
cat > infra/redis-application.yaml <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: redis
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  project: default
  source:
    repoURL: https://charts.bitnami.com/bitnami
    chart: redis
    targetRevision: 18.0.0
    helm:
      values: |
        auth:
          enabled: true
          password: changeme
  destination:
    server: https://kubernetes.default.svc
    namespace: redis
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
EOF
# 2. Commit and push
git add infra/redis-application.yaml
git commit -m "Add Redis infrastructure component"
git push
# 3. ArgoCD will auto-sync within 60 seconds
Multi-Cluster Setup (Future)
For multi-cluster deployments:
# Different destinations per environment
# dev-cluster
destination:
  server: https://dev.k8s.example.com
  namespace: myapp
# prod-cluster
destination:
  server: https://prod.k8s.example.com
  namespace: myapp
Blue-Green Deployments
# Deploy blue version
helm install myapp-blue forteapp \
--set app.image.tag=v1.0.0
# Deploy green version
helm install myapp-green forteapp \
--set app.image.tag=v2.0.0
# Switch traffic via IngressRoute
kubectl patch ingressroute myapp -n myapp --type merge \
-p '{"spec":{"routes":[{"services":[{"name":"myapp-green"}]}]}}'
# Remove blue deployment after validation
helm uninstall myapp-blue
Emergency Procedures
Emergency Rollback
# Immediate rollback
kubectl rollout undo deployment myapp -n myapp
# Update Git to make permanent
cd ~/dev/k8s/helm-prod-values
git revert HEAD
git push
Emergency Scale Down
# Scale to zero (maintenance mode)
kubectl scale deployment myapp -n myapp --replicas=0
# Update Git
vim helm-values/myapp/values.yaml
# Set replicaCount: 0
git commit -am "Scale down myapp for maintenance"
git push
Emergency Application Removal
# Remove application but keep data
kubectl patch application myapp -n argocd \
-p '{"metadata":{"finalizers":[]}}' --type merge
kubectl delete application myapp -n argocd
# Resources remain in cluster
Useful Scripts
Sync All Applications
#!/bin/bash
# sync-all.sh - hard-refresh every application (auto-sync then applies any drift)
for app in $(kubectl get applications -n argocd -o name); do
  kubectl patch "$app" -n argocd \
    --type merge \
    -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
done
Check All Applications Health
#!/bin/bash
# health-check.sh
kubectl get applications -n argocd \
-o custom-columns=\
NAME:.metadata.name,\
SYNC:.status.sync.status,\
HEALTH:.status.health.status,\
MESSAGE:.status.health.message
Seal Secret Helper
#!/bin/bash
# seal-secret.sh - usage: seal-secret.sh [namespace] [secret-file] [output-file]
set -euo pipefail
NAMESPACE=${1:-default}
SECRET_FILE=${2:-private/secret.yaml}
OUTPUT_FILE=${3:-secrets/secret-sealed.yaml}
# Quote the variables so paths containing spaces work
kubeseal --format=yaml \
  --cert=pub-cert.pem \
  --namespace="$NAMESPACE" \
  < "$SECRET_FILE" \
  > "$OUTPUT_FILE"
echo "Sealed secret created: $OUTPUT_FILE"
echo "Remember to delete: $SECRET_FILE"
Checklist Templates
New Application Deployment Checklist
- Application code repository created
- Dockerfile created and tested
- GitHub Actions workflow configured
- Helm values created in helm-prod-values/
- ArgoCD application manifest created in apps/
- Secrets created and sealed
- DNS record added for domain
- Application synced successfully
- Health check passed
- Slack notification received
- Application accessible via domain
- Monitoring configured
- Documentation updated
Incident Response Checklist
- Incident identified (Slack alert, monitoring)
- Severity assessed
- Incident channel created
- Initial investigation (logs, metrics, events)
- Root cause identified
- Mitigation applied
- Verification of fix
- Post-mortem scheduled
- Documentation updated
Last Updated: 2026-03-16
Maintained By: Platform Team
Emergency Contact: #platform-support on Slack