# Operations Runbook

## Table of Contents

- [Overview](#overview)
- [Cluster Bootstrap](#cluster-bootstrap)
  - [Initial Cluster Setup](#initial-cluster-setup)
  - [ArgoCD Repository Access Setup](#argocd-repository-access-setup)
- [Day-to-Day Operations](#day-to-day-operations)
- [Application Management](#application-management)
- [Secret Management](#secret-management)
- [Monitoring & Alerting](#monitoring--alerting)
- [Troubleshooting](#troubleshooting)
- [Disaster Recovery](#disaster-recovery)
- [Maintenance Procedures](#maintenance-procedures)

---

## Overview

This runbook provides operational procedures for maintaining the Kubernetes cluster and managing applications. It's intended for platform engineers and operators with full cluster access.

### Operator Prerequisites

- ✅ Full kubectl access to cluster
- ✅ Write access to all Git repositories
- ✅ ArgoCD UI access
- ✅ Slack notifications configured
- ✅ Understanding of Kubernetes concepts

---

## Cluster Bootstrap

### Initial Cluster Setup

Bootstrap a new cluster from scratch:

#### Prerequisites

1. **Kubernetes cluster running** (UpCloud or any K8s cluster)
2. **kubectl configured** with admin access
3. **Repositories cloned** locally

```bash
# Verify cluster access
kubectl cluster-info
kubectl get nodes
```

#### Bootstrap Procedure

```bash
# 1. Clone config repository
git clone https://github.com/fortedigital/sturdy-adventure.git
cd sturdy-adventure

# 2. Set cluster name (optional)
export CLUSTER_NAME="prod-cluster-01"

# 3. Run bootstrap script
./bootstrap.sh
```

**What Happens:**

1. ✅ Installs ArgoCD via Helm
2. ✅ Configures ArgoCD with custom values
3. ✅ Applies root App-of-Apps manifest
4. ✅ ArgoCD automatically syncs all applications
5. ✅ Infrastructure and apps deploy in waves

#### Verify Bootstrap

```bash
# Wait for ArgoCD to be ready
kubectl wait --for=condition=available --timeout=300s \
  deployment/argocd-server -n argocd

# Check ArgoCD applications
kubectl get applications -n argocd

# Expected output: infrastructure-apps, enterprise-apps, and all child apps
```

#### Post-Bootstrap Steps

1. **Configure DNS** for ingress domains:
   - `argocd.127.0.0.1.nip.io` (local dev)
   - `*.forteapps.net` (production)

2. **Verify Let's Encrypt certificates**:
   ```bash
   kubectl get certificate --all-namespaces
   kubectl get clusterissuer
   ```

3. **Check Kyverno policies**:
   ```bash
   kubectl get clusterpolicy
   ```

4. **Verify monitoring stack**:
   ```bash
   kubectl get pods -n monitoring
   ```

5. **Test Slack notifications** by triggering a sync

### ArgoCD Repository Access Setup

ArgoCD needs SSH access to private Git repositories to pull manifests and Helm values. This section covers setting up deploy keys for GitHub repositories.

#### Why Deploy Keys?

- **Read-only access**: Deploy keys provide secure, read-only access to repositories
- **No user credentials**: No need to share personal SSH keys or tokens
- **Repository-specific**: Each repository gets its own key for better security
- **Revocable**: Easy to revoke access without affecting other repositories

#### Prerequisites

- kubectl access to the cluster
- Write access to the GitHub repository
- ArgoCD installed and running

#### Setup Procedure

**Step 1: Generate SSH Key Pair**

Generate a dedicated SSH key for ArgoCD without a passphrase (required for automated access):

```bash
# Generate ED25519 key (recommended - smaller and more secure)
ssh-keygen -t ed25519 -C "argocd-deploy-key-sturdy-adventure" -f argocd-deploy-key -N ""

# Or RSA key if ED25519 is not supported
ssh-keygen -t rsa -b 4096 -C "argocd-deploy-key-sturdy-adventure" -f argocd-deploy-key -N ""
```

This creates two files:

- `argocd-deploy-key` - Private key (keep secret)
- `argocd-deploy-key.pub` - Public key (add to GitHub)

**Step 2: Add Public Key to GitHub**

1. Copy the public key:
   ```bash
   cat argocd-deploy-key.pub
   ```

2. Go to GitHub repository settings:
   - Navigate to: `https://github.com/fortedigital/sturdy-adventure/settings/keys`
   - Or: Repository → Settings → Deploy keys

3. Click **"Add deploy key"**
   - Title: `ArgoCD Production Cluster`
   - Key: Paste the public key content
   - ☐ Allow write access (leave unchecked - read-only is sufficient)
   - Click **"Add key"**

4. Repeat for the `helm-values` repository if it's private:
   ```bash
   # Generate separate key for helm-values repo
   ssh-keygen -t ed25519 -C "argocd-deploy-key-helm-values" -f argocd-helm-values-key -N ""

   # Add to: https://github.com/fortedigital/helm-values/settings/keys
   ```

**Step 3: Create Kubernetes Secret**

Add the private key to ArgoCD as a repository secret. Save the following file as `secret.yaml` in the `private/` folder (gitignored):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: forte-helm-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: git@github.com:fortedigital/forte-helm.git
  sshPrivateKey: |  # private key contents go on the following lines, indented
  project: default
```

Seal the secret with `kubeseal`:

```bash
kubeseal --format=yaml \
  --namespace=argocd \
  < private/secret.yaml \
  > secrets/forte-helm-repo-secret-sealed.yaml
```

**Step 4: Register Repository in ArgoCD**

Commit `secrets/forte-helm-repo-secret-sealed.yaml` and let ArgoCD sync it; the sealed-secrets controller then creates the repository secret.

**Step 5: Verify Repository Access**

```bash
# Check if repository is connected
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository

# Verify connection in ArgoCD UI
# Settings → Repositories → Should show "Successful" status

# Test by creating an application
kubectl apply -f _app-of-apps.yaml

# Check application sync status
kubectl get applications -n argocd
```

#### Testing Repository Access

Create a test application to verify SSH access:

```bash
cat > /tmp/test-repo-access.yaml <> ~/.ssh/known_hosts

# Or disable strict host key checking (less secure)
kubectl patch secret repo-sturdy-adventure -n argocd \
  --type merge \
  -p '{"stringData":{"insecure":"true"}}'
```

**Issue: Repository shows as "Unknown" status**

```bash
# Check repository server logs
kubectl logs -n argocd deployment/argocd-repo-server

# Refresh repository connection
kubectl delete secret repo-sturdy-adventure -n argocd
# Recreate secret (see Step 3 above)

# Restart ArgoCD components
kubectl rollout restart deployment argocd-repo-server -n argocd
kubectl rollout restart deployment argocd-application-controller -n argocd
```

#### Multiple Repository Setup

For the three-repository pattern (sturdy-adventure, forte-helm, helm-values):

```bash
# 1. sturdy-adventure (main config repo)
ssh-keygen -t ed25519 -C "argocd-sturdy-adventure" -f key-sturdy -N ""
# Add key-sturdy.pub to: https://github.com/fortedigital/sturdy-adventure/settings/keys

# 2. helm-values (private values repo)
ssh-keygen -t ed25519 -C "argocd-helm-values" -f key-helm-values -N ""
# Add key-helm-values.pub to: https://github.com/fortedigital/helm-values/settings/keys

# 3. forte-helm (private helm charts repo)

# Create secrets
# type and url tell ArgoCD which repository each key belongs to
kubectl create secret generic repo-sturdy-adventure \
  --from-literal=type=git \
  --from-literal=url=git@github.com:fortedigital/sturdy-adventure.git \
  --from-file=sshPrivateKey=key-sturdy \
  --namespace=argocd --dry-run=client -o yaml | \
  kubectl label --local -f - argocd.argoproj.io/secret-type=repository --dry-run=client -o yaml | \
  kubectl apply -f -

kubectl create secret generic repo-helm-values \
  --from-literal=type=git \
  --from-literal=url=git@github.com:fortedigital/helm-values.git \
  --from-file=sshPrivateKey=key-helm-values \
  --namespace=argocd --dry-run=client -o yaml | \
  kubectl label --local -f - argocd.argoproj.io/secret-type=repository --dry-run=client -o yaml | \
  kubectl apply -f -

# Clean up keys
shred -u key-sturdy key-helm-values
```

#### Converting HTTPS to SSH

If you're currently using HTTPS and want to switch to SSH:

```bash
# 1. Generate and add deploy key (see steps above)

# 2. Update all Application manifests
# Change from:
#   repoURL: https://github.com/fortedigital/sturdy-adventure.git
# To:
#   repoURL: git@github.com:fortedigital/sturdy-adventure.git

# 3. Update and commit
find . -name "*.yaml" -type f -exec sed -i 's|https://github.com/fortedigital/|git@github.com:fortedigital/|g' {} +
git add .
git commit -m "Switch from HTTPS to SSH for repository access"
git push

# 4. ArgoCD will automatically re-sync with new SSH URLs
```

---

## Day-to-Day Operations

### Monitoring ArgoCD Sync Status

#### Via Slack

All applications send notifications to a shared Slack channel:

- ✅ `on-sync-succeeded` - Deployment succeeded
- ❌ `on-sync-failed` - Deployment failed
- ⚠️ `on-degraded` - Application unhealthy

#### Via CLI

```bash
# List all applications
kubectl get applications -n argocd

# Watch application status
kubectl get applications -n argocd -w

# Get detailed status
kubectl describe application myapp -n argocd
```

#### Via ArgoCD UI

```bash
# Port forward to UI
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Access: https://localhost:8080
# No login required (insecure mode for internal use)
```

### Checking Application Health

```bash
# Quick health check for all apps
kubectl get applications -n argocd \
  -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status

# Expected output:
# NAME                  SYNC     HEALTH
# infrastructure-apps   Synced   Healthy
# enterprise-apps       Synced   Healthy
# mcp10x                Synced   Healthy
# musicman              Synced   Healthy
```

### Manual Sync

Force sync an application:

```bash
# Request a hard refresh (with auto-sync enabled, ArgoCD then syncs any detected drift)
kubectl patch application myapp -n argocd \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'

# Or trigger a sync via ArgoCD CLI (if installed)
argocd app sync myapp
```

### Pausing Auto-Sync

Temporarily disable automatic syncing:

```bash
# Edit application
kubectl edit application myapp -n argocd

# Set automated to null
spec:
  syncPolicy:
    automated: null  # Disable auto-sync

# Re-enable later
spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

---

## Application Management

### Deploying a New Application

See [Developer Guide](DEVELOPER-GUIDE.md#deploying-your-first-application) for detailed steps.

**Quick checklist:**

- [ ] Create `helm-values/myapp/values.yaml`
- [ ] Create `apps/myapp.yaml` in config repo
- [ ] Create SealedSecret if needed
- [ ] Commit and push changes
- [ ] Verify sync in Slack/ArgoCD
- [ ] Configure DNS for domain
- [ ] Test application accessibility

### Removing an Application

#### Safe Removal Procedure

```bash
# 1. Delete ArgoCD Application (with cascade)
kubectl delete application myapp -n argocd

# This will:
# - Remove application from ArgoCD
# - Delete all Kubernetes resources (cascade)
# - Remove namespace

# 2. Clean up Git repositories
cd ~/dev/k8s/launchpad
git rm apps/myapp.yaml
git commit -m "Remove myapp application"
git push

cd ~/dev/k8s/helm-prod-values
git rm -r myapp/
git commit -m "Remove myapp values"
git push

# 3. Remove sealed secrets (if any)
cd ~/dev/k8s/launchpad
git rm secrets/myapp-credentials-sealed.yaml
git commit -m "Remove myapp secrets"
git push
```

#### Removal Without Cascade

To remove from ArgoCD but keep resources running:

```bash
# Delete application with no cascade
kubectl patch application myapp -n argocd \
  -p '{"metadata":{"finalizers":[]}}' --type merge
kubectl delete application myapp -n argocd

# Resources remain in cluster but are no longer managed
```

### Scaling Applications

#### Manual Scaling

```bash
# Scale deployment directly
kubectl scale deployment myapp -n myapp --replicas=3

# Note: If selfHeal is enabled, this will be reverted
```

#### GitOps Scaling

Update `helm-values/myapp/values.yaml`:

```yaml
app:
  replicaCount: 3  # Change from 1 to 3
```

Commit and push - ArgoCD will sync.

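The GitOps scaling change above can be scripted. The following is a minimal sketch, not part of the existing tooling: `set_replica_count` is a hypothetical helper name, and it assumes the values file contains a single `replicaCount:` key with a numeric value (every matching line would be rewritten otherwise).

```bash
# Hypothetical helper (an assumption, not an existing script in the repos):
# rewrite the number after "replicaCount:" in a Helm values file.
set_replica_count() {
  file="$1"
  count="$2"

  # Only accept a non-negative integer
  case "$count" in
    ''|*[!0-9]*)
      echo "replica count must be a non-negative integer" >&2
      return 1
      ;;
  esac

  # Replace only the digits, keeping indentation and any trailing comment.
  # -i.bak works with both GNU and BSD sed; the backup is removed afterwards.
  sed -i.bak -E "s/^([[:space:]]*replicaCount:[[:space:]]*)[0-9]+/\1${count}/" "$file" \
    && rm -f "${file}.bak"
}

# Example usage (run from the values repository):
#   set_replica_count myapp/values.yaml 3
#   git add myapp/values.yaml
#   git commit -m "Scale myapp to 3 replicas"
#   git push
```

Editing the file in Git rather than running `kubectl scale` keeps the change from being reverted by selfHeal.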
#### Auto-Scaling (HPA)

Enable Horizontal Pod Autoscaler:

```yaml
# In helm-values/myapp/values.yaml
app:
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
```

**Note:** Remove `replicaCount` from ArgoCD ignore list if using HPA:

```yaml
# In apps/myapp.yaml
ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /spec/replicas  # Remove this line
```

### Rolling Back Deployments

#### Option 1: Git Revert

```bash
# Find the commit that introduced the bad change
cd ~/dev/k8s/helm-prod-values
git log --oneline myapp/values.yaml

# Revert that commit
git revert <commit-hash>
git push

# ArgoCD will sync the rollback
```

#### Option 2: Manual Rollback

```bash
# Rollback to previous revision
kubectl rollout undo deployment myapp -n myapp

# Note: This will be reverted by ArgoCD selfHeal
# Make permanent by updating Git
```

#### Option 3: Change Image Tag

```bash
# Edit helm-values
cd ~/dev/k8s/helm-prod-values
vim myapp/values.yaml

# Change image tag to previous version
app:
  image:
    tag: v1.0.0  # Roll back from v1.0.1

# Commit and push
git add myapp/values.yaml
git commit -m "Rollback myapp to v1.0.0"
git push
```

### Resource Updates

#### Update Resource Limits

```yaml
# In helm-values/myapp/values.yaml
app:
  resources:
    requests:
      cpu: 200m      # Increased from 100m
      memory: 512Mi  # Increased from 256Mi
    limits:
      cpu: 1000m
      memory: 2Gi
```

#### Enable Database

```yaml
# In helm-values/myapp/values.yaml
db:
  enabled: true
  persistence:
    size: 10Gi  # Increase storage
```

---

## Secret Management

### Creating Secrets

#### Step 1: Get Public Certificate

```bash
# Fetch sealed-secrets public cert (one-time)
kubeseal --fetch-cert \
  --controller-name=sealed-secrets-controller \
  --controller-namespace=kube-system \
  > pub-cert.pem

# Save this certificate for future use
```

#### Step 2: Create Plain Secret

```bash
# Method 1: From literal values
kubectl create secret generic myapp-credentials \
  --from-literal=API_KEY=secret123 \
  --from-literal=DB_PASSWORD=pass456 \
  --namespace=myapp \
  --dry-run=client -o yaml > private/myapp-credentials.yaml

# Method 2: From file
kubectl create secret generic myapp-credentials \
  --from-file=.env \
  --namespace=myapp \
  --dry-run=client -o yaml > private/myapp-credentials.yaml

# Method 3: From multiple files
kubectl create secret generic myapp-credentials \
  --from-file=api-key.txt \
  --from-file=db-password.txt \
  --namespace=myapp \
  --dry-run=client -o yaml > private/myapp-credentials.yaml
```

#### Step 3: Seal Secret

```bash
kubeseal --format=yaml \
  --cert=pub-cert.pem \
  --namespace=myapp \
  < private/myapp-credentials.yaml \
  > secrets/myapp-credentials-sealed.yaml
```

#### Step 4: Commit Sealed Secret

```bash
git add secrets/myapp-credentials-sealed.yaml
git commit -m "Add myapp credentials"
git push

# Delete plain secret
rm private/myapp-credentials.yaml
```

### Updating Secrets

```bash
# 1. Create new version
kubectl create secret generic myapp-credentials \
  --from-literal=API_KEY=new-secret-key \
  --from-literal=DB_PASSWORD=new-password \
  --namespace=myapp \
  --dry-run=client -o yaml > private/myapp-credentials.yaml

# 2. Seal it
kubeseal --format=yaml \
  --cert=pub-cert.pem \
  --namespace=myapp \
  < private/myapp-credentials.yaml \
  > secrets/myapp-credentials-sealed.yaml

# 3. Commit
git add secrets/myapp-credentials-sealed.yaml
git commit -m "Update myapp credentials"
git push

# 4. Restart pods to pick up new secret
kubectl rollout restart deployment myapp -n myapp

# 5. Delete plain secret
rm private/myapp-credentials.yaml
```

### Viewing Secrets (Unsealed)

```bash
# List secrets in namespace
kubectl get secrets -n myapp

# Describe secret (doesn't show values)
kubectl describe secret myapp-credentials -n myapp

# View secret values (base64 encoded)
kubectl get secret myapp-credentials -n myapp -o yaml

# Decode secret value
kubectl get secret myapp-credentials -n myapp \
  -o jsonpath='{.data.API_KEY}' | base64 -d
```

### Secret Cloning (Kyverno)

Secrets labeled `allowedToBeCloned: "true"` in the `secrets` namespace are automatically cloned to new namespaces.

```yaml
# Example: secrets-namespace.yaml
apiVersion: v1
kind: Secret
metadata:
  name: shared-credentials
  namespace: secrets
  labels:
    allowedToBeCloned: "true"
type: Opaque
data:
  API_KEY: <base64-encoded-value>
```

When a new namespace is created, Kyverno automatically copies this secret.

### Authentication Secrets

Applications using the authentication sidecar require specific secrets depending on the auth mode.

#### Token Mode Secrets

Token-based auth uses an `auth-tokens` Secret:

```bash
# Method 1: From Helm values (automatic)
# Tokens specified in values.yaml are automatically created

# Method 2: Manual creation
kubectl create secret generic auth-tokens \
  --from-literal=tokens="token1 token2 token3" \
  --namespace=myapp

# Method 3: From file
echo "d4f88f6d9292c10cc3e21c4aad56d2be485db532b54fe961d738e1137d247823" > tokens.txt
echo "8803f621acc3898df1d7a8f514bc3602551a0681a8f747bd4e43c3c5849d57a7" >> tokens.txt
kubectl create secret generic auth-tokens \
  --from-file=tokens=tokens.txt \
  --namespace=myapp
rm tokens.txt
```

#### OIDC Mode Secrets

OIDC auth requires an `auth-oidc` Secret with two keys:

```bash
# Generate secrets
CLIENT_SECRET="your-oidc-client-secret-from-provider"
COOKIE_SECRET=$(openssl rand -hex 32)

# Create plain secret (values quoted so special characters survive)
kubectl create secret generic auth-oidc \
  --from-literal=client-secret="$CLIENT_SECRET" \
  --from-literal=cookie-secret="$COOKIE_SECRET" \
  --namespace=myapp \
  --dry-run=client -o yaml > private/myapp-auth-oidc.yaml

# Seal it
kubeseal --format=yaml \
  --cert=pub-cert.pem \
  --namespace=myapp \
  < private/myapp-auth-oidc.yaml \
  > secrets/myapp-auth-oidc-sealed.yaml

# Apply sealed secret
kubectl apply -f secrets/myapp-auth-oidc-sealed.yaml

# Commit to Git
git add secrets/myapp-auth-oidc-sealed.yaml
git commit -m "Add OIDC secrets for myapp"
git push

# Clean up
rm private/myapp-auth-oidc.yaml
```

#### Rotating Authentication Secrets

**Token Rotation**:

```bash
# Generate new token
NEW_TOKEN=$(openssl rand -hex 32)

# Get current tokens
kubectl get secret auth-tokens -n myapp -o yaml > /tmp/tokens.yaml

# Edit tokens (add new, optionally remove old)
# Then re-seal and apply

# Restart pods to use new tokens
kubectl rollout restart deployment myapp -n myapp
```

**OIDC Secret Rotation**:

```bash
# Rotate cookie secret (safe - invalidates existing sessions)
NEW_COOKIE_SECRET=$(openssl rand -hex 32)

# Recreate secret (CLIENT_SECRET must already be set in this shell)
kubectl create secret generic auth-oidc \
  --from-literal=client-secret="$CLIENT_SECRET" \
  --from-literal=cookie-secret="$NEW_COOKIE_SECRET" \
  --namespace=myapp \
  --dry-run=client -o yaml | \
  kubeseal --format=yaml --cert=pub-cert.pem --namespace=myapp | \
  kubectl apply -f -

# Restart to pick up new secret
kubectl rollout restart deployment myapp -n myapp
```

#### Viewing Authentication Secrets

```bash
# List auth-related secrets
kubectl get secrets -n myapp | grep auth

# View token secret (tokens are in plain text in the Secret)
kubectl get secret auth-tokens -n myapp -o jsonpath='{.data.tokens}' | base64 -d

# View OIDC secret keys (values are base64 encoded)
kubectl get secret auth-oidc -n myapp -o jsonpath='{.data.client-secret}' | base64 -d
kubectl get secret auth-oidc -n myapp -o jsonpath='{.data.cookie-secret}' | base64 -d
```

**See**: [Developer Guide - Enabling Authentication](../docs/DEVELOPER-GUIDE.md#enabling-authentication-for-applications) for the complete authentication setup guide.

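The "edit tokens" step of token rotation can be sketched as a small helper. This is an illustration under assumptions: `append_token` is a hypothetical function name, and the token list is assumed to hold one lowercase hex token per line, as in the file-based creation method above. Check which separator the auth sidecar actually expects before using it.

```bash
# Hypothetical helper (an assumption, not an existing script):
# print the current token list with one new token appended on its own line.
append_token() {
  current="$1"
  new_token="$2"

  # Tokens are assumed to be lowercase hex strings (as produced by `openssl rand -hex 32`)
  case "$new_token" in
    ''|*[!0-9a-f]*)
      echo "new token must be a non-empty lowercase hex string" >&2
      return 1
      ;;
  esac

  if [ -n "$current" ]; then
    printf '%s\n%s\n' "$current" "$new_token"
  else
    printf '%s\n' "$new_token"
  fi
}

# Example usage:
#   current=$(kubectl get secret auth-tokens -n myapp -o jsonpath='{.data.tokens}' | base64 -d)
#   append_token "$current" "$(openssl rand -hex 32)" > private/tokens.txt
#   kubectl create secret generic auth-tokens \
#     --from-file=tokens=private/tokens.txt \
#     --namespace=myapp \
#     --dry-run=client -o yaml > private/myapp-auth-tokens.yaml
#   # Then seal, commit, and delete the plain files as in "Updating Secrets"
```

Appending first and removing old tokens in a later change lets clients move to the new token without an outage.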
---

## Monitoring & Alerting

### Prometheus Metrics

```bash
# Port forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-server 9090:80

# Access: http://localhost:9090
```

**Common Queries:**

```promql
# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Memory usage per pod
sum(container_memory_usage_bytes) by (pod)

# Request rate per service
rate(http_requests_total[5m])
```

### Grafana Dashboards

```bash
# Port forward to Grafana
kubectl port-forward -n monitoring svc/grafana 3000:80

# Access: http://localhost:3000
```

### Loki Logs

```bash
# Port forward to Loki
kubectl port-forward -n monitoring svc/loki 3100:3100

# Query logs from the last hour
# (start expects a timestamp; since takes a duration relative to now)
curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="myapp"}' \
  --data-urlencode 'since=1h' | jq
```

### Fluent-Bit Log Shipping

Verify Fluent-Bit is shipping logs:

```bash
# Check Fluent-Bit pods
kubectl get pods -n monitoring | grep fluent-bit

# Check logs
kubectl logs -n monitoring daemonset/fluent-bit

# Verify Loki is receiving logs
kubectl logs -n monitoring deployment/loki | grep "POST /loki/api/v1/push"
```

### Trivy Vulnerability Scanning

```bash
# Check Trivy scan results
kubectl get vulnerabilityreports --all-namespaces

# View a specific report
kubectl describe vulnerabilityreport <report-name> -n myapp
```

### Slack Notifications

All applications have Slack notifications enabled:

```yaml
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: ""
    notifications.argoproj.io/subscribe.on-sync-failed.slack: ""
    notifications.argoproj.io/subscribe.on-degraded.slack: ""
```

**Test Notification:**

```bash
# Request a hard refresh to test; a sync (and its notification) follows
# only if drift is detected and auto-sync is enabled
kubectl patch application myapp -n argocd \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
```

---

## Troubleshooting

### Application Won't Sync

#### Check Application Status

```bash
kubectl describe application myapp -n argocd
```

Look for errors in:

- `Status.Conditions`
- `Status.OperationState`

#### Common Issues

**Issue 1: Image Pull Error**

```bash
# Error: ErrImagePull, ImagePullBackOff

# Check if image exists
docker pull ghcr.io/fortedigital/myapp:v1.0.0

# Check image pull secrets
kubectl get secrets -n myapp | grep regcred

# Check pod events
kubectl describe pod -n myapp
```

**Issue 2: Invalid YAML**

```bash
# Error: unable to decode manifest

# Validate YAML locally
kubectl apply --dry-run=client -f apps/myapp.yaml

# Check ArgoCD application controller logs
kubectl logs -n argocd deployment/argocd-application-controller | grep myapp
```

**Issue 3: Resource Quota Exceeded**

```bash
# Error: exceeded quota

# Check namespace quotas
kubectl get resourcequota -n myapp
kubectl describe resourcequota -n myapp

# Increase quota or reduce resource requests
```

### Pod Crashes

#### CrashLoopBackOff

```bash
# Check pod status
kubectl get pods -n myapp

# View logs
kubectl logs -n myapp <pod-name>
kubectl logs -n myapp <pod-name> --previous  # Previous container

# Check events
kubectl describe pod -n myapp
```

**Common Causes:**

- Application error (check logs)
- Missing environment variables
- Wrong port configuration
- Missing secrets
- Insufficient memory/CPU

#### ImagePullBackOff

```bash
# Check image name
kubectl get deployment myapp -n myapp -o yaml | grep image

# Verify credentials
kubectl get secret -n myapp
```

#### Pending

```bash
# Check why pod is pending
kubectl describe pod -n myapp

# Common reasons:
# - Insufficient resources on nodes
# - PVC not bound
# - Node selector doesn't match
```

### Ingress / TLS Issues

#### Application Not Accessible

```bash
# Check IngressRoute
kubectl get ingressroute -n myapp
kubectl describe ingressroute myapp -n myapp

# Check Traefik
kubectl get pods -n traefik
kubectl logs -n traefik deployment/traefik

# Test with port-forward
kubectl port-forward -n myapp service/myapp 8080:3000
curl http://localhost:8080
```

#### Certificate Issues

```bash
# Check certificates
kubectl get certificate -n myapp
kubectl describe certificate myapp-tls -n myapp

# Check cert-manager
kubectl get clusterissuer
kubectl logs -n cert-manager deployment/cert-manager

# Check Let's Encrypt challenges
kubectl get challenges --all-namespaces
```

**Manual Certificate Renewal:**

```bash
# Delete and recreate certificate
kubectl delete certificate myapp-tls -n myapp
# Certificate will be automatically recreated
```

### Database Issues

#### PostgreSQL Won't Start

```bash
# Check StatefulSet
kubectl get statefulset -n myapp
kubectl describe statefulset postgres -n myapp

# Check PVC
kubectl get pvc -n myapp
kubectl describe pvc -n myapp

# Check logs
kubectl logs -n myapp postgres-0
```

#### Data Persistence

```bash
# Verify PVC is bound
kubectl get pvc -n myapp

# Check storage class
kubectl get storageclass

# Resize PVC (if supported)
kubectl edit pvc postgres-data-postgres-0 -n myapp
# Change: storage: 10Gi (from 5Gi)
```

### Kyverno Policy Issues

#### Policy Violations

```bash
# List policies
kubectl get clusterpolicy

# Check policy reports
kubectl get policyreport --all-namespaces

# View specific policy
kubectl describe clusterpolicy secret-cloner
```

#### Secret Not Cloned

```bash
# Check if secret has label
kubectl get secret -n secrets --show-labels

# Check Kyverno logs
kubectl logs -n kyverno deployment/kyverno

# Manually trigger by recreating namespace
kubectl delete ns test-ns
kubectl create ns test-ns
```

### ArgoCD Issues

#### ArgoCD UI Not Accessible

```bash
# Check ArgoCD pods
kubectl get pods -n argocd

# Restart ArgoCD server
kubectl rollout restart deployment argocd-server -n argocd

# Port forward
kubectl port-forward svc/argocd-server -n argocd 8080:443
```

#### Sync Takes Too Long

```bash
# Check application controller logs
kubectl logs -n argocd deployment/argocd-application-controller

# Raise the maximum retry backoff (in apps/myapp.yaml)
spec:
  syncPolicy:
    retry:
      backoff:
        maxDuration: 5m  # Increase from 3m
```

---

## Disaster Recovery

### Backup Strategy

**Current State**: No automated backups

**What Needs Backup**:

- ❌ Cluster state (not backed up - recreate via GitOps)
- ❌ Persistent volumes (currently not critical)
- ✅ Git repositories (GitHub provides backup)
- ⚠️ Secrets (sealed secrets in Git, unseal keys need safekeeping)

### Cluster Rebuild

**Scenario**: Complete cluster failure

```bash
# 1. Provision new Kubernetes cluster

# 2. Configure kubectl
kubectl config use-context new-cluster
kubectl cluster-info

# 3. Bootstrap cluster
cd ~/dev/k8s/launchpad
./bootstrap.sh

# 4. Wait for ArgoCD to sync all applications
kubectl get applications -n argocd -w

# 5. Recreate any unsealed secrets (from password manager)

# 6. Configure DNS for new cluster IPs

# 7. Verify all applications are healthy
```

**Time Estimate**: 30-60 minutes

**Data Loss**:

- Ephemeral data: Lost
- Database data: Lost (no backups currently)
- Configuration: No loss (in Git)

### Future Backup Plan

**Recommended**:

1. **Velero** for cluster backups
   ```bash
   helm install velero vmware-tanzu/velero \
     --namespace velero \
     --create-namespace \
     --set configuration.provider=aws \
     --set configuration.backupStorageLocation[0].bucket=cluster-backups
   ```

2. **PostgreSQL backups** via CronJob
   ```yaml
   # pg-backup-cronjob.yaml
   # Abbreviated: env vars (DB_USER, DB_NAME) and the /backup volume are omitted
   apiVersion: batch/v1
   kind: CronJob
   spec:
     schedule: "0 2 * * *"  # Daily at 2am
     jobTemplate:
       spec:
         template:
           spec:
             restartPolicy: OnFailure
             containers:
               - name: pg-dump
                 image: postgres:16-alpine
                 command:
                   - /bin/sh
                   - -c
                   - pg_dump -U $DB_USER -d $DB_NAME > /backup/dump-$(date +%Y%m%d).sql
   ```

3. **Sealed Secrets private key backup**
   ```bash
   # Backup sealed-secrets controller private key(s)
   # Key secrets have generated names, so select them by label
   kubectl get secret -n kube-system \
     -l sealedsecrets.bitnami.com/sealed-secrets-key \
     -o yaml > sealed-secrets-key-backup.yaml

   # Store in secure location (password manager, vault)
   ```

---

## Maintenance Procedures

### Upgrading ArgoCD

```bash
# Check current version
kubectl get deployment argocd-server -n argocd \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Update version in values
vim infra/values/argocd-values.yaml

# Or upgrade via Helm directly
helm upgrade argocd argo-cd \
  --repo https://argoproj.github.io/argo-helm \
  --namespace argocd \
  --values infra/values/argocd-values.yaml \
  --version 6.0.0  # New version

# Verify
kubectl get pods -n argocd
```

### Upgrading Kubernetes Version

```bash
# UpCloud: Upgrade via control panel or CLI

# After upgrade, verify cluster
kubectl version
kubectl get nodes

# List served API resources and compare against manifests
# that still use deprecated or removed API versions
kubectl api-resources

# Update any deprecated resources in Git
```

### Rotating TLS Certificates

Let's Encrypt certificates auto-renew, but if manual rotation is needed:

```bash
# Delete certificate to force renewal
kubectl delete certificate myapp-tls -n myapp

# Cert-manager will automatically recreate
kubectl get certificate -n myapp -w
```

### Cleaning Up Old Resources

```bash
# List all namespaces
kubectl get namespaces

# Remove unused namespaces
kubectl delete namespace old-app

# Clean up ArgoCD applications
kubectl get applications -n argocd
kubectl delete application old-app -n argocd

# Clean up old Docker images (on nodes)
# SSH to nodes and run:
docker image prune -a --filter "until=720h"  # 30 days
# On nodes that run containerd without Docker, use instead: crictl rmi --prune
```

### DNS Management

**Adding New Subdomain**:

1. Add DNS A record pointing to Traefik LoadBalancer IP
   ```bash
   # Get LoadBalancer IP
   kubectl get svc -n traefik traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
   ```

2. Add to DNS provider:
   ```
   myapp.forteapps.net A <LoadBalancer-IP>
   ```

3. Verify DNS propagation:
   ```bash
   nslookup myapp.forteapps.net
   dig myapp.forteapps.net
   ```

### Monitoring Resource Usage

```bash
# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods --all-namespaces

# Identify resource hogs
kubectl top pods --all-namespaces --sort-by=memory
kubectl top pods --all-namespaces --sort-by=cpu
```

---

## Advanced Operations

### Adding a New Infrastructure Component

Example: Adding Redis

```bash
# 1. Create application manifest
cat > infra/redis-application.yaml < $OUTPUT_FILE

echo "Sealed secret created: $OUTPUT_FILE"
echo "Remember to delete: $SECRET_FILE"
```

---

## Checklist Templates

### New Application Deployment Checklist

- [ ] Application code repository created
- [ ] Dockerfile created and tested
- [ ] GitHub Actions workflow configured
- [ ] Helm values created in `helm-prod-values/`
- [ ] ArgoCD application manifest created in `apps/`
- [ ] Secrets created and sealed
- [ ] DNS record added for domain
- [ ] Application synced successfully
- [ ] Health check passed
- [ ] Slack notification received
- [ ] Application accessible via domain
- [ ] Monitoring configured
- [ ] Documentation updated

### Incident Response Checklist

- [ ] Incident identified (Slack alert, monitoring)
- [ ] Severity assessed
- [ ] Incident channel created
- [ ] Initial investigation (logs, metrics, events)
- [ ] Root cause identified
- [ ] Mitigation applied
- [ ] Verification of fix
- [ ] Post-mortem scheduled
- [ ] Documentation updated

---

**Last Updated**: 2026-03-16
**Maintained By**: Platform Team
**Emergency Contact**: #platform-support on Slack