launchpad/docs/OPERATIONS-RUNBOOK.md
Danijel Simeunovic d02da33700 docs
2026-03-16 11:00:42 +01:00

# Operations Runbook
## Table of Contents
- [Overview](#overview)
- [Cluster Bootstrap](#cluster-bootstrap)
- [Day-to-Day Operations](#day-to-day-operations)
- [Application Management](#application-management)
- [Secret Management](#secret-management)
- [Monitoring & Alerting](#monitoring--alerting)
- [Troubleshooting](#troubleshooting)
- [Disaster Recovery](#disaster-recovery)
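- [Maintenance Procedures](#maintenance-procedures)
- [Advanced Operations](#advanced-operations)
- [Emergency Procedures](#emergency-procedures)
- [Useful Scripts](#useful-scripts)
- [Checklist Templates](#checklist-templates)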
---
## Overview
This runbook provides operational procedures for maintaining the Kubernetes cluster and managing applications. It's intended for platform engineers and operators with full cluster access.
### Operator Prerequisites
- ✅ Full kubectl access to cluster
- ✅ Write access to all Git repositories
- ✅ ArgoCD UI access
- ✅ Slack notifications configured
- ✅ Understanding of Kubernetes concepts
---
## Cluster Bootstrap
### Initial Cluster Setup
Bootstrap a new cluster from scratch:
#### Prerequisites
1. **Kubernetes cluster running** (UpCloud or any K8s cluster)
2. **kubectl configured** with admin access
3. **Repositories cloned** locally
```bash
# Verify cluster access
kubectl cluster-info
kubectl get nodes
```
#### Bootstrap Procedure
```bash
# 1. Clone config repository
git clone https://github.com/snothub/sturdy-adventure.git
cd sturdy-adventure
# 2. Set cluster name (optional)
export CLUSTER_NAME="prod-cluster-01"
# 3. Run bootstrap script
./bootstrap.sh
```
**What Happens:**
1. ✅ Installs ArgoCD via Helm
2. ✅ Configures ArgoCD with custom values
3. ✅ Applies root App-of-Apps manifest
4. ✅ ArgoCD automatically syncs all applications
5. ✅ Infrastructure and apps deploy in waves
#### Verify Bootstrap
```bash
# Wait for ArgoCD to be ready
kubectl wait --for=condition=available --timeout=300s \
deployment/argocd-server -n argocd
# Check ArgoCD applications
kubectl get applications -n argocd
# Expected output: infrastructure-apps, enterprise-apps, and all child apps
```
#### Post-Bootstrap Steps
1. **Configure DNS** for ingress domains:
   - `argocd.127.0.0.1.nip.io` (local dev)
   - `*.forteapps.net` (production)
2. **Verify Let's Encrypt certificates**:
   ```bash
   kubectl get certificate --all-namespaces
   kubectl get clusterissuer
   ```
3. **Check Kyverno policies**:
   ```bash
   kubectl get clusterpolicy
   ```
4. **Verify monitoring stack**:
   ```bash
   kubectl get pods -n monitoring
   ```
5. **Test Slack notifications** by triggering a sync
---
## Day-to-Day Operations
### Monitoring ArgoCD Sync Status
#### Via Slack
All applications send notifications to a shared Slack channel:
- ✅ `on-sync-succeeded` - Deployment succeeded
- ❌ `on-sync-failed` - Deployment failed
- ⚠️ `on-degraded` - Application unhealthy
#### Via CLI
```bash
# List all applications
kubectl get applications -n argocd
# Watch application status
kubectl get applications -n argocd -w
# Get detailed status
kubectl describe application myapp -n argocd
```
#### Via ArgoCD UI
```bash
# Port forward to UI
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Access: https://localhost:8080
# No login required (insecure mode for internal use)
```
### Checking Application Health
```bash
# Quick health check for all apps
kubectl get applications -n argocd \
-o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status
# Expected output:
# NAME SYNC HEALTH
# infrastructure-apps Synced Healthy
# enterprise-apps Synced Healthy
# mcp10x Synced Healthy
# musicman Synced Healthy
```
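To spot problems quickly in a long list, the table above can be filtered down to the applications that are not both `Synced` and `Healthy`. The following is a minimal sketch; the function name `list_unhealthy_apps` is hypothetical, and it assumes the three-column layout (NAME, SYNC, HEALTH) produced by the `custom-columns` command above.

```bash
# Print applications that are not both Synced and Healthy.
# Reads the NAME/SYNC/HEALTH table on stdin and skips the header row.
list_unhealthy_apps() {
  awk 'NR > 1 && ($2 != "Synced" || $3 != "Healthy") { print $1 ": " $2 "/" $3 }'
}
```

Pipe the output of the `kubectl get applications ... -o custom-columns=...` command into `list_unhealthy_apps`; no output means every application is in the expected state.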
### Manual Sync
Force sync an application:
```bash
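# Trigger a hard refresh: ArgoCD re-reads Git and re-renders the manifests.
# With auto-sync enabled, any drift it finds is then synced automatically.
kubectl patch application myapp -n argocd \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
# Or start a sync explicitly via the ArgoCD CLI (if installed)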
argocd app sync myapp
```
### Pausing Auto-Sync
Temporarily disable automatic syncing:
```bash
# Edit application
kubectl edit application myapp -n argocd
# In the editor, set automated to null to disable auto-sync:
#   spec:
#     syncPolicy:
#       automated: null
# To re-enable later, restore:
#   spec:
#     syncPolicy:
#       automated:
#         prune: true
#         selfHeal: true
```
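The same change can be made without opening an editor. The sketch below wraps it in two hypothetical helper functions, `pause_autosync` and `resume_autosync`; it relies on the fact that a JSON merge patch with a `null` value removes the key. It assumes the Application lives in the `argocd` namespace, as elsewhere in this runbook. If the Application manifest is itself managed by the root App-of-Apps with `selfHeal`, the parent may restore the Git version, so treat this as a short-lived override.

```bash
# Disable automated sync: a merge patch with null removes the "automated" key.
pause_autosync() {
  kubectl patch application "$1" -n argocd --type merge \
    -p '{"spec":{"syncPolicy":{"automated":null}}}'
}

# Re-enable automated sync with prune and selfHeal turned on.
resume_autosync() {
  kubectl patch application "$1" -n argocd --type merge \
    -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
}
```

For example, `pause_autosync myapp` before a manual investigation and `resume_autosync myapp` afterwards.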
---
## Application Management
### Deploying a New Application
See [Developer Guide](DEVELOPER-GUIDE.md#deploying-your-first-application) for detailed steps.
**Quick checklist:**
- [ ] Create `helm-values/myapp/values.yaml`
- [ ] Create `apps/myapp.yaml` in config repo
- [ ] Create SealedSecret if needed
- [ ] Commit and push changes
- [ ] Verify sync in Slack/ArgoCD
- [ ] Configure DNS for domain
- [ ] Test application accessibility
### Removing an Application
#### Safe Removal Procedure
```bash
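# 1. Delete ArgoCD Application (with cascade)
kubectl delete application myapp -n argocd
# If the Application carries the resources-finalizer.argocd.argoproj.io
# finalizer, this will:
# - Remove application from ArgoCD
# - Delete all Kubernetes resources it manages (cascade)
# Without that finalizer the resources keep running. A namespace created via
# CreateNamespace=true is not removed; delete it separately if it is unused.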
# 2. Clean up Git repositories
cd ~/dev/k8s/launchpad
git rm apps/myapp.yaml
git commit -m "Remove myapp application"
git push
cd ~/dev/k8s/helm-prod-values
git rm -r myapp/
git commit -m "Remove myapp values"
git push
# 3. Remove sealed secrets (if any)
cd ~/dev/k8s/launchpad
git rm secrets/myapp-credentials-sealed.yaml
git commit -m "Remove myapp secrets"
git push
```
#### Removal Without Cascade
To remove from ArgoCD but keep resources running:
```bash
# Delete application with no cascade
kubectl patch application myapp -n argocd \
-p '{"metadata":{"finalizers":[]}}' --type merge
kubectl delete application myapp -n argocd
# Resources remain in cluster but are no longer managed
```
### Scaling Applications
#### Manual Scaling
```bash
# Scale deployment directly
kubectl scale deployment myapp -n myapp --replicas=3
# Note: If selfHeal is enabled, this will be reverted
```
#### GitOps Scaling
Update `helm-values/myapp/values.yaml`:
```yaml
app:
  replicaCount: 3 # Change from 1 to 3
```
Commit and push - ArgoCD will sync.
#### Auto-Scaling (HPA)
Enable Horizontal Pod Autoscaler:
```yaml
# In helm-values/myapp/values.yaml
app:
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
```
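**Note:** Keep `/spec/replicas` in the ArgoCD ignore list when using HPA, so that selfHeal does not revert the replica count the autoscaler sets:
```yaml
# In apps/myapp.yaml
ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /spec/replicas # Keep this entry while HPA manages replicas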
```
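To get a feel for how the values above interact, the core HPA calculation documented by Kubernetes is `desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)`, clamped to the configured minimum and maximum. The sketch below reproduces only that formula in integer shell arithmetic; the function name `hpa_desired_replicas` is hypothetical, and the real controller also applies a tolerance band and stabilization windows that are ignored here.

```bash
# Usage: hpa_desired_replicas CURRENT UTIL_PERCENT TARGET_PERCENT MIN MAX
# Computes ceil(CURRENT * UTIL / TARGET) and clamps the result to [MIN, MAX].
hpa_desired_replicas() {
  current=$1 util=$2 target=$3 min=$4 max=$5
  # Integer ceiling division: (a + b - 1) / b
  desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}
```

With the settings shown above, `hpa_desired_replicas 2 140 70 2 10` models two pods running at 140% of requested CPU against a 70% target.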
### Rolling Back Deployments
#### Option 1: Git Revert
```bash
# Find the commit that introduced the bad change
cd ~/dev/k8s/helm-prod-values
git log --oneline myapp/values.yaml
# Revert that commit (creates a new commit that undoes it)
git revert <commit-hash>
git push
# ArgoCD will sync the rollback
```
#### Option 2: Manual Rollback
```bash
# Rollback to previous revision
kubectl rollout undo deployment myapp -n myapp
# Note: This will be reverted by ArgoCD selfHeal
# Make permanent by updating Git
```
#### Option 3: Change Image Tag
```bash
# Edit helm-values
cd ~/dev/k8s/helm-prod-values
vim myapp/values.yaml
# Change image tag to previous version:
#   app:
#     image:
#       tag: v1.0.0 # Roll back from v1.0.1
# Commit and push
git add myapp/values.yaml
git commit -m "Rollback myapp to v1.0.0"
git push
```
### Resource Updates
#### Update Resource Limits
```yaml
# In helm-values/myapp/values.yaml
app:
  resources:
    requests:
      cpu: 200m # Increased from 100m
      memory: 512Mi # Increased from 256Mi
    limits:
      cpu: 1000m
      memory: 2Gi
```
#### Enable Database
```yaml
# In helm-values/myapp/values.yaml
db:
  enabled: true
  persistence:
    size: 10Gi # Increase storage
```
---
## Secret Management
### Creating Secrets
#### Step 1: Get Public Certificate
```bash
# Fetch sealed-secrets public cert (one-time)
kubeseal --fetch-cert \
--controller-name=sealed-secrets-controller \
--controller-namespace=kube-system \
> pub-cert.pem
# Save this certificate for future use
```
#### Step 2: Create Plain Secret
```bash
# Method 1: From literal values
kubectl create secret generic myapp-credentials \
--from-literal=API_KEY=secret123 \
--from-literal=DB_PASSWORD=pass456 \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
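# Method 2: From file (the whole file is stored under one key named after it;
# use --from-env-file=.env instead to create one key per KEY=VALUE line)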
kubectl create secret generic myapp-credentials \
--from-file=.env \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
# Method 3: From multiple files
kubectl create secret generic myapp-credentials \
--from-file=api-key.txt \
--from-file=db-password.txt \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
```
#### Step 3: Seal Secret
```bash
kubeseal --format=yaml \
--cert=pub-cert.pem \
--namespace=myapp \
< private/myapp-credentials.yaml \
> secrets/myapp-credentials-sealed.yaml
```
#### Step 4: Commit Sealed Secret
```bash
git add secrets/myapp-credentials-sealed.yaml
git commit -m "Add myapp credentials"
git push
# Delete plain secret
rm private/myapp-credentials.yaml
```
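Because the plain and sealed files sit side by side until the last step, it is easy to stage the wrong one. A small guard can check the `kind` before committing. This is a minimal sketch with a hypothetical function name, `is_sealed_secret`; it only inspects top-level `kind:` lines and assumes one manifest style per file.

```bash
# Succeeds only if the file declares kind: SealedSecret and no plain Secret.
is_sealed_secret() {
  file=$1
  # Reject the file if any top-level document is a plain Secret.
  if grep -Eq '^kind:[[:space:]]*Secret[[:space:]]*$' "$file"; then
    return 1
  fi
  grep -Eq '^kind:[[:space:]]*SealedSecret[[:space:]]*$' "$file"
}
```

For example, `is_sealed_secret secrets/myapp-credentials-sealed.yaml && git add secrets/myapp-credentials-sealed.yaml` stages the file only when the check passes.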
### Updating Secrets
```bash
# 1. Create new version
kubectl create secret generic myapp-credentials \
--from-literal=API_KEY=new-secret-key \
--from-literal=DB_PASSWORD=new-password \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
# 2. Seal it
kubeseal --format=yaml \
--cert=pub-cert.pem \
--namespace=myapp \
< private/myapp-credentials.yaml \
> secrets/myapp-credentials-sealed.yaml
# 3. Commit
git add secrets/myapp-credentials-sealed.yaml
git commit -m "Update myapp credentials"
git push
# 4. After ArgoCD has synced the new SealedSecret, restart pods to pick it up
kubectl rollout restart deployment myapp -n myapp
# 5. Delete plain secret
rm private/myapp-credentials.yaml
```
### Viewing Secrets (Unsealed)
```bash
# List secrets in namespace
kubectl get secrets -n myapp
# Describe secret (doesn't show values)
kubectl describe secret myapp-credentials -n myapp
# View secret values (base64 encoded)
kubectl get secret myapp-credentials -n myapp -o yaml
# Decode secret value
kubectl get secret myapp-credentials -n myapp \
-o jsonpath='{.data.API_KEY}' | base64 -d
```
### Secret Cloning (Kyverno)
Secrets labeled `allowedToBeCloned: "true"` in the `secrets` namespace are automatically cloned to new namespaces.
```yaml
# Example: secrets-namespace.yaml
apiVersion: v1
kind: Secret
metadata:
  name: shared-credentials
  namespace: secrets
  labels:
    allowedToBeCloned: "true"
type: Opaque
data:
  API_KEY: <base64-encoded-value>
```
When a new namespace is created, Kyverno automatically copies this secret.
---
## Monitoring & Alerting
### Prometheus Metrics
```bash
# Port forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Access: http://localhost:9090
```
**Common Queries:**
```promql
# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
# Memory usage per pod
sum(container_memory_usage_bytes) by (pod)
# Request rate per service
rate(http_requests_total[5m])
```
### Grafana Dashboards
```bash
# Port forward to Grafana
kubectl port-forward -n monitoring svc/grafana 3000:80
# Access: http://localhost:3000
```
### Loki Logs
```bash
# Port forward to Loki
kubectl port-forward -n monitoring svc/loki 3100:3100
# Query logs from the last hour (start is a Unix epoch in nanoseconds)
curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="myapp"}' \
  --data-urlencode "start=$(( $(date +%s) - 3600 ))000000000" | jq
```
### Fluent-Bit Log Shipping
Verify Fluent-Bit is shipping logs:
```bash
# Check Fluent-Bit pods
kubectl get pods -n monitoring | grep fluent-bit
# Check logs
kubectl logs -n monitoring daemonset/fluent-bit
# Verify Loki is receiving logs
kubectl logs -n monitoring deployment/loki | grep "POST /loki/api/v1/push"
```
### Trivy Vulnerability Scanning
```bash
# Check Trivy scan results
kubectl get vulnerabilityreports --all-namespaces
# View report for specific pod
kubectl describe vulnerabilityreport -n myapp <report-name>
```
### Slack Notifications
All applications have Slack notifications enabled:
```yaml
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: ""
    notifications.argoproj.io/subscribe.on-sync-failed.slack: ""
    notifications.argoproj.io/subscribe.on-degraded.slack: ""
```
**Test Notification:**
```bash
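# Trigger a hard refresh to test; a notification follows only if the refresh
# leads to a sync (otherwise run: argocd app sync myapp)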
kubectl patch application myapp -n argocd \
--type merge \
-p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
```
---
## Troubleshooting
### Application Won't Sync
#### Check Application Status
```bash
kubectl describe application myapp -n argocd
```
Look for errors in:
- `Status.Conditions`
- `Status.OperationState`
#### Common Issues
**Issue 1: Image Pull Error**
```bash
# Error: ErrImagePull, ImagePullBackOff
# Check if image exists
docker pull ghcr.io/fortedigital/myapp:v1.0.0
# Check image pull secrets
kubectl get secrets -n myapp | grep regcred
# Check pod events
kubectl describe pod -n myapp <pod-name>
```
**Issue 2: Invalid YAML**
```bash
# Error: unable to decode manifest
# Validate YAML locally
kubectl apply --dry-run=client -f apps/myapp.yaml
# Check ArgoCD application controller logs
# (the controller runs as a StatefulSet in current ArgoCD releases)
kubectl logs -n argocd statefulset/argocd-application-controller | grep myapp
```
**Issue 3: Resource Quota Exceeded**
```bash
# Error: exceeded quota
# Check namespace quotas
kubectl get resourcequota -n myapp
kubectl describe resourcequota -n myapp
# Increase quota or reduce resource requests
```
### Pod Crashes
#### CrashLoopBackOff
```bash
# Check pod status
kubectl get pods -n myapp
# View logs
kubectl logs -n myapp <pod-name>
kubectl logs -n myapp <pod-name> --previous # Previous container
# Check events
kubectl describe pod -n myapp <pod-name>
```
**Common Causes:**
- Application error (check logs)
- Missing environment variables
- Wrong port configuration
- Missing secrets
- Insufficient memory/CPU
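
The container's last exit code often narrows down which of these causes applies. The helper below is a rough sketch, not an exhaustive mapping; the function name `explain_exit_code` is hypothetical. It relies on the shell convention that exit codes above 128 mean "terminated by signal (code - 128)".

```bash
# Map a container exit code to a likely cause.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly; check the container command and restart policy" ;;
    1)   echo "application error; check the logs" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found; check the image entrypoint" ;;
    137) echo "SIGKILL (128+9); often OOMKilled, check memory limits" ;;
    143) echo "SIGTERM (128+15); container was asked to stop" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "terminated by signal $(( code - 128 ))"
      else
        echo "application-specific exit code $code"
      fi
      ;;
  esac
}
```

The exit code of the previous container run is available from `kubectl get pod <pod-name> -n myapp -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'`.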
#### ImagePullBackOff
```bash
# Check image name
kubectl get deployment myapp -n myapp -o yaml | grep image
# Verify credentials
kubectl get secret -n myapp
```
#### Pending
```bash
# Check why pod is pending
kubectl describe pod -n myapp <pod-name>
# Common reasons:
# - Insufficient resources on nodes
# - PVC not bound
# - Node selector doesn't match
```
### Ingress / TLS Issues
#### Application Not Accessible
```bash
# Check IngressRoute
kubectl get ingressroute -n myapp
kubectl describe ingressroute myapp -n myapp
# Check Traefik
kubectl get pods -n traefik
kubectl logs -n traefik deployment/traefik
# Test with port-forward
kubectl port-forward -n myapp service/myapp 8080:3000
curl http://localhost:8080
```
#### Certificate Issues
```bash
# Check certificates
kubectl get certificate -n myapp
kubectl describe certificate myapp-tls -n myapp
# Check cert-manager
kubectl get clusterissuer
kubectl logs -n cert-manager deployment/cert-manager
# Check Let's Encrypt challenges
kubectl get challenges --all-namespaces
```
**Manual Certificate Renewal:**
```bash
# Delete and recreate certificate
kubectl delete certificate myapp-tls -n myapp
# Certificate will be automatically recreated
```
### Database Issues
#### PostgreSQL Won't Start
```bash
# Check StatefulSet
kubectl get statefulset -n myapp
kubectl describe statefulset postgres -n myapp
# Check PVC
kubectl get pvc -n myapp
kubectl describe pvc -n myapp
# Check logs
kubectl logs -n myapp postgres-0
```
#### Data Persistence
```bash
# Verify PVC is bound
kubectl get pvc -n myapp
# Check storage class
kubectl get storageclass
# Resize PVC (if supported)
kubectl edit pvc postgres-data-postgres-0 -n myapp
# Change: storage: 10Gi (from 5Gi)
```
### Kyverno Policy Issues
#### Policy Violations
```bash
# List policies
kubectl get clusterpolicy
# Check policy reports
kubectl get policyreport --all-namespaces
# View specific policy
kubectl describe clusterpolicy secret-cloner
```
#### Secret Not Cloned
```bash
# Check if secret has label
kubectl get secret -n secrets --show-labels
# Check Kyverno logs
kubectl logs -n kyverno deployment/kyverno
# Manually trigger by recreating namespace
kubectl delete ns test-ns
kubectl create ns test-ns
```
### ArgoCD Issues
#### ArgoCD UI Not Accessible
```bash
# Check ArgoCD pods
kubectl get pods -n argocd
# Restart ArgoCD server
kubectl rollout restart deployment argocd-server -n argocd
# Port forward
kubectl port-forward svc/argocd-server -n argocd 8080:443
```
#### Sync Takes Too Long
```bash
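# Check application controller logs
# (the controller runs as a StatefulSet in current ArgoCD releases)
kubectl logs -n argocd statefulset/argocd-application-controller
# Allow sync retries to back off for longer (in apps/myapp.yaml):
#   spec:
#     syncPolicy:
#       retry:
#         backoff:
#           maxDuration: 5m # Increase from 3m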
```
---
## Disaster Recovery
### Backup Strategy
**Current State**: No automated backups
**What Needs Backup**:
- ❌ Cluster state (not backed up - recreate via GitOps)
- ❌ Persistent volumes (currently not critical)
- ✅ Git repositories (GitHub provides backup)
- ⚠️ Secrets (sealed secrets in Git, unseal keys need safekeeping)
### Cluster Rebuild
**Scenario**: Complete cluster failure
```bash
# 1. Provision new Kubernetes cluster
# 2. Configure kubectl
kubectl config use-context new-cluster
kubectl cluster-info
# 3. Bootstrap cluster
cd ~/dev/k8s/launchpad
./bootstrap.sh
# 4. Wait for ArgoCD to sync all applications
kubectl get applications -n argocd -w
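# 5. Restore the sealed-secrets private key from its backup (a new controller
#    generates a new key and cannot decrypt the SealedSecrets already in Git),
#    then recreate any unsealed secrets (from password manager)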
# 6. Configure DNS for new cluster IPs
# 7. Verify all applications are healthy
```
**Time Estimate**: 30-60 minutes
**Data Loss**:
- Ephemeral data: Lost
- Database data: Lost (no backups currently)
- Configuration: No loss (in Git)
### Future Backup Plan
**Recommended**:
1. **Velero** for cluster backups
```bash
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --set configuration.provider=aws \
  --set 'configuration.backupStorageLocation[0].bucket=cluster-backups'
```
2. **PostgreSQL backups** via CronJob
```yaml
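# pg-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup # hypothetical name
spec:
  schedule: "0 2 * * *" # Daily at 2am
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure # Job pods must set Never or OnFailure
          containers:
            - name: pg-dump
              image: postgres:16-alpine
              # DB_USER, DB_NAME, credentials, and a volume mounted at
              # /backup still need to be added before this can run
              command:
                - /bin/sh
                - -c
                - pg_dump -U $DB_USER -d $DB_NAME > /backup/dump-$(date +%Y%m%d).sql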
```
3. **Sealed Secrets private key backup**
```bash
# Backup sealed-secrets controller private key(s); the key Secrets have
# generated names, so select them by label
kubectl get secret -n kube-system \
  -l sealedsecrets.bitnami.com/sealed-secrets-key \
  -o yaml > sealed-secrets-key-backup.yaml
# Store in secure location (password manager, vault)
```
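
A key backup is only useful if it can be restored during a cluster rebuild. The sketch below shows one way that could look: apply the backed-up key Secret, then restart the controller so it loads the key. The function name `restore_sealed_secrets_key` is hypothetical, and the Deployment name and namespace are assumptions taken from the `--controller-name` and `--controller-namespace` values used with `kubeseal` earlier in this runbook.

```bash
# Restore a sealed-secrets key backup and make the controller pick it up.
restore_sealed_secrets_key() {
  backup_file=$1
  kubectl apply -f "$backup_file" || return 1
  # Restart so the controller loads the restored key (assumed Deployment name).
  kubectl rollout restart deployment/sealed-secrets-controller -n kube-system
}
```

Run it as `restore_sealed_secrets_key sealed-secrets-key-backup.yaml` once the controller is installed on the new cluster; SealedSecrets that failed to decrypt should then be retried by the controller.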
---
## Maintenance Procedures
### Upgrading ArgoCD
```bash
# Check current version
kubectl get deployment argocd-server -n argocd \
-o jsonpath='{.spec.template.spec.containers[0].image}'
# Update version in values
vim infra/values/argocd-values.yaml
# Or upgrade via Helm directly
helm upgrade argocd argo-cd \
--repo https://argoproj.github.io/argo-helm \
--namespace argocd \
--values infra/values/argocd-values.yaml \
--version 6.0.0 # New version
# Verify
kubectl get pods -n argocd
```
### Upgrading Kubernetes Version
```bash
# UpCloud: Upgrade via control panel or CLI
# After upgrade, verify cluster
kubectl version
kubectl get nodes
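# List the API resources and versions the cluster serves
# (this does not flag deprecations by itself)
kubectl api-resources
# Compare with the apiVersion values used in Git and
# update any deprecated or removed resources there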
```
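Since the manifests live in Git, they can be scanned for removed API versions before the upgrade rather than after it. This is a minimal sketch with a hypothetical function name, `find_removed_apis`; the list covers well-known removals up to Kubernetes 1.26 (for example `batch/v1beta1` and `policy/v1beta1` in 1.25) and is not exhaustive, so check the upstream deprecation guide for the target version.

```bash
# Print file:line:match for manifests that use API versions removed by 1.26.
find_removed_apis() {
  dir=$1
  # /dev/null forces grep to print file names even for a single match file.
  find "$dir" -type f \( -name '*.yaml' -o -name '*.yml' \) -exec \
    grep -nE '^[[:space:]]*apiVersion:[[:space:]]*(extensions/v1beta1|apps/v1beta[12]|networking\.k8s\.io/v1beta1|policy/v1beta1|batch/v1beta1|autoscaling/v2beta[12])[[:space:]]*$' \
    /dev/null {} +
}
```

For example, `find_removed_apis ~/dev/k8s/launchpad` scans the config repository. Helm charts render their manifests at sync time, so chart templates need a separate check.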
### Rotating TLS Certificates
Let's Encrypt certificates auto-renew, but if manual rotation is needed:
```bash
# Delete certificate to force renewal
kubectl delete certificate myapp-tls -n myapp
# Cert-manager will automatically recreate
kubectl get certificate -n myapp -w
```
### Cleaning Up Old Resources
```bash
# List all namespaces
kubectl get namespaces
# Remove unused namespaces
kubectl delete namespace old-app
# Clean up ArgoCD applications
kubectl get applications -n argocd
kubectl delete application old-app -n argocd
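# Clean up old container images (on nodes)
# The kubelet garbage-collects unused images on its own; to prune manually,
# SSH to a node and run the command that matches its container runtime:
docker image prune -a --filter "until=720h" # Docker runtime, 30 days
crictl rmi --prune # containerd or CRI-O runtime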
```
### DNS Management
**Adding New Subdomain**:
1. Add DNS A record pointing to Traefik LoadBalancer IP
```bash
# Get LoadBalancer IP
kubectl get svc -n traefik traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```
2. Add to DNS provider:
```
myapp.forteapps.net A <LoadBalancer-IP>
```
3. Verify DNS propagation:
```bash
nslookup myapp.forteapps.net
dig myapp.forteapps.net
```
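
The first and last steps can be tied together by checking that the name resolves to the LoadBalancer IP rather than eyeballing the output. A minimal sketch, assuming `dig` is installed; the function name `dns_points_to` is hypothetical.

```bash
# Succeeds if HOST has an A record that is exactly EXPECTED_IP.
dns_points_to() {
  host=$1 expected=$2
  # dig +short prints one answer per line; -Fx matches a whole line literally.
  dig +short "$host" A | grep -Fxq "$expected"
}
```

For example, `dns_points_to myapp.forteapps.net "$LB_IP" && echo "DNS ready"`, where `LB_IP` holds the address returned by the `kubectl get svc` command in step 1.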
### Monitoring Resource Usage
```bash
# Node resource usage
kubectl top nodes
# Pod resource usage
kubectl top pods --all-namespaces
# Identify resource hogs
kubectl top pods --all-namespaces --sort-by=memory
kubectl top pods --all-namespaces --sort-by=cpu
```
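To turn the listing into a quick check, the output can be filtered for pods above a memory threshold. The following is a sketch only; the function name `pods_over_memory` is hypothetical, and it assumes the four-column `--all-namespaces` layout (NAMESPACE, NAME, CPU, MEMORY) with memory reported in `Mi`.

```bash
# Usage: kubectl top pods --all-namespaces | pods_over_memory LIMIT_MI
# Prints "namespace/pod memory" for pods using more than LIMIT_MI mebibytes.
pods_over_memory() {
  limit_mi=$1
  awk -v limit="$limit_mi" '
    NR > 1 && $4 ~ /Mi$/ {
      mem = $4
      sub(/Mi$/, "", mem)   # strip the unit, leaving the number
      if (mem + 0 > limit) print $1 "/" $2 " " $4
    }'
}
```

For example, `kubectl top pods --all-namespaces | pods_over_memory 512` lists pods using more than 512Mi.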
---
## Advanced Operations
### Adding a New Infrastructure Component
Example: Adding Redis
```bash
# 1. Create application manifest
cat > infra/redis-application.yaml <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: redis
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  project: default
  source:
    repoURL: https://charts.bitnami.com/bitnami
    chart: redis
    targetRevision: 18.0.0
    helm:
      values: |
        auth:
          enabled: true
          password: changeme # placeholder; never commit a real password
  destination:
    server: https://kubernetes.default.svc
    namespace: redis
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
EOF
# 2. Commit and push
git add infra/redis-application.yaml
git commit -m "Add Redis infrastructure component"
git push
# 3. ArgoCD will auto-sync within 60 seconds
```
### Multi-Cluster Setup (Future)
For multi-cluster deployments:
```yaml
# Different destinations per environment
# dev-cluster
destination:
  server: https://dev.k8s.example.com
  namespace: myapp
# prod-cluster
destination:
  server: https://prod.k8s.example.com
  namespace: myapp
```
### Blue-Green Deployments
```bash
# Deploy blue version
helm install myapp-blue forteapp \
--set app.image.tag=v1.0.0
# Deploy green version
helm install myapp-green forteapp \
--set app.image.tag=v2.0.0
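# Switch traffic via IngressRoute. A JSON patch changes only the service name
# of the first route; a merge patch would replace the whole routes list and
# drop its match and kind fields.
kubectl patch ingressroute myapp -n myapp --type json \
  -p '[{"op":"replace","path":"/spec/routes/0/services/0/name","value":"myapp-green"}]'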
# Remove blue deployment after validation
helm uninstall myapp-blue
```
---
## Emergency Procedures
### Emergency Rollback
```bash
# Immediate rollback
kubectl rollout undo deployment myapp -n myapp
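# Update Git to make permanent (selfHeal otherwise reverts the undo);
# git revert HEAD assumes the bad change is the most recent commit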
cd ~/dev/k8s/helm-prod-values
git revert HEAD
git push
```
### Emergency Scale Down
```bash
# Scale to zero (maintenance mode)
kubectl scale deployment myapp -n myapp --replicas=0
# Update Git
vim helm-values/myapp/values.yaml
# Set replicaCount: 0
git commit -am "Scale down myapp for maintenance"
git push
```
### Emergency Application Removal
```bash
# Remove application but keep data
kubectl patch application myapp -n argocd \
-p '{"metadata":{"finalizers":[]}}' --type merge
kubectl delete application myapp -n argocd
# Resources remain in cluster
```
---
## Useful Scripts
### Sync All Applications
```bash
#!/bin/bash
# sync-all.sh
# Hard-refreshes every application; auto-sync then applies any drift found
for app in $(kubectl get applications -n argocd -o name); do
  kubectl patch "$app" -n argocd \
    --type merge \
    -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
done
```
### Check All Applications Health
```bash
#!/bin/bash
# health-check.sh
kubectl get applications -n argocd \
-o custom-columns=\
NAME:.metadata.name,\
SYNC:.status.sync.status,\
HEALTH:.status.health.status,\
MESSAGE:.status.health.message
```
### Seal Secret Helper
```bash
#!/bin/bash
# seal-secret.sh
set -euo pipefail # Stop on errors so a failed seal is not reported as success
NAMESPACE=${1:-default}
SECRET_FILE=${2:-private/secret.yaml}
OUTPUT_FILE=${3:-secrets/secret-sealed.yaml}
kubeseal --format=yaml \
  --cert=pub-cert.pem \
  --namespace="$NAMESPACE" \
  < "$SECRET_FILE" \
  > "$OUTPUT_FILE"
echo "Sealed secret created: $OUTPUT_FILE"
echo "Remember to delete: $SECRET_FILE"
```
---
## Checklist Templates
### New Application Deployment Checklist
- [ ] Application code repository created
- [ ] Dockerfile created and tested
- [ ] GitHub Actions workflow configured
- [ ] Helm values created in `helm-prod-values/`
- [ ] ArgoCD application manifest created in `apps/`
- [ ] Secrets created and sealed
- [ ] DNS record added for domain
- [ ] Application synced successfully
- [ ] Health check passed
- [ ] Slack notification received
- [ ] Application accessible via domain
- [ ] Monitoring configured
- [ ] Documentation updated
### Incident Response Checklist
- [ ] Incident identified (Slack alert, monitoring)
- [ ] Severity assessed
- [ ] Incident channel created
- [ ] Initial investigation (logs, metrics, events)
- [ ] Root cause identified
- [ ] Mitigation applied
- [ ] Verification of fix
- [ ] Post-mortem scheduled
- [ ] Documentation updated
---
**Last Updated**: 2026-03-16
**Maintained By**: Platform Team
**Emergency Contact**: #platform-support on Slack