launchpad/docs/OPERATIONS-RUNBOOK.md
Danijel Simeunovic 03a0d7c9ae
feature/multicluster
Co-authored-by: Danijel Simeunovic <danijel.simeunovic@trumf.no>
Reviewed-on: #4
Reviewed-by: gitea_admin <admin@forteapps.net>
2026-04-18 18:14:00 +00:00


# Operations Runbook
## Table of Contents
- [Overview](#overview)
- [Cluster Bootstrap](#cluster-bootstrap)
  - [Initial Cluster Setup](#initial-cluster-setup)
  - [ArgoCD Repository Access Setup](#argocd-repository-access-setup)
- [Day-to-Day Operations](#day-to-day-operations)
- [Application Management](#application-management)
- [Secret Management](#secret-management)
- [Monitoring & Alerting](#monitoring--alerting)
- [Troubleshooting](#troubleshooting)
- [Disaster Recovery](#disaster-recovery)
- [Maintenance Procedures](#maintenance-procedures)
- [Advanced Operations](#advanced-operations)
---
## Overview
This runbook provides operational procedures for maintaining the Kubernetes cluster and managing applications. It's intended for platform engineers and operators with full cluster access.
### Operator Prerequisites
- ✅ Full kubectl access to cluster
- ✅ Write access to all Git repositories
- ✅ ArgoCD UI access
- ✅ Slack notifications configured
- ✅ Understanding of Kubernetes concepts
---
## Cluster Bootstrap
### Initial Cluster Setup
Bootstrap a new cluster from scratch:
#### Prerequisites
1. **Kubernetes cluster running** (UpCloud or any K8s cluster)
2. **kubectl configured** with admin access
3. **Repositories cloned** locally
```bash
# Verify cluster access
kubectl cluster-info
kubectl get nodes
```
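Before running the bootstrap, it can help to confirm the required CLIs are on the `PATH`. The following is a minimal sketch; `require_cmds` is a hypothetical helper, and the tool list (`kubectl`, `helm`, `git`) is an assumption based on the steps in this section, not on the contents of `bootstrap.sh`.

```bash
# Hypothetical helper: report every required CLI that is missing from PATH.
# Returns 0 when all commands are found, 1 otherwise.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (tool list is an assumption):
#   require_cmds kubectl helm git && ./bootstrap.sh
```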
#### Bootstrap Procedure
```bash
# 1. Clone config repository
git clone https://git.forteapps.net/Forte/launchpad
cd launchpad
# 2. Set cluster name (optional)
export CLUSTER_NAME="prod-cluster-01"
# 3. Run bootstrap script
./bootstrap.sh
```
**What Happens:**
1. ✅ Installs ArgoCD via Helm
2. ✅ Configures ArgoCD with custom values
3. ✅ Applies root App-of-Apps manifest
4. ✅ ArgoCD automatically syncs all applications
5. ✅ Infrastructure and apps deploy in waves
#### Verify Bootstrap
```bash
# Wait for ArgoCD to be ready
kubectl wait --for=condition=available --timeout=300s \
deployment/argocd-server -n argocd
# Check ArgoCD applications
kubectl get applications -n argocd
# Expected output: infrastructure-apps, enterprise-apps, and all child apps
```
#### Post-Bootstrap Steps
1. **Configure DNS** for ingress domains:
- `argocd.127.0.0.1.nip.io` (local dev)
- `*.forteapps.net` (production)
2. **Verify Let's Encrypt certificates**:
```bash
kubectl get certificate --all-namespaces
kubectl get clusterissuer
```
3. **Check Kyverno policies**:
```bash
kubectl get clusterpolicy
```
4. **Verify monitoring stack**:
```bash
kubectl get pods -n monitoring
```
5. **Test Slack notifications** by triggering a sync
### ArgoCD Repository Access Setup
ArgoCD needs SSH access to private Git repositories to pull manifests and Helm values. This section covers setting up deploy keys for GitHub repositories.
#### Why Deploy Keys?
- **Read-only access**: Deploy keys provide secure, read-only access to repositories
- **No user credentials**: No need to share personal SSH keys or tokens
- **Repository-specific**: Each repository gets its own key for better security
- **Revocable**: Easy to revoke access without affecting other repositories
#### Prerequisites
- kubectl access to the cluster
- Write access to the GitHub repository
- ArgoCD installed and running
#### Setup Procedure
**Step 1: Generate SSH Key Pair**
Generate a dedicated SSH key for ArgoCD without a passphrase (required for automated access):
```bash
# Generate ED25519 key (recommended - smaller and more secure)
ssh-keygen -t ed25519 -C "argocd-deploy-key-launchpad" -f argocd-deploy-key -N ""
# Or RSA key if ED25519 is not supported
ssh-keygen -t rsa -b 4096 -C "argocd-deploy-key-launchpad" -f argocd-deploy-key -N ""
```
This creates two files:
- `argocd-deploy-key` - Private key (keep secret)
- `argocd-deploy-key.pub` - Public key (add to GitHub)
**Step 2: Add Public Key to GitHub**
1. Copy the public key:
```bash
cat argocd-deploy-key.pub
```
2. Go to GitHub repository settings:
- Navigate to: `https://git.forteapps.net/Forte/launchpad/settings/keys`
- Or: Repository → Settings → Deploy keys
3. Click **"Add deploy key"**
- Title: `ArgoCD Production Cluster`
- Key: Paste the public key content
- ☐ Allow write access (leave unchecked - read-only is sufficient)
- Click **"Add key"**
4. Repeat for the `helm-values` repository if it's private:
```bash
# Generate separate key for helm-values repo
ssh-keygen -t ed25519 -C "argocd-deploy-key-helm-values" -f argocd-helm-values-key -N ""
# Add to: https://github.com/fortedigital/helm-values/settings/keys
```
**Step 3: Create Kubernetes Secret**
Add the private key to ArgoCD as a repository secret:
Save the following manifest as `private/secret.yaml` (the `private/` folder is gitignored):
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: forte-helm-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: ssh://git@git.forteapps.net:2222/Forte/forte-helm.git
  sshPrivateKey: |
    <paste your private key here>
  project: default
```
Seal the secret with `kubeseal`:
```bash
kubeseal --format=yaml \
--namespace=argocd \
< private/secret.yaml \
> secrets/forte-helm-repo-secret-sealed.yaml
```
**Step 4: Register Repository in ArgoCD**
Commit `secrets/forte-helm-repo-secret-sealed.yaml` and push. ArgoCD syncs the SealedSecret, and the Sealed Secrets controller creates the repository Secret from it.
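Because this step commits a file derived from a private key, a quick check before `git add` may be worthwhile. The sketch below is illustrative only: `check_sealed` is a hypothetical helper that rejects a file if it still contains a private key header or is not a `SealedSecret` manifest.

```bash
# Hypothetical pre-commit check: fail if the file still contains a
# private key header, or if it is not a SealedSecret manifest.
check_sealed() {
  file="$1"
  if grep -q -e '-----BEGIN .*PRIVATE KEY-----' "$file"; then
    echo "refusing: $file contains a private key" >&2
    return 1
  fi
  if ! grep -q '^kind: SealedSecret' "$file"; then
    echo "refusing: $file is not a SealedSecret" >&2
    return 1
  fi
}

# Usage:
#   check_sealed secrets/forte-helm-repo-secret-sealed.yaml && \
#     git add secrets/forte-helm-repo-secret-sealed.yaml
```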
**Step 5: Verify Repository Access**
```bash
# Check if repository is connected
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository
# Verify connection in ArgoCD UI
# Settings → Repositories → Should show "Successful" status
# Test by creating an application
kubectl apply -f _app-of-apps-upc-dev.yaml # or _app-of-apps-upc-prod.yaml
# Check application sync status
kubectl get applications -n argocd
```
#### Testing Repository Access
Create a test application to verify SSH access:
```bash
cat > /tmp/test-repo-access.yaml <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: test-repo-access
  namespace: argocd
spec:
  project: default
  source:
    repoURL: ssh://git@git.forteapps.net:2222/Forte/launchpad.git
    targetRevision: main
    path: cluster-resources
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated: null # Manual sync for testing
EOF
kubectl apply -f /tmp/test-repo-access.yaml
# Check if ArgoCD can access the repository
kubectl describe application test-repo-access -n argocd
# Look for sync status - should show repository contents
kubectl get application test-repo-access -n argocd -o jsonpath='{.status.sync.status}'
# Clean up test application
kubectl delete application test-repo-access -n argocd
rm /tmp/test-repo-access.yaml
```
#### Security Best Practices
1. **Secure Private Keys**
```bash
# Store private key securely and delete local copy
# Option 1: Store in password manager (recommended)
# Option 2: Backup to encrypted storage
# Delete local private key after adding to Kubernetes
shred -u argocd-deploy-key
# Or on Windows
# Remove-Item -Path argocd-deploy-key -Force
```
2. **Rotate Keys Regularly**
```bash
# Generate new key
ssh-keygen -t ed25519 -C "argocd-deploy-key-$(date +%Y%m)" -f argocd-new-key -N ""
# Add new public key to GitHub (keep old key for now)
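# Update Kubernetes secret (ArgoCD needs type, url, and the repository label alongside the key)
kubectl create secret generic repo-launchpad \
  --from-file=sshPrivateKey=argocd-new-key \
  --from-literal=type=git \
  --from-literal=url=ssh://git@git.forteapps.net:2222/Forte/launchpad.git \
  --namespace=argocd \
  --dry-run=client -o yaml | \
  kubectl label --local -f - argocd.argoproj.io/secret-type=repository --dry-run=client -o yaml | \
  kubectl apply -f -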
# Test access, then remove old deploy key from GitHub
# Clean up
shred -u argocd-new-key
```
3. **Audit Repository Access**
```bash
# List all repository secrets
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository
# Review deploy keys in GitHub
# Visit: https://git.forteapps.net/Forte/launchpad/settings/keys
```
4. **Use Different Keys per Repository**
- Don't reuse the same deploy key across repositories
- If one key is compromised, only one repository is affected
- Easier to track and audit access
#### Troubleshooting Repository Access
**Issue: "permission denied (publickey)"**
```bash
# Check if secret exists
kubectl get secret repo-launchpad -n argocd
# Verify secret has correct label
kubectl get secret repo-launchpad -n argocd -o yaml | grep argocd.argoproj.io/secret-type
# Check ArgoCD application controller logs
kubectl logs -n argocd deployment/argocd-application-controller | grep -i "permission denied"
# Verify deploy key is added to GitHub
# Visit: https://git.forteapps.net/Forte/launchpad/settings/keys
```
**Issue: "Host key verification failed"**
```bash
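# Scan the Git server's host keys locally and verify the fingerprints out of band
ssh-keyscan -p 2222 git.forteapps.net
# Add the verified lines to the ssh_known_hosts key of ArgoCD's known-hosts ConfigMap
kubectl edit configmap argocd-ssh-known-hosts-cm -n argocd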
# Or disable strict host key checking (less secure)
kubectl patch secret repo-launchpad -n argocd \
--type merge \
-p '{"stringData":{"insecure":"true"}}'
```
**Issue: Repository shows as "Unknown" status**
```bash
# Check repository server logs
kubectl logs -n argocd deployment/argocd-repo-server
# Refresh repository connection
kubectl delete secret repo-launchpad -n argocd
# Recreate secret (see Step 3 above)
# Restart ArgoCD components
kubectl rollout restart deployment argocd-repo-server -n argocd
kubectl rollout restart deployment argocd-application-controller -n argocd
```
#### Multiple Repository Setup
For the three-repository pattern (launchpad, forte-helm, helm-values):
```bash
# 1. launchpad (main config repo)
ssh-keygen -t ed25519 -C "argocd-launchpad" -f key-sturdy -N ""
# Add key-sturdy.pub to: https://git.forteapps.net/Forte/launchpad/settings/keys
# 2. helm-values (private values repo)
ssh-keygen -t ed25519 -C "argocd-helm-values" -f key-helm-values -N ""
# Add key-helm-values.pub to: https://github.com/fortedigital/helm-values/settings/keys
# 3. forte-helm (private helm charts repo)
# Create secrets (ArgoCD needs type and url alongside the key)
kubectl create secret generic repo-launchpad \
  --from-file=sshPrivateKey=key-sturdy \
  --from-literal=type=git \
  --from-literal=url=ssh://git@git.forteapps.net:2222/Forte/launchpad.git \
  --namespace=argocd --dry-run=client -o yaml | \
  kubectl label --local -f - argocd.argoproj.io/secret-type=repository --dry-run=client -o yaml | \
  kubectl apply -f -
kubectl create secret generic repo-helm-values \
  --from-file=sshPrivateKey=key-helm-values \
  --from-literal=type=git \
  --from-literal=url='<SSH URL of the helm-values repo>' \
  --namespace=argocd --dry-run=client -o yaml | \
  kubectl label --local -f - argocd.argoproj.io/secret-type=repository --dry-run=client -o yaml | \
  kubectl apply -f -
# Clean up keys
shred -u key-sturdy key-helm-values
```
#### Converting HTTPS to SSH
If you're currently using HTTPS and want to switch to SSH:
```bash
# 1. Generate and add deploy key (see steps above)
# 2. Update all Application manifests
# Change from:
# repoURL: https://git.forteapps.net/Forte/launchpad
# To:
# repoURL: ssh://git@git.forteapps.net:2222/Forte/launchpad.git
# 3. Update and commit
find . -name "*.yaml" -type f -exec sed -i 's|https://git.forteapps.net/Forte/launchpad|ssh://git@git.forteapps.net:2222/Forte/launchpad.git|g' {} +
git add .
git commit -m "Switch from HTTPS to SSH for repository access"
git push
# 4. ArgoCD will automatically re-sync with new SSH URLs
```
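To preview the rewrite before touching any manifests, the conversion can be tried on single URLs first. This is a sketch under stated assumptions: `to_ssh_url` is a hypothetical helper, and the host and port `2222` are taken from the SSH URLs used elsewhere in this runbook.

```bash
# Sketch: rewrite an HTTPS repository URL on git.forteapps.net to the SSH
# form used in this runbook. URLs on other hosts are printed unchanged.
to_ssh_url() {
  printf '%s\n' "$1" | sed -E \
    -e 's|^https://git\.forteapps\.net/|ssh://git@git.forteapps.net:2222/|' \
    -e '/^ssh:/ s|(\.git)?$|.git|'
}

# Usage:
#   to_ssh_url https://git.forteapps.net/Forte/launchpad
```

The second expression appends `.git` only when it is missing, so URLs that already carry the suffix are not doubled.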
---
## Day-to-Day Operations
### Monitoring ArgoCD Sync Status
#### Via Slack
All applications send notifications to a shared Slack channel:
- ✅ `on-sync-succeeded` - Deployment succeeded
- ❌ `on-sync-failed` - Deployment failed
- ⚠️ `on-degraded` - Application unhealthy
#### Via CLI
```bash
# List all applications
kubectl get applications -n argocd
# Watch application status
kubectl get applications -n argocd -w
# Get detailed status
kubectl describe application myapp -n argocd
```
#### Via ArgoCD UI
```bash
# Port forward to UI
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Access: https://localhost:8080
# No login required (insecure mode for internal use)
```
### Checking Application Health
```bash
# Quick health check for all apps
kubectl get applications -n argocd \
-o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status
# Expected output:
# NAME SYNC HEALTH
# infrastructure-apps Synced Healthy
# enterprise-apps Synced Healthy
# mcp10x Synced Healthy
# musicman Synced Healthy
```
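On clusters with many applications, it may be easier to list only the rows that deviate from the expected output. A minimal sketch, assuming the three-column table produced by the command above; `unhealthy_apps` is a hypothetical helper name.

```bash
# Sketch: read the NAME/SYNC/HEALTH table on stdin and print only rows
# whose SYNC is not "Synced" or whose HEALTH is not "Healthy".
unhealthy_apps() {
  awk 'NR > 1 && ($2 != "Synced" || $3 != "Healthy") { print $1, $2, $3 }'
}

# Usage:
#   kubectl get applications -n argocd \
#     -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status \
#     | unhealthy_apps
```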
### Manual Sync
Force sync an application:
```bash
# Request a hard refresh (ArgoCD re-reads Git; with auto-sync enabled, any drift is then synced)
kubectl patch application myapp -n argocd \
--type merge \
-p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
# Or via ArgoCD CLI (if installed)
argocd app sync myapp
```
### Pausing Auto-Sync
Temporarily disable automatic syncing:
```bash
# Edit application
kubectl edit application myapp -n argocd
# To disable auto-sync, set automated to null:
#   spec:
#     syncPolicy:
#       automated: null
# To re-enable later, restore:
#   spec:
#     syncPolicy:
#       automated:
#         prune: true
#         selfHeal: true
```
---
## Application Management
### Deploying a New Application
See [Developer Guide](DEVELOPER-GUIDE.md#deploying-your-first-application) for detailed steps.
**Quick checklist:**
- [ ] Create `helm-values/myapp/values.yaml`
- [ ] Create `apps/myapp.yaml` in config repo
- [ ] Create SealedSecret if needed
- [ ] Commit and push changes
- [ ] Verify sync in Slack/ArgoCD
- [ ] Configure DNS for domain
- [ ] Test application accessibility
### Removing an Application
#### Safe Removal Procedure
```bash
# 1. Delete ArgoCD Application (cascades only if the Application carries
#    the resources-finalizer.argocd.argoproj.io finalizer)
kubectl delete application myapp -n argocd
# This will:
# - Remove application from ArgoCD
# - Delete all Kubernetes resources (cascade)
# - Remove namespace
# 2. Clean up Git repositories
cd ~/dev/k8s/launchpad
git rm apps/myapp.yaml
git commit -m "Remove myapp application"
git push
cd ~/dev/k8s/helm-prod-values
git rm -r myapp/
git commit -m "Remove myapp values"
git push
# 3. Remove sealed secrets (if any)
cd ~/dev/k8s/launchpad
git rm secrets/myapp-credentials-sealed.yaml
git commit -m "Remove myapp secrets"
git push
```
#### Removal Without Cascade
To remove from ArgoCD but keep resources running:
```bash
# Delete application with no cascade
kubectl patch application myapp -n argocd \
-p '{"metadata":{"finalizers":[]}}' --type merge
kubectl delete application myapp -n argocd
# Resources remain in cluster but are no longer managed
```
### Scaling Applications
#### Manual Scaling
```bash
# Scale deployment directly
kubectl scale deployment myapp -n myapp --replicas=3
# Note: If selfHeal is enabled, this will be reverted
```
#### GitOps Scaling
Update `helm-values/myapp/values.yaml`:
```yaml
app:
  replicaCount: 3 # Change from 1 to 3
```
Commit and push - ArgoCD will sync.
#### Auto-Scaling (HPA)
Enable Horizontal Pod Autoscaler:
```yaml
# In helm-values/myapp/values.yaml
app:
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
```
**Note:** Remove `replicaCount` from ArgoCD ignore list if using HPA:
```yaml
# In apps/myapp.yaml
ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /spec/replicas # Remove this line
```
### Rolling Back Deployments
#### Option 1: Git Revert
```bash
# Find the commit before the bad change
cd ~/dev/k8s/helm-prod-values
git log --oneline myapp/values.yaml
# Revert to previous version
git revert <commit-hash>
git push
# ArgoCD will sync the rollback
```
#### Option 2: Manual Rollback
```bash
# Rollback to previous revision
kubectl rollout undo deployment myapp -n myapp
# Note: This will be reverted by ArgoCD selfHeal
# Make permanent by updating Git
```
#### Option 3: Change Image Tag
```bash
# Edit helm-values
cd ~/dev/k8s/helm-prod-values
vim myapp/values.yaml
# Change image tag to previous version:
#   app:
#     image:
#       tag: v1.0.0 # Roll back from v1.0.1
# Commit and push
git add myapp/values.yaml
git commit -m "Rollback myapp to v1.0.0"
git push
```
### Resource Updates
#### Update Resource Limits
```yaml
# In helm-values/myapp/values.yaml
app:
  resources:
    requests:
      cpu: 200m # Increased from 100m
      memory: 512Mi # Increased from 256Mi
    limits:
      cpu: 1000m
      memory: 2Gi
```
#### Enable Database
```yaml
# In helm-values/myapp/values.yaml
db:
  enabled: true
  persistence:
    size: 10Gi # Increase storage
```
---
## Secret Management
### Creating Secrets
#### Step 1: Get Public Certificate
```bash
# Fetch sealed-secrets public cert (one-time)
kubeseal --fetch-cert \
--controller-name=sealed-secrets-controller \
--controller-namespace=kube-system \
> pub-cert.pem
# Save this certificate for future use
```
#### Step 2: Create Plain Secret
```bash
# Method 1: From literal values
kubectl create secret generic myapp-credentials \
--from-literal=API_KEY=secret123 \
--from-literal=DB_PASSWORD=pass456 \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
# Method 2: From file
kubectl create secret generic myapp-credentials \
--from-file=.env \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
# Method 3: From multiple files
kubectl create secret generic myapp-credentials \
--from-file=api-key.txt \
--from-file=db-password.txt \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
```
#### Step 3: Seal Secret
```bash
kubeseal --format=yaml \
--cert=pub-cert.pem \
--namespace=myapp \
< private/myapp-credentials.yaml \
> secrets/myapp-credentials-sealed.yaml
```
#### Step 4: Commit Sealed Secret
```bash
git add secrets/myapp-credentials-sealed.yaml
git commit -m "Add myapp credentials"
git push
# Delete plain secret
rm private/myapp-credentials.yaml
```
### Updating Secrets
```bash
# 1. Create new version
kubectl create secret generic myapp-credentials \
--from-literal=API_KEY=new-secret-key \
--from-literal=DB_PASSWORD=new-password \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-credentials.yaml
# 2. Seal it
kubeseal --format=yaml \
--cert=pub-cert.pem \
--namespace=myapp \
< private/myapp-credentials.yaml \
> secrets/myapp-credentials-sealed.yaml
# 3. Commit
git add secrets/myapp-credentials-sealed.yaml
git commit -m "Update myapp credentials"
git push
# 4. Restart pods to pick up new secret
kubectl rollout restart deployment myapp -n myapp
# 5. Delete plain secret
rm private/myapp-credentials.yaml
```
### Viewing Secrets (Unsealed)
```bash
# List secrets in namespace
kubectl get secrets -n myapp
# Describe secret (doesn't show values)
kubectl describe secret myapp-credentials -n myapp
# View secret values (base64 encoded)
kubectl get secret myapp-credentials -n myapp -o yaml
# Decode secret value
kubectl get secret myapp-credentials -n myapp \
-o jsonpath='{.data.API_KEY}' | base64 -d
```
### Secret Cloning (Kyverno)
Secrets labeled `allowedToBeCloned: "true"` in the `secrets` namespace are automatically cloned to new namespaces.
```yaml
# Example: secrets-namespace.yaml
apiVersion: v1
kind: Secret
metadata:
  name: shared-credentials
  namespace: secrets
  labels:
    allowedToBeCloned: "true"
type: Opaque
data:
  API_KEY: <base64-encoded-value>
```
When a new namespace is created, Kyverno automatically copies this secret.
### Authentication Secrets
Applications using the authentication sidecar require specific secrets depending on the auth mode.
#### Token Mode Secrets
Token-based auth uses an `auth-tokens` Secret:
```bash
# Method 1: From Helm values (automatic)
# Tokens specified in values.yaml are automatically created
# Method 2: Manual creation
kubectl create secret generic auth-tokens \
--from-literal=tokens="token1
token2
token3" \
--namespace=myapp
# Method 3: From file
echo "d4f88f6d9292c10cc3e21c4aad56d2be485db532b54fe961d738e1137d247823" > tokens.txt
echo "8803f621acc3898df1d7a8f514bc3602551a0681a8f747bd4e43c3c5849d57a7" >> tokens.txt
kubectl create secret generic auth-tokens \
--from-file=tokens=tokens.txt \
--namespace=myapp
rm tokens.txt
```
#### OIDC Mode Secrets
OIDC auth requires an `auth-oidc` Secret with two keys:
```bash
# Generate secrets
CLIENT_SECRET="your-oidc-client-secret-from-provider"
COOKIE_SECRET=$(openssl rand -hex 32)
# Create plain secret
kubectl create secret generic auth-oidc \
  --from-literal=client-secret="$CLIENT_SECRET" \
  --from-literal=cookie-secret="$COOKIE_SECRET" \
--namespace=myapp \
--dry-run=client -o yaml > private/myapp-auth-oidc.yaml
# Seal it
kubeseal --format=yaml \
--cert=pub-cert.pem \
--namespace=myapp \
< private/myapp-auth-oidc.yaml \
> secrets/myapp-auth-oidc-sealed.yaml
# Apply sealed secret
kubectl apply -f secrets/myapp-auth-oidc-sealed.yaml
# Commit to Git
git add secrets/myapp-auth-oidc-sealed.yaml
git commit -m "Add OIDC secrets for myapp"
git push
# Clean up
rm private/myapp-auth-oidc.yaml
```
#### Rotating Authentication Secrets
**Token Rotation**:
```bash
# Generate new token
NEW_TOKEN=$(openssl rand -hex 32)
# Get current tokens
kubectl get secret auth-tokens -n myapp -o yaml > /tmp/tokens.yaml
# Edit tokens (add new, optionally remove old)
# Then re-seal and apply
# Restart pods to use new tokens
kubectl rollout restart deployment myapp -n myapp
```
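The "edit tokens" step can be scripted once the current token list is in a local file. The sketch below assumes one token per line, as in the `auth-tokens` examples above; `rotate_tokens` is a hypothetical helper, not part of any tool used here.

```bash
# Sketch: print a new token list with NEW appended and, optionally,
# OLD removed. Expects one token per line in FILE.
rotate_tokens() {
  file="$1"; new="$2"; old="${3:-}"
  if [ -n "$old" ]; then
    # -x -F: drop only lines that match the old token exactly
    grep -v -x -F -e "$old" "$file" || true
  else
    cat "$file"
  fi
  printf '%s\n' "$new"
}

# Usage (tokens.txt holds the decoded contents of the "tokens" key):
#   rotate_tokens tokens.txt "$NEW_TOKEN" > tokens.new
#   rotate_tokens tokens.txt "$NEW_TOKEN" "<old-token>" > tokens.new
```

Keeping the old token in the list for a transition period lets existing clients continue to work until they switch to the new one.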
**OIDC Secret Rotation**:
```bash
# Rotate cookie secret (safe - invalidates existing sessions)
NEW_COOKIE_SECRET=$(openssl rand -hex 32)
# Recreate secret
kubectl create secret generic auth-oidc \
  --from-literal=client-secret="$CLIENT_SECRET" \
  --from-literal=cookie-secret="$NEW_COOKIE_SECRET" \
--namespace=myapp \
--dry-run=client -o yaml | \
kubeseal --format=yaml --cert=pub-cert.pem --namespace=myapp | \
kubectl apply -f -
# Restart to pick up new secret
kubectl rollout restart deployment myapp -n myapp
```
#### Viewing Authentication Secrets
```bash
# List auth-related secrets
kubectl get secrets -n myapp | grep auth
# View token secret (tokens are in plain text in the Secret)
kubectl get secret auth-tokens -n myapp -o jsonpath='{.data.tokens}' | base64 -d
# View OIDC secret keys (values are base64 encoded)
kubectl get secret auth-oidc -n myapp -o jsonpath='{.data.client-secret}' | base64 -d
kubectl get secret auth-oidc -n myapp -o jsonpath='{.data.cookie-secret}' | base64 -d
```
**See**: [Developer Guide - Enabling Authentication](DEVELOPER-GUIDE.md#enabling-authentication-for-applications) for the complete authentication setup guide.
---
## Monitoring & Alerting
### Prometheus Metrics
```bash
# Port forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Access: http://localhost:9090
```
**Common Queries:**
```promql
# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
# Memory usage per pod
sum(container_memory_usage_bytes) by (pod)
# Request rate per service
rate(http_requests_total[5m])
```
### Grafana Dashboards
```bash
# Port forward to Grafana
kubectl port-forward -n monitoring svc/grafana 3000:80
# Access: http://localhost:3000
```
### Loki Logs
```bash
# Port forward to Loki
kubectl port-forward -n monitoring svc/loki 3100:3100
# Query logs from the last hour (start is a Unix timestamp in nanoseconds)
curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="myapp"}' \
  --data-urlencode "start=$(( $(date +%s) - 3600 ))000000000" | jq
```
### Tempo Traces
```bash
# Port forward to Tempo query API
kubectl port-forward -n monitoring svc/tempo 3200:3200
# Access: http://localhost:3200
```
**Query traces via Grafana:**
1. Open Grafana → Explore
2. Select Tempo datasource
3. Use TraceQL or search by service name
**Verify Traefik is sending traces:**
```bash
# Check Traefik logs for OTLP export errors
kubectl logs -n traefik-system -l app.kubernetes.io/name=traefik | grep -i "traces export"
# Check Tempo is receiving data
kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep "receiver"
```
**Trace-to-log correlation:**
- Click a trace span in Grafana → linked Loki logs appear (by namespace, pod, container)
- Trace-to-metrics links to Prometheus by service name
### Fluent-Bit Log Shipping
Verify Fluent-Bit is shipping logs:
```bash
# Check Fluent-Bit pods
kubectl get pods -n monitoring | grep fluent-bit
# Check logs
kubectl logs -n monitoring daemonset/fluent-bit
# Verify Loki is receiving logs
kubectl logs -n monitoring deployment/loki | grep "POST /loki/api/v1/push"
```
### Trivy Vulnerability Scanning
```bash
# Check Trivy scan results
kubectl get vulnerabilityreports --all-namespaces
# View report for specific pod
kubectl describe vulnerabilityreport -n myapp <report-name>
```
### Slack Notifications
All applications have Slack notifications enabled:
```yaml
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: ""
    notifications.argoproj.io/subscribe.on-sync-failed.slack: ""
    notifications.argoproj.io/subscribe.on-degraded.slack: ""
```
**Test Notification:**
```bash
# Hard-refresh the app; a notification is sent only if this leads to a sync
kubectl patch application myapp -n argocd \
--type merge \
-p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
```
---
## Troubleshooting
### Application Won't Sync
#### Check Application Status
```bash
kubectl describe application myapp -n argocd
```
Look for errors in:
- `Status.Conditions`
- `Status.OperationState`
#### Common Issues
**Issue 1: Image Pull Error**
```bash
# Error: ErrImagePull, ImagePullBackOff
# Check if image exists
docker pull ghcr.io/fortedigital/myapp:v1.0.0
# Check image pull secrets
kubectl get secrets -n myapp | grep regcred
# Check pod events
kubectl describe pod -n myapp <pod-name>
```
**Issue 2: Invalid YAML**
```bash
# Error: unable to decode manifest
# Validate YAML locally
kubectl apply --dry-run=client -f apps/myapp.yaml
# Check ArgoCD application controller logs
kubectl logs -n argocd deployment/argocd-application-controller | grep myapp
```
**Issue 3: Resource Quota Exceeded**
```bash
# Error: exceeded quota
# Check namespace quotas
kubectl get resourcequota -n myapp
kubectl describe resourcequota -n myapp
# Increase quota or reduce resource requests
```
### Pod Crashes
#### CrashLoopBackOff
```bash
# Check pod status
kubectl get pods -n myapp
# View logs
kubectl logs -n myapp <pod-name>
kubectl logs -n myapp <pod-name> --previous # Previous container
# Check events
kubectl describe pod -n myapp <pod-name>
```
**Common Causes:**
- Application error (check logs)
- Missing environment variables
- Wrong port configuration
- Missing secrets
- Insufficient memory/CPU
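To find which pods to inspect first, the `kubectl get pods` table can be filtered for crash loops and high restart counts. A minimal sketch, assuming the default column order (NAME, READY, STATUS, RESTARTS, AGE); `crashing_pods` and the default threshold of 3 are illustrative choices.

```bash
# Sketch: read `kubectl get pods` output on stdin and print pods that are
# in CrashLoopBackOff or have restarted more than THRESHOLD times.
crashing_pods() {
  threshold="${1:-3}"
  awk -v t="$threshold" \
    'NR > 1 && ($3 == "CrashLoopBackOff" || $4 + 0 > t) { print $1, $3, $4 }'
}

# Usage:
#   kubectl get pods -n myapp | crashing_pods 3
```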
#### ImagePullBackOff
```bash
# Check image name
kubectl get deployment myapp -n myapp -o yaml | grep image
# Verify credentials
kubectl get secret -n myapp
```
#### Pending
```bash
# Check why pod is pending
kubectl describe pod -n myapp <pod-name>
# Common reasons:
# - Insufficient resources on nodes
# - PVC not bound
# - Node selector doesn't match
```
### Ingress / TLS Issues
#### Application Not Accessible
```bash
# Check IngressRoute
kubectl get ingressroute -n myapp
kubectl describe ingressroute myapp -n myapp
# Check Traefik
kubectl get pods -n traefik
kubectl logs -n traefik deployment/traefik
# Test with port-forward
kubectl port-forward -n myapp service/myapp 8080:3000
curl http://localhost:8080
```
#### Certificate Issues
```bash
# Check certificates
kubectl get certificate -n myapp
kubectl describe certificate myapp-tls -n myapp
# Check cert-manager
kubectl get clusterissuer
kubectl logs -n cert-manager deployment/cert-manager
# Check Let's Encrypt challenges
kubectl get challenges --all-namespaces
```
**Manual Certificate Renewal:**
```bash
# Delete and recreate certificate
kubectl delete certificate myapp-tls -n myapp
# Certificate will be automatically recreated
```
### Database Issues
#### PostgreSQL Won't Start
```bash
# Check StatefulSet
kubectl get statefulset -n myapp
kubectl describe statefulset postgres -n myapp
# Check PVC
kubectl get pvc -n myapp
kubectl describe pvc -n myapp
# Check logs
kubectl logs -n myapp postgres-0
```
#### Data Persistence
```bash
# Verify PVC is bound
kubectl get pvc -n myapp
# Check storage class
kubectl get storageclass
# Resize PVC (if supported)
kubectl edit pvc postgres-data-postgres-0 -n myapp
# Change: storage: 10Gi (from 5Gi)
```
### Kyverno Policy Issues
#### Policy Violations
```bash
# List policies
kubectl get clusterpolicy
# Check policy reports
kubectl get policyreport --all-namespaces
# View specific policy
kubectl describe clusterpolicy secret-cloner
```
#### Secret Not Cloned
```bash
# Check if secret has label
kubectl get secret -n secrets --show-labels
# Check Kyverno logs
kubectl logs -n kyverno deployment/kyverno
# Manually trigger by recreating namespace
kubectl delete ns test-ns
kubectl create ns test-ns
```
### ArgoCD Issues
#### ArgoCD UI Not Accessible
```bash
# Check ArgoCD pods
kubectl get pods -n argocd
# Restart ArgoCD server
kubectl rollout restart deployment argocd-server -n argocd
# Port forward
kubectl port-forward svc/argocd-server -n argocd 8080:443
```
#### Sync Takes Too Long
```bash
# Check application controller logs
kubectl logs -n argocd deployment/argocd-application-controller
# Allow a longer retry backoff (in apps/myapp.yaml); maxDuration caps the delay between retries:
#   spec:
#     syncPolicy:
#       retry:
#         backoff:
#           maxDuration: 5m # Increase from 3m
```
---
## Disaster Recovery
### Backup Strategy
**Current State**: No automated backups
**What Needs Backup**:
- ❌ Cluster state (not backed up - recreate via GitOps)
- ❌ Persistent volumes (currently not critical)
- ✅ Git repositories (GitHub provides backup)
- ⚠️ Secrets (sealed secrets in Git, unseal keys need safekeeping)
### Cluster Rebuild
**Scenario**: Complete cluster failure
```bash
# 1. Provision new Kubernetes cluster
# 2. Configure kubectl
kubectl config use-context new-cluster
kubectl cluster-info
# 3. Bootstrap cluster
cd ~/dev/k8s/launchpad
./bootstrap.sh
# 4. Wait for ArgoCD to sync all applications
kubectl get applications -n argocd -w
# 5. Recreate any unsealed secrets (from password manager)
# 6. Configure DNS for new cluster IPs
# 7. Verify all applications are healthy
```
**Time Estimate**: 30-60 minutes
**Data Loss**:
- Ephemeral data: Lost
- Database data: Lost (no backups currently)
- Configuration: No loss (in Git)
### Future Backup Plan
**Recommended**:
1. **Velero** for cluster backups
```bash
helm install velero vmware-tanzu/velero \
--namespace velero \
--create-namespace \
--set configuration.provider=aws \
--set configuration.backupStorageLocation[0].bucket=cluster-backups
```
2. **PostgreSQL backups** via CronJob
```yaml
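# pg-backup-cronjob.yaml (sketch: still needs a /backup volume mount and DB connection env)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup # placeholder name
spec:
  schedule: "0 2 * * *" # Daily at 2am
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16-alpine
              command:
                - /bin/sh
                - -c
                - pg_dump -U $DB_USER -d $DB_NAME > /backup/dump-$(date +%Y%m%d).sql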
```
3. **Sealed Secrets private key backup**
```bash
# Backup sealed-secrets controller private key(s), selected by label
kubectl get secret -n kube-system \
  -l sealedsecrets.bitnami.com/sealed-secrets-key \
  -o yaml > sealed-secrets-key-backup.yaml
# Store in secure location (password manager, vault)
```
---
## Maintenance Procedures
### Upgrading ArgoCD
```bash
# Check current version
kubectl get deployment argocd-server -n argocd \
-o jsonpath='{.spec.template.spec.containers[0].image}'
# Update version in values
vim infra/values/base/argocd-values.yaml
# Or upgrade via Helm directly
helm upgrade argocd argo-cd \
--repo https://argoproj.github.io/argo-helm \
--namespace argocd \
--values infra/values/base/argocd-values.yaml \
--version 6.0.0 # New version
# Verify
kubectl get pods -n argocd
```
### Upgrading Kubernetes Version
```bash
# UpCloud: Upgrade via control panel or CLI
# After upgrade, verify cluster
kubectl version
kubectl get nodes
# List the API resources the upgraded cluster serves, then compare with the apiVersions used in Git
kubectl api-resources
# Update any deprecated resources in Git
```
### Rotating TLS Certificates
Let's Encrypt certificates auto-renew, but if manual rotation is needed:
```bash
# Delete certificate to force renewal
kubectl delete certificate myapp-tls -n myapp
# Cert-manager will automatically recreate
kubectl get certificate -n myapp -w
```
### Cleaning Up Old Resources
```bash
# List all namespaces
kubectl get namespaces
# Remove unused namespaces
kubectl delete namespace old-app
# Clean up ArgoCD applications
kubectl get applications -n argocd
kubectl delete application old-app -n argocd
# Clean up old container images (on nodes)
# SSH to nodes and run (Docker runtime):
docker image prune -a --filter "until=720h" # 30 days
# On containerd-based nodes, use instead: crictl rmi --prune
```
### DNS Management
**Adding New Subdomain**:
1. Add DNS A record pointing to Traefik LoadBalancer IP
```bash
# Get LoadBalancer IP
kubectl get svc -n traefik traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```
2. Add to DNS provider:
```
myapp.forteapps.net A <LoadBalancer-IP>
```
3. Verify DNS propagation:
```bash
nslookup myapp.forteapps.net
dig myapp.forteapps.net
```
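The three steps above can be tied together in one check that compares the DNS answer with the LoadBalancer IP. This is a minimal sketch under assumptions: the function name `dns_points_to_lb` is hypothetical, and it relies on `dig` being installed locally.

```bash
# Hypothetical helper: succeed if <hostname> resolves to the Traefik LoadBalancer IP.
dns_points_to_lb() {
  local host="$1" lb_ip
  lb_ip="$(kubectl get svc -n traefik traefik \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
  [ -n "$lb_ip" ] || return 1   # LoadBalancer has no IP assigned yet
  # dig +short prints one record per line; match the IP as a whole line.
  dig +short "$host" A | grep -qxF "$lb_ip"
}

# Usage:
#   dns_points_to_lb myapp.forteapps.net && echo "DNS OK"
```

A failing check right after creating the record may only mean the change has not propagated yet, so retrying after a few minutes is reasonable.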
### Monitoring Resource Usage
```bash
# Node resource usage
kubectl top nodes
# Pod resource usage
kubectl top pods --all-namespaces
# Identify resource hogs
kubectl top pods --all-namespaces --sort-by=memory
kubectl top pods --all-namespaces --sort-by=cpu
```
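To turn the node overview into a quick threshold check, the output of `kubectl top nodes` can be filtered. The sketch below is illustrative: the function name `nodes_over_memory` and the default threshold of 80 are assumptions, and it expects the usual column order (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%).

```bash
# Hypothetical helper: list nodes whose memory usage is at or above a threshold.
# Argument: threshold in percent (defaults to 80).
nodes_over_memory() {
  local threshold="${1:-80}"
  kubectl top nodes --no-headers | awk -v t="$threshold" '
    { pct = $5; sub(/%$/, "", pct); if (pct + 0 >= t + 0) print $1, $5 }'
}

# Usage:
#   nodes_over_memory 90
```

Nodes without metrics report a non-numeric value, which evaluates to 0 in this comparison and is therefore not listed.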
---
## Advanced Operations
### Adding a New Infrastructure Component
Example: Adding Redis
```bash
# 1. Create application manifest in base/
cat > infra/base/redis-application.yaml <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: redis
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  project: default
  sources:
    - repoURL: https://charts.bitnami.com/bitnami
      chart: redis
      targetRevision: 18.0.0
      helm:
        releaseName: redis
        valueFiles:
          - \$values/infra/values/base/redis-values.yaml
    - repoURL: ssh://git@git.forteapps.net:2222/Forte/launchpad.git
      targetRevision: HEAD
      ref: values
  destination:
    server: https://kubernetes.default.svc
    namespace: redis
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
EOF
# 2. Add to base kustomization
# Edit infra/base/kustomization.yaml and add: - redis-application.yaml
# 3. Create base values file
cat > infra/values/base/redis-values.yaml <<EOF
auth:
  enabled: true
EOF
# 4. Commit and push
git add infra/base/redis-application.yaml infra/values/base/redis-values.yaml infra/base/kustomization.yaml
git commit -m "Add Redis infrastructure component"
git push
# 5. ArgoCD will auto-sync within 60 seconds
```
### Multi-Cluster Setup
The repository supports multiple clusters via Kustomize overlays:
- **upc-dev** (default): `infra/overlays/upc-dev/` — uses base Applications as-is
- **upc-prod**: `infra/overlays/upc-prod/` — patches value file paths from `upc-dev` to `upc-prod`
Each cluster has its own:
- Root app-of-apps file: `_app-of-apps-upc-dev.yaml` / `_app-of-apps-upc-prod.yaml`
- Cluster-specific Helm values: `infra/values/upc-dev/` / `infra/values/upc-prod/`
- Sealed secrets: `secrets/upc-dev/` (others as needed)
- Apps overlay: `apps/overlays/upc-dev/` / `apps/overlays/upc-prod/`
To add a new cluster, create a new overlay directory (e.g., `infra/overlays/upc-staging/`) with patches that swap the value file paths.
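As a rough illustration of such an overlay, the sketch below scaffolds `infra/overlays/upc-staging/` with a single patch. Everything specific in it is an assumption rather than the repository's actual layout: the `../../base` resource path, the patched Application (`redis`), the position of the value file in `valueFiles`, and the target file name `redis-values.yaml`.

```bash
# Hypothetical cluster name, used for the overlay and values directories.
NEW_CLUSTER="upc-staging"

mkdir -p "infra/overlays/${NEW_CLUSTER}"

# Write a kustomization that reuses the base Applications and swaps one
# value file path. \$values stays literal; ${NEW_CLUSTER} is expanded by the shell.
cat > "infra/overlays/${NEW_CLUSTER}/kustomization.yaml" <<EOF
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: Application
      name: redis
    patch: |-
      - op: replace
        path: /spec/sources/0/helm/valueFiles/0
        value: \$values/infra/values/${NEW_CLUSTER}/redis-values.yaml
EOF
```

Running `kubectl kustomize infra/overlays/upc-staging/` from the repository root renders the result, which makes it possible to review the patched paths before committing.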
### Blue-Green Deployments
```bash
# Deploy blue version
helm install myapp-blue forteapp \
  --namespace myapp \
  --set app.image.tag=v1.0.0
# Deploy green version
helm install myapp-green forteapp \
  --namespace myapp \
  --set app.image.tag=v2.0.0
# Switch traffic via IngressRoute
# (JSON patch: a merge patch would replace the whole routes list and drop the match rules)
kubectl patch ingressroute myapp -n myapp --type json \
  -p '[{"op":"replace","path":"/spec/routes/0/services/0/name","value":"myapp-green"}]'
# Remove blue deployment after validation
helm uninstall myapp-blue --namespace myapp
```
---
## Emergency Procedures
### Emergency Rollback
```bash
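# Immediate rollback
# Note: if the Application has selfHeal enabled, ArgoCD re-applies the Git state,
# so follow up with the Git revert right away
kubectl rollout undo deployment myapp -n myapp
# Revert the offending commit in Git to make the rollback permanent
cd ~/dev/k8s/helm-prod-values
git revert HEAD
git push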
```
### Emergency Scale Down
```bash
# Scale to zero (maintenance mode)
kubectl scale deployment myapp -n myapp --replicas=0
# Update Git
vim helm-values/myapp/values.yaml
# Set replicaCount: 0
git commit -am "Scale down myapp for maintenance"
git push
```
### Emergency Application Removal
```bash
# Remove application but keep data
kubectl patch application myapp -n argocd \
-p '{"metadata":{"finalizers":[]}}' --type merge
kubectl delete application myapp -n argocd
# Resources remain in cluster
```
---
## Useful Scripts
### Sync All Applications
```bash
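#!/bin/bash
# sync-all.sh
# Triggers a hard refresh of every Application; Applications with
# automated sync then reconcile on their own.
for app in $(kubectl get applications -n argocd -o name); do
  kubectl patch "$app" -n argocd \
    --type merge \
    -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
done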
```
### Check All Applications Health
```bash
#!/bin/bash
# health-check.sh
kubectl get applications -n argocd \
-o custom-columns=\
NAME:.metadata.name,\
SYNC:.status.sync.status,\
HEALTH:.status.health.status,\
MESSAGE:.status.health.message
```
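When many Applications exist, it can be easier to list only those that need attention. The following is a minimal sketch: the function name `unhealthy_apps` is hypothetical, and it treats anything other than `Synced` and `Healthy` as worth reporting.

```bash
# Hypothetical helper: print Applications that are not both Synced and Healthy.
unhealthy_apps() {
  kubectl get applications -n argocd --no-headers \
    -o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status' \
    | awk '$2 != "Synced" || $3 != "Healthy" { print }'
}

# Usage:
#   unhealthy_apps
```

Empty output means every Application is in the expected state, which makes the function convenient to call from a scheduled job or a pre-maintenance check.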
### Seal Secret Helper
```bash
#!/bin/bash
# seal-secret.sh
# Usage: ./seal-secret.sh [namespace] [secret-file] [output-file]
set -euo pipefail
NAMESPACE="${1:-default}"
SECRET_FILE="${2:-private/secret.yaml}"
OUTPUT_FILE="${3:-secrets/secret-sealed.yaml}"
# Quote all paths so files with spaces are handled correctly
kubeseal --format=yaml \
  --cert=pub-cert.pem \
  --namespace="$NAMESPACE" \
  < "$SECRET_FILE" \
  > "$OUTPUT_FILE"
echo "Sealed secret created: $OUTPUT_FILE"
echo "Remember to delete: $SECRET_FILE"
```
---
## Checklist Templates
### New Application Deployment Checklist
- [ ] Application code repository created
- [ ] Dockerfile created and tested
- [ ] GitHub Actions workflow configured
- [ ] Helm values created in `helm-prod-values/`
- [ ] ArgoCD application manifest created in `apps/`
- [ ] Secrets created and sealed
- [ ] DNS record added for domain
- [ ] Application synced successfully
- [ ] Health check passed
- [ ] Slack notification received
- [ ] Application accessible via domain
- [ ] Monitoring configured
- [ ] Documentation updated
### Incident Response Checklist
- [ ] Incident identified (Slack alert, monitoring)
- [ ] Severity assessed
- [ ] Incident channel created
- [ ] Initial investigation (logs, metrics, events)
- [ ] Root cause identified
- [ ] Mitigation applied
- [ ] Verification of fix
- [ ] Post-mortem scheduled
- [ ] Documentation updated
---
**Last Updated**: 2026-03-16
**Maintained By**: Platform Team
**Emergency Contact**: #platform-support on Slack