Co-authored-by: Danijel Simeunovic <danijel.simeunovic@trumf.no> Reviewed-on: #4 Reviewed-by: gitea_admin <admin@forteapps.net>
# Operations Runbook
|
|
|
|
## Table of Contents

- [Overview](#overview)
- [Cluster Bootstrap](#cluster-bootstrap)
  - [Initial Cluster Setup](#initial-cluster-setup)
  - [ArgoCD Repository Access Setup](#argocd-repository-access-setup)
- [Day-to-Day Operations](#day-to-day-operations)
- [Application Management](#application-management)
- [Secret Management](#secret-management)
- [Monitoring & Alerting](#monitoring--alerting)
- [Troubleshooting](#troubleshooting)
- [Disaster Recovery](#disaster-recovery)
- [Maintenance Procedures](#maintenance-procedures)

---
|
|
|
|
## Overview
|
|
|
|
This runbook provides operational procedures for maintaining the Kubernetes cluster and managing applications. It's intended for platform engineers and operators with full cluster access.
|
|
|
|
### Operator Prerequisites
|
|
|
|
- ✅ Full kubectl access to cluster
|
|
- ✅ Write access to all Git repositories
|
|
- ✅ ArgoCD UI access
|
|
- ✅ Slack notifications configured
|
|
- ✅ Understanding of Kubernetes concepts
|
|
|
|
---
|
|
|
|
## Cluster Bootstrap
|
|
|
|
### Initial Cluster Setup
|
|
|
|
Bootstrap a new cluster from scratch:
|
|
|
|
#### Prerequisites
|
|
|
|
1. **Kubernetes cluster running** (UpCloud or any K8s cluster)
|
|
2. **kubectl configured** with admin access
|
|
3. **Repositories cloned** locally
|
|
|
|
```bash
|
|
# Verify cluster access
|
|
kubectl cluster-info
|
|
kubectl get nodes
|
|
```
|
|
|
|
#### Bootstrap Procedure
|
|
|
|
```bash
|
|
# 1. Clone config repository
|
|
git clone https://git.forteapps.net/Forte/launchpad
|
|
cd launchpad
|
|
|
|
# 2. Set cluster name (optional)
|
|
export CLUSTER_NAME="prod-cluster-01"
|
|
|
|
# 3. Run bootstrap script
|
|
./bootstrap.sh
|
|
```
|
|
|
|
**What Happens:**
|
|
1. ✅ Installs ArgoCD via Helm
|
|
2. ✅ Configures ArgoCD with custom values
|
|
3. ✅ Applies root App-of-Apps manifest
|
|
4. ✅ ArgoCD automatically syncs all applications
|
|
5. ✅ Infrastructure and apps deploy in waves
|
|
|
|
#### Verify Bootstrap
|
|
|
|
```bash
|
|
# Wait for ArgoCD to be ready
|
|
kubectl wait --for=condition=available --timeout=300s \
|
|
deployment/argocd-server -n argocd
|
|
|
|
# Check ArgoCD applications
|
|
kubectl get applications -n argocd
|
|
|
|
# Expected output: infrastructure-apps, enterprise-apps, and all child apps
|
|
```
|
|
|
|
#### Post-Bootstrap Steps
|
|
|
|
1. **Configure DNS** for ingress domains:
   - `argocd.127.0.0.1.nip.io` (local dev)
   - `*.forteapps.net` (production)

2. **Verify Let's Encrypt certificates**:
   ```bash
   kubectl get certificate --all-namespaces
   kubectl get clusterissuer
   ```

3. **Check Kyverno policies**:
   ```bash
   kubectl get clusterpolicy
   ```

4. **Verify monitoring stack**:
   ```bash
   kubectl get pods -n monitoring
   ```

5. **Test Slack notifications** by triggering a sync
|
|
|
|
### ArgoCD Repository Access Setup
|
|
|
|
ArgoCD needs SSH access to the private Git repositories to pull manifests and Helm values. This section covers setting up deploy keys for those repositories, whether they are hosted on `git.forteapps.net` or on GitHub.
|
|
|
|
#### Why Deploy Keys?
|
|
|
|
- **Read-only access**: Deploy keys provide secure, read-only access to repositories
|
|
- **No user credentials**: No need to share personal SSH keys or tokens
|
|
- **Repository-specific**: Each repository gets its own key for better security
|
|
- **Revocable**: Easy to revoke access without affecting other repositories
|
|
|
|
#### Prerequisites
|
|
|
|
- kubectl access to the cluster
|
|
- Write access to the GitHub repository
|
|
- ArgoCD installed and running
|
|
|
|
#### Setup Procedure
|
|
|
|
**Step 1: Generate SSH Key Pair**
|
|
|
|
Generate a dedicated SSH key for ArgoCD without a passphrase (required for automated access):
|
|
|
|
```bash
|
|
# Generate ED25519 key (recommended - smaller and more secure)
|
|
ssh-keygen -t ed25519 -C "argocd-deploy-key-launchpad" -f argocd-deploy-key -N ""
|
|
|
|
# Or RSA key if ED25519 is not supported
|
|
ssh-keygen -t rsa -b 4096 -C "argocd-deploy-key-launchpad" -f argocd-deploy-key -N ""
|
|
```
|
|
|
|
This creates two files:
|
|
- `argocd-deploy-key` - Private key (keep secret)
|
|
- `argocd-deploy-key.pub` - Public key (add to GitHub)
|
|
|
|
**Step 2: Add Public Key to GitHub**
|
|
|
|
1. Copy the public key:
   ```bash
   cat argocd-deploy-key.pub
   ```

2. Open the repository's deploy key settings:
   - Navigate to: `https://git.forteapps.net/Forte/launchpad/settings/keys`
   - Or: Repository → Settings → Deploy keys

3. Click **"Add deploy key"**:
   - Title: `ArgoCD Production Cluster`
   - Key: Paste the public key content
   - ☐ Allow write access (leave unchecked; read-only is sufficient)
   - Click **"Add key"**

4. Repeat for the `helm-values` repository if it's private:
   ```bash
   # Generate a separate key for the helm-values repo
   ssh-keygen -t ed25519 -C "argocd-deploy-key-helm-values" -f argocd-helm-values-key -N ""

   # Add to: https://github.com/fortedigital/helm-values/settings/keys
   ```
|
|
|
|
**Step 3: Create Kubernetes Secret**
|
|
|
|
Add the private key to ArgoCD as a repository secret:
|
|
|
|
Save the following manifest as `secret.yaml` in the gitignored `private/` folder:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: forte-helm-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: ssh://git@git.forteapps.net:2222/Forte/forte-helm.git
  sshPrivateKey: |
    <paste your private key here>
  project: default
```

Seal the secret with `kubeseal`:

```bash
kubeseal --format=yaml \
  --namespace=argocd \
  < private/secret.yaml \
  > secrets/forte-helm-repo-secret-sealed.yaml
```
|
|
|
|
**Step 4: Register Repository in ArgoCD**
|
|
|
|
Check in secrets/forte-helm-repo-secret-sealed.yaml and let Argo sync and create the secret.
|
|
|
|
**Step 5: Verify Repository Access**
|
|
|
|
```bash
|
|
# Check if repository is connected
|
|
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository
|
|
|
|
# Verify connection in ArgoCD UI
|
|
# Settings → Repositories → Should show "Successful" status
|
|
|
|
# Test by creating an application
|
|
kubectl apply -f _app-of-apps-upc-dev.yaml # or _app-of-apps-upc-prod.yaml
|
|
|
|
# Check application sync status
|
|
kubectl get applications -n argocd
|
|
```
|
|
|
|
#### Testing Repository Access
|
|
|
|
Create a test application to verify SSH access:
|
|
|
|
```bash
cat > /tmp/test-repo-access.yaml <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: test-repo-access
  namespace: argocd
spec:
  project: default
  source:
    repoURL: ssh://git@git.forteapps.net:2222/Forte/launchpad.git
    targetRevision: main
    path: cluster-resources
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated: null  # Manual sync for testing
EOF

kubectl apply -f /tmp/test-repo-access.yaml

# Check if ArgoCD can access the repository
kubectl describe application test-repo-access -n argocd

# Look for sync status - should show repository contents
kubectl get application test-repo-access -n argocd -o jsonpath='{.status.sync.status}'

# Clean up test application
kubectl delete application test-repo-access -n argocd
rm /tmp/test-repo-access.yaml
```
|
|
|
|
#### Security Best Practices
|
|
|
|
1. **Secure Private Keys**
|
|
```bash
|
|
# Store private key securely and delete local copy
|
|
# Option 1: Store in password manager (recommended)
|
|
# Option 2: Backup to encrypted storage
|
|
|
|
# Delete local private key after adding to Kubernetes
|
|
shred -u argocd-deploy-key
|
|
|
|
# Or on Windows
|
|
# Remove-Item -Path argocd-deploy-key -Force
|
|
```
|
|
|
|
2. **Rotate Keys Regularly**
|
|
```bash
|
|
# Generate new key
|
|
ssh-keygen -t ed25519 -C "argocd-deploy-key-$(date +%Y%m)" -f argocd-new-key -N ""
|
|
|
|
# Add new public key to GitHub (keep old key for now)
|
|
|
|
# Update Kubernetes secret
|
|
kubectl create secret generic repo-launchpad \
|
|
--from-file=sshPrivateKey=argocd-new-key \
|
|
--namespace=argocd \
|
|
--dry-run=client -o yaml | kubectl apply -f -
|
|
|
|
# Test access, then remove old deploy key from GitHub
|
|
|
|
# Clean up
|
|
shred -u argocd-new-key
|
|
```
|
|
|
|
3. **Audit Repository Access**
|
|
```bash
|
|
# List all repository secrets
|
|
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository
|
|
|
|
# Review deploy keys in GitHub
|
|
# Visit: https://git.forteapps.net/Forte/launchpad/settings/keys
|
|
```
|
|
|
|
4. **Use Different Keys per Repository**
|
|
- Don't reuse the same deploy key across repositories
|
|
- If one key is compromised, only one repository is affected
|
|
- Easier to track and audit access
|
|
|
|
#### Troubleshooting Repository Access
|
|
|
|
**Issue: "permission denied (publickey)"**
|
|
|
|
```bash
# Check if secret exists
kubectl get secret repo-launchpad -n argocd

# Verify secret has correct label
kubectl get secret repo-launchpad -n argocd -o yaml | grep argocd.argoproj.io/secret-type

# Check the repo server logs (the repo server performs the Git fetches)
kubectl logs -n argocd deployment/argocd-repo-server | grep -i "permission denied"

# Verify the deploy key is added to the repository
# Visit: https://git.forteapps.net/Forte/launchpad/settings/keys
```
|
|
|
|
**Issue: "Host key verification failed"**
|
|
|
|
```bash
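# Add the Git server's host key to ArgoCD's known hosts
# (stored in the argocd-ssh-known-hosts-cm ConfigMap; requires a logged-in ArgoCD CLI).
# Running ssh-keyscan through "kubectl exec ... >> ~/.ssh/known_hosts" would only
# append to the known_hosts file on your workstation, not inside the pod.
ssh-keyscan -p 2222 git.forteapps.net | argocd cert add-ssh --batch

# Or disable strict host key checking (less secure)
kubectl patch secret repo-launchpad -n argocd \
  --type merge \
  -p '{"stringData":{"insecure":"true"}}'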
```
|
|
|
|
**Issue: Repository shows as "Unknown" status**
|
|
|
|
```bash
# Check repository server logs
kubectl logs -n argocd deployment/argocd-repo-server

# Refresh repository connection
kubectl delete secret repo-launchpad -n argocd
# Recreate secret (see Step 3 above)

# Restart ArgoCD components
kubectl rollout restart deployment argocd-repo-server -n argocd
# The application controller runs as a StatefulSet in current ArgoCD releases;
# use "deployment" instead if your installation deploys it that way
kubectl rollout restart statefulset argocd-application-controller -n argocd
```
|
|
|
|
#### Multiple Repository Setup
|
|
|
|
For the three-repository pattern (launchpad, forte-helm, helm-values):
|
|
|
|
```bash
# 1. launchpad (main config repo)
ssh-keygen -t ed25519 -C "argocd-launchpad" -f key-sturdy -N ""
# Add key-sturdy.pub to: https://git.forteapps.net/Forte/launchpad/settings/keys

# 2. helm-values (private values repo)
ssh-keygen -t ed25519 -C "argocd-helm-values" -f key-helm-values -N ""
# Add key-helm-values.pub to: https://github.com/fortedigital/helm-values/settings/keys

# 3. forte-helm (private helm charts repo)
# Covered by the sealed secret created in Step 3 above

# Create secrets (ArgoCD needs type and url next to the key to match a repository)
kubectl create secret generic repo-launchpad \
  --from-literal=type=git \
  --from-literal=url=ssh://git@git.forteapps.net:2222/Forte/launchpad.git \
  --from-file=sshPrivateKey=key-sturdy \
  --namespace=argocd --dry-run=client -o yaml | \
  kubectl label --local -f - argocd.argoproj.io/secret-type=repository --dry-run=client -o yaml | \
  kubectl apply -f -

kubectl create secret generic repo-helm-values \
  --from-literal=type=git \
  --from-literal=url=git@github.com:fortedigital/helm-values.git \
  --from-file=sshPrivateKey=key-helm-values \
  --namespace=argocd --dry-run=client -o yaml | \
  kubectl label --local -f - argocd.argoproj.io/secret-type=repository --dry-run=client -o yaml | \
  kubectl apply -f -

# Clean up keys
shred -u key-sturdy key-helm-values
```
|
|
|
|
#### Converting HTTPS to SSH
|
|
|
|
If you're currently using HTTPS and want to switch to SSH:
|
|
|
|
```bash
# 1. Generate and add deploy key (see steps above)

# 2. Update all Application manifests
# Change from:
#   repoURL: https://git.forteapps.net/Forte/launchpad
# To:
#   repoURL: ssh://git@git.forteapps.net:2222/Forte/launchpad.git

# 3. Update and commit
# This rewrites the URL prefix only; add the .git suffix by hand where a manifest lacks it
find . -name "*.yaml" -type f -exec sed -i 's|https://git\.forteapps\.net/Forte/|ssh://git@git.forteapps.net:2222/Forte/|g' {} +

git add .
git commit -m "Switch from HTTPS to SSH for repository access"
git push

# 4. ArgoCD will automatically re-sync with new SSH URLs
```
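
The URL change in step 2 can also be expressed as a small shell function, which is useful for checking individual URLs before running a bulk `sed`. This is a minimal sketch: the function name `https_to_ssh` is hypothetical, and the host, port 2222, and `.git` suffix are taken from the example URLs in this section.

```bash
# Hypothetical helper: convert an HTTPS repo URL on git.forteapps.net
# to the SSH form used in this runbook. Other URLs pass through unchanged.
# The body runs in a subshell so its variables do not leak into the caller.
https_to_ssh() (
  url=$1
  prefix='https://git.forteapps.net/'
  case "$url" in
    "$prefix"*)
      path=${url#"$prefix"}   # strip the HTTPS prefix
      path=${path%.git}       # drop an existing .git suffix, if any
      printf 'ssh://git@git.forteapps.net:2222/%s.git\n' "$path"
      ;;
    *)
      printf '%s\n' "$url"
      ;;
  esac
)
```

For example, `https_to_ssh https://git.forteapps.net/Forte/launchpad` prints `ssh://git@git.forteapps.net:2222/Forte/launchpad.git`, while a URL on any other host is printed unchanged.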
|
|
|
|
---
|
|
|
|
## Day-to-Day Operations
|
|
|
|
### Monitoring ArgoCD Sync Status
|
|
|
|
#### Via Slack
|
|
|
|
All applications send notifications to a shared Slack channel:
|
|
- ✅ `on-sync-succeeded` - Deployment succeeded
|
|
- ❌ `on-sync-failed` - Deployment failed
|
|
- ⚠️ `on-degraded` - Application unhealthy
|
|
|
|
#### Via CLI
|
|
|
|
```bash
|
|
# List all applications
|
|
kubectl get applications -n argocd
|
|
|
|
# Watch application status
|
|
kubectl get applications -n argocd -w
|
|
|
|
# Get detailed status
|
|
kubectl describe application myapp -n argocd
|
|
```
|
|
|
|
#### Via ArgoCD UI
|
|
|
|
```bash
|
|
# Port forward to UI
|
|
kubectl port-forward svc/argocd-server -n argocd 8080:443
|
|
|
|
# Access: https://localhost:8080
|
|
# No login required (insecure mode for internal use)
|
|
```
|
|
|
|
### Checking Application Health
|
|
|
|
```bash
|
|
# Quick health check for all apps
|
|
kubectl get applications -n argocd \
|
|
-o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status
|
|
|
|
# Expected output:
|
|
# NAME SYNC HEALTH
|
|
# infrastructure-apps Synced Healthy
|
|
# enterprise-apps Synced Healthy
|
|
# mcp10x Synced Healthy
|
|
# musicman Synced Healthy
|
|
```
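
When the list of applications grows, it can help to print only the entries that deviate from `Synced`/`Healthy`. The following is a minimal sketch that filters the three-column table produced by the command above; the function name `unhealthy_apps` is hypothetical and not part of ArgoCD or kubectl.

```bash
# Hypothetical helper: read the NAME/SYNC/HEALTH table on stdin and print
# every application that is not both Synced and Healthy.
# Exit status is 1 if any such application was found, 0 otherwise.
unhealthy_apps() {
  awk '
    NR == 1 || NF == 0 { next }            # skip the header row and blank lines
    $2 != "Synced" || $3 != "Healthy" {
      printf "%s\t%s\t%s\n", $1, $2, $3
      bad = 1
    }
    END { exit bad + 0 }
  '
}
```

Pipe the `kubectl get applications ... -o custom-columns=...` command from the block above into `unhealthy_apps`; the non-zero exit status makes it usable in scripts or scheduled checks.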
|
|
|
|
### Manual Sync
|
|
|
|
Force a refresh or sync of an application:

```bash
# Trigger a hard refresh: ArgoCD re-reads Git and re-renders the manifests.
# With auto-sync enabled, any drift it detects is then synced automatically.
kubectl patch application myapp -n argocd \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'

# Or run an explicit sync via the ArgoCD CLI (if installed)
argocd app sync myapp
```
|
|
|
|
### Pausing Auto-Sync
|
|
|
|
Temporarily disable automatic syncing:
|
|
|
|
```bash
# Edit application
kubectl edit application myapp -n argocd

# Set automated to null
spec:
  syncPolicy:
    automated: null  # Disable auto-sync

# Re-enable later
spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
|
|
|
|
---
|
|
|
|
## Application Management
|
|
|
|
### Deploying a New Application
|
|
|
|
See [Developer Guide](DEVELOPER-GUIDE.md#deploying-your-first-application) for detailed steps.
|
|
|
|
**Quick checklist:**
|
|
- [ ] Create `helm-values/myapp/values.yaml`
|
|
- [ ] Create `apps/myapp.yaml` in config repo
|
|
- [ ] Create SealedSecret if needed
|
|
- [ ] Commit and push changes
|
|
- [ ] Verify sync in Slack/ArgoCD
|
|
- [ ] Configure DNS for domain
|
|
- [ ] Test application accessibility
|
|
|
|
### Removing an Application
|
|
|
|
#### Safe Removal Procedure
|
|
|
|
```bash
# 1. Delete ArgoCD Application (with cascade)
# Cascade applies only if the Application has the
# resources-finalizer.argocd.argoproj.io finalizer set
kubectl delete application myapp -n argocd

# This will:
# - Remove application from ArgoCD
# - Delete all Kubernetes resources (cascade)
# - Remove namespace

# 2. Clean up Git repositories
# (until apps/myapp.yaml is removed, the parent App-of-Apps can recreate the Application)
cd ~/dev/k8s/launchpad
git rm apps/myapp.yaml
git commit -m "Remove myapp application"
git push

cd ~/dev/k8s/helm-prod-values
git rm -r myapp/
git commit -m "Remove myapp values"
git push

# 3. Remove sealed secrets (if any)
cd ~/dev/k8s/launchpad
git rm secrets/myapp-credentials-sealed.yaml
git commit -m "Remove myapp secrets"
git push
```
|
|
|
|
#### Removal Without Cascade
|
|
|
|
To remove from ArgoCD but keep resources running:
|
|
|
|
```bash
|
|
# Delete application with no cascade
|
|
kubectl patch application myapp -n argocd \
|
|
-p '{"metadata":{"finalizers":[]}}' --type merge
|
|
kubectl delete application myapp -n argocd
|
|
|
|
# Resources remain in cluster but are no longer managed
|
|
```
|
|
|
|
### Scaling Applications
|
|
|
|
#### Manual Scaling
|
|
|
|
```bash
|
|
# Scale deployment directly
|
|
kubectl scale deployment myapp -n myapp --replicas=3
|
|
|
|
# Note: If selfHeal is enabled, this will be reverted
|
|
```
|
|
|
|
#### GitOps Scaling
|
|
|
|
Update `helm-values/myapp/values.yaml`:
|
|
|
|
```yaml
app:
  replicaCount: 3  # Change from 1 to 3
```
|
|
|
|
Commit and push - ArgoCD will sync.
|
|
|
|
#### Auto-Scaling (HPA)
|
|
|
|
Enable Horizontal Pod Autoscaler:
|
|
|
|
```yaml
# In helm-values/myapp/values.yaml
app:
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
```

**Note:** When the HPA manages the replica count, ArgoCD should ignore `/spec/replicas`; otherwise selfHeal can revert the HPA's scaling back to `replicaCount`. Keep this entry in the `ignoreDifferences` list:

```yaml
# In apps/myapp.yaml
ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /spec/replicas  # Keep this entry while the HPA manages replicas
```
|
|
|
|
### Rolling Back Deployments
|
|
|
|
#### Option 1: Git Revert
|
|
|
|
```bash
|
|
# Find the commit before the bad change
|
|
cd ~/dev/k8s/helm-prod-values
|
|
git log --oneline myapp/values.yaml
|
|
|
|
# Revert to previous version
|
|
git revert <commit-hash>
|
|
git push
|
|
|
|
# ArgoCD will sync the rollback
|
|
```
|
|
|
|
#### Option 2: Manual Rollback
|
|
|
|
```bash
|
|
# Rollback to previous revision
|
|
kubectl rollout undo deployment myapp -n myapp
|
|
|
|
# Note: This will be reverted by ArgoCD selfHeal
|
|
# Make permanent by updating Git
|
|
```
|
|
|
|
#### Option 3: Change Image Tag
|
|
|
|
```bash
# Edit helm-values
cd ~/dev/k8s/helm-prod-values
vim myapp/values.yaml

# Change image tag to previous version
app:
  image:
    tag: v1.0.0  # Roll back from v1.0.1

# Commit and push
git add myapp/values.yaml
git commit -m "Rollback myapp to v1.0.0"
git push
```
|
|
|
|
### Resource Updates
|
|
|
|
#### Update Resource Limits
|
|
|
|
```yaml
# In helm-values/myapp/values.yaml
app:
  resources:
    requests:
      cpu: 200m      # Increased from 100m
      memory: 512Mi  # Increased from 256Mi
    limits:
      cpu: 1000m
      memory: 2Gi
```
|
|
|
|
#### Enable Database
|
|
|
|
```yaml
# In helm-values/myapp/values.yaml
db:
  enabled: true
  persistence:
    size: 10Gi  # Increase storage
```
|
|
|
|
---
|
|
|
|
## Secret Management
|
|
|
|
### Creating Secrets
|
|
|
|
#### Step 1: Get Public Certificate
|
|
|
|
```bash
|
|
# Fetch sealed-secrets public cert (one-time)
|
|
kubeseal --fetch-cert \
|
|
--controller-name=sealed-secrets-controller \
|
|
--controller-namespace=kube-system \
|
|
> pub-cert.pem
|
|
|
|
# Save this certificate for future use
|
|
```
|
|
|
|
#### Step 2: Create Plain Secret
|
|
|
|
```bash
|
|
# Method 1: From literal values
|
|
kubectl create secret generic myapp-credentials \
|
|
--from-literal=API_KEY=secret123 \
|
|
--from-literal=DB_PASSWORD=pass456 \
|
|
--namespace=myapp \
|
|
--dry-run=client -o yaml > private/myapp-credentials.yaml
|
|
|
|
# Method 2: From file
|
|
kubectl create secret generic myapp-credentials \
|
|
--from-file=.env \
|
|
--namespace=myapp \
|
|
--dry-run=client -o yaml > private/myapp-credentials.yaml
|
|
|
|
# Method 3: From multiple files
|
|
kubectl create secret generic myapp-credentials \
|
|
--from-file=api-key.txt \
|
|
--from-file=db-password.txt \
|
|
--namespace=myapp \
|
|
--dry-run=client -o yaml > private/myapp-credentials.yaml
|
|
```
|
|
|
|
#### Step 3: Seal Secret
|
|
|
|
```bash
|
|
kubeseal --format=yaml \
|
|
--cert=pub-cert.pem \
|
|
--namespace=myapp \
|
|
< private/myapp-credentials.yaml \
|
|
> secrets/myapp-credentials-sealed.yaml
|
|
```
|
|
|
|
#### Step 4: Commit Sealed Secret
|
|
|
|
```bash
|
|
git add secrets/myapp-credentials-sealed.yaml
|
|
git commit -m "Add myapp credentials"
|
|
git push
|
|
|
|
# Delete plain secret
|
|
rm private/myapp-credentials.yaml
|
|
```
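
Because an unsealed manifest must never reach Git, a quick check before committing can help. The sketch below uses a hypothetical helper name, `find_plain_secrets`, and assumes manifests declare `kind:` at the start of a line, which is how `kubectl ... -o yaml` and `kubeseal --format=yaml` write it.

```bash
# Hypothetical helper: list files under a directory that declare a plain
# top-level "kind: Secret". Lines reading "kind: SealedSecret" do not match.
# Exit status follows grep: 0 if at least one file matched, 1 if none did.
find_plain_secrets() {
  grep -rlE '^kind:[[:space:]]*Secret[[:space:]]*$' "$1"
}
```

Running `find_plain_secrets secrets/` before `git add` should print nothing; any file name it prints is a manifest that still needs to be sealed or removed.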
|
|
|
|
### Updating Secrets
|
|
|
|
```bash
|
|
# 1. Create new version
|
|
kubectl create secret generic myapp-credentials \
|
|
--from-literal=API_KEY=new-secret-key \
|
|
--from-literal=DB_PASSWORD=new-password \
|
|
--namespace=myapp \
|
|
--dry-run=client -o yaml > private/myapp-credentials.yaml
|
|
|
|
# 2. Seal it
|
|
kubeseal --format=yaml \
|
|
--cert=pub-cert.pem \
|
|
--namespace=myapp \
|
|
< private/myapp-credentials.yaml \
|
|
> secrets/myapp-credentials-sealed.yaml
|
|
|
|
# 3. Commit
|
|
git add secrets/myapp-credentials-sealed.yaml
|
|
git commit -m "Update myapp credentials"
|
|
git push
|
|
|
|
# 4. Restart pods to pick up new secret
|
|
kubectl rollout restart deployment myapp -n myapp
|
|
|
|
# 5. Delete plain secret
|
|
rm private/myapp-credentials.yaml
|
|
```
|
|
|
|
### Viewing Secrets (Unsealed)
|
|
|
|
```bash
|
|
# List secrets in namespace
|
|
kubectl get secrets -n myapp
|
|
|
|
# Describe secret (doesn't show values)
|
|
kubectl describe secret myapp-credentials -n myapp
|
|
|
|
# View secret values (base64 encoded)
|
|
kubectl get secret myapp-credentials -n myapp -o yaml
|
|
|
|
# Decode secret value
|
|
kubectl get secret myapp-credentials -n myapp \
|
|
-o jsonpath='{.data.API_KEY}' | base64 -d
|
|
```
|
|
|
|
### Secret Cloning (Kyverno)
|
|
|
|
Secrets labeled `allowedToBeCloned: "true"` in the `secrets` namespace are automatically cloned to new namespaces.
|
|
|
|
```yaml
# Example: secrets-namespace.yaml
apiVersion: v1
kind: Secret
metadata:
  name: shared-credentials
  namespace: secrets
  labels:
    allowedToBeCloned: "true"
type: Opaque
data:
  API_KEY: <base64-encoded-value>
```
|
|
|
|
When a new namespace is created, Kyverno automatically copies this secret.
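
Values under `data:` must be base64-encoded without a trailing newline; `echo` adds one, which then becomes part of the secret value. A minimal sketch of an encoding helper follows (the name `encode_secret_value` is hypothetical):

```bash
# Hypothetical helper: base64-encode a value for use under `data:` in a Secret.
encode_secret_value() {
  # printf '%s' writes the value without a trailing newline;
  # tr removes the line wrapping that some base64 implementations add.
  printf '%s' "$1" | base64 | tr -d '\n'
}
```

For example, `encode_secret_value secret123` prints `c2VjcmV0MTIz`, which can be pasted in place of `<base64-encoded-value>`.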
|
|
|
|
### Authentication Secrets
|
|
|
|
Applications using the authentication sidecar require specific secrets depending on the auth mode.
|
|
|
|
#### Token Mode Secrets
|
|
|
|
Token-based auth uses an `auth-tokens` Secret:
|
|
|
|
```bash
|
|
# Method 1: From Helm values (automatic)
|
|
# Tokens specified in values.yaml are automatically created
|
|
|
|
# Method 2: Manual creation
|
|
kubectl create secret generic auth-tokens \
|
|
--from-literal=tokens="token1
|
|
token2
|
|
token3" \
|
|
--namespace=myapp
|
|
|
|
# Method 3: From file
|
|
echo "d4f88f6d9292c10cc3e21c4aad56d2be485db532b54fe961d738e1137d247823" > tokens.txt
|
|
echo "8803f621acc3898df1d7a8f514bc3602551a0681a8f747bd4e43c3c5849d57a7" >> tokens.txt
|
|
kubectl create secret generic auth-tokens \
|
|
--from-file=tokens=tokens.txt \
|
|
--namespace=myapp
|
|
rm tokens.txt
|
|
```
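
The example tokens in Method 3 are 64 hexadecimal characters, the same shape that `openssl rand -hex 32` (used elsewhere in this runbook) produces. Where `openssl` is unavailable, equivalent tokens can be generated from `/dev/urandom`. The sketch below is an alternative under that assumption, not a format the auth sidecar is known to require, and both function names are hypothetical.

```bash
# Hypothetical helper: print one token of 64 lowercase hex characters
# (32 random bytes), followed by a newline.
generate_token() {
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  printf '\n'
}

# Hypothetical helper: print N tokens, one per line.
# The body runs in a subshell so the loop variables stay local.
generate_tokens() (
  count=$1
  i=0
  while [ "$i" -lt "$count" ]; do
    generate_token
    i=$((i + 1))
  done
)
```

`generate_tokens 3 > tokens.txt` writes a file that can be passed to `--from-file=tokens=tokens.txt` as in Method 3; delete the file afterwards.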
|
|
|
|
#### OIDC Mode Secrets
|
|
|
|
OIDC auth requires an `auth-oidc` Secret with two keys:
|
|
|
|
```bash
|
|
# Generate secrets
|
|
CLIENT_SECRET="your-oidc-client-secret-from-provider"
|
|
COOKIE_SECRET=$(openssl rand -hex 32)
|
|
|
|
# Create plain secret
|
|
kubectl create secret generic auth-oidc \
|
|
--from-literal=client-secret=$CLIENT_SECRET \
|
|
--from-literal=cookie-secret=$COOKIE_SECRET \
|
|
--namespace=myapp \
|
|
--dry-run=client -o yaml > private/myapp-auth-oidc.yaml
|
|
|
|
# Seal it
|
|
kubeseal --format=yaml \
|
|
--cert=pub-cert.pem \
|
|
--namespace=myapp \
|
|
< private/myapp-auth-oidc.yaml \
|
|
> secrets/myapp-auth-oidc-sealed.yaml
|
|
|
|
# Apply sealed secret
|
|
kubectl apply -f secrets/myapp-auth-oidc-sealed.yaml
|
|
|
|
# Commit to Git
|
|
git add secrets/myapp-auth-oidc-sealed.yaml
|
|
git commit -m "Add OIDC secrets for myapp"
|
|
git push
|
|
|
|
# Clean up
|
|
rm private/myapp-auth-oidc.yaml
|
|
```
|
|
|
|
#### Rotating Authentication Secrets
|
|
|
|
**Token Rotation**:
|
|
|
|
```bash
|
|
# Generate new token
|
|
NEW_TOKEN=$(openssl rand -hex 32)
|
|
|
|
# Get current tokens
|
|
kubectl get secret auth-tokens -n myapp -o yaml > /tmp/tokens.yaml
|
|
|
|
# Edit tokens (add new, optionally remove old)
|
|
# Then re-seal and apply
|
|
|
|
# Restart pods to use new tokens
|
|
kubectl rollout restart deployment myapp -n myapp
|
|
```
|
|
|
|
**OIDC Secret Rotation**:
|
|
|
|
```bash
# Rotate cookie secret (safe, but invalidates existing sessions)
NEW_COOKIE_SECRET=$(openssl rand -hex 32)

# CLIENT_SECRET must be set to the current OIDC client secret from the provider.
# Recreate the secret. Also commit the sealed output to
# secrets/myapp-auth-oidc-sealed.yaml, otherwise Git still holds the old value.
kubectl create secret generic auth-oidc \
  --from-literal=client-secret="$CLIENT_SECRET" \
  --from-literal=cookie-secret="$NEW_COOKIE_SECRET" \
  --namespace=myapp \
  --dry-run=client -o yaml | \
  kubeseal --format=yaml --cert=pub-cert.pem --namespace=myapp | \
  kubectl apply -f -

# Restart to pick up new secret
kubectl rollout restart deployment myapp -n myapp
```
|
|
|
|
#### Viewing Authentication Secrets
|
|
|
|
```bash
|
|
# List auth-related secrets
|
|
kubectl get secrets -n myapp | grep auth
|
|
|
|
# View token secret (tokens are in plain text in the Secret)
|
|
kubectl get secret auth-tokens -n myapp -o jsonpath='{.data.tokens}' | base64 -d
|
|
|
|
# View OIDC secret keys (values are base64 encoded)
|
|
kubectl get secret auth-oidc -n myapp -o jsonpath='{.data.client-secret}' | base64 -d
|
|
kubectl get secret auth-oidc -n myapp -o jsonpath='{.data.cookie-secret}' | base64 -d
|
|
```
|
|
|
|
**See**: [Developer Guide - Enabling Authentication](../docs/DEVELOPER-GUIDE.md#enabling-authentication-for-applications) for complete authentication setup guide.
|
|
|
|
---
|
|
|
|
## Monitoring & Alerting
|
|
|
|
### Prometheus Metrics
|
|
|
|
```bash
|
|
# Port forward to Prometheus
|
|
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
|
|
|
|
# Access: http://localhost:9090
|
|
```
|
|
|
|
**Common Queries:**
|
|
```promql
|
|
# CPU usage per pod
|
|
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
|
|
|
|
# Memory usage per pod
|
|
sum(container_memory_usage_bytes) by (pod)
|
|
|
|
# Request rate per service
|
|
rate(http_requests_total[5m])
|
|
```
|
|
|
|
### Grafana Dashboards
|
|
|
|
```bash
|
|
# Port forward to Grafana
|
|
kubectl port-forward -n monitoring svc/grafana 3000:80
|
|
|
|
# Access: http://localhost:3000
|
|
```
|
|
|
|
### Loki Logs
|
|
|
|
```bash
# Port forward to Loki
kubectl port-forward -n monitoring svc/loki 3100:3100

# Query logs from the last hour
# (start is a Unix timestamp in nanoseconds; "date -d" requires GNU date)
curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="myapp"}' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" | jq
```
|
|
|
|
### Tempo Traces
|
|
|
|
```bash
|
|
# Port forward to Tempo query API
|
|
kubectl port-forward -n monitoring svc/tempo 3200:3200
|
|
|
|
# Access: http://localhost:3200
|
|
```
|
|
|
|
**Query traces via Grafana:**
|
|
1. Open Grafana → Explore
|
|
2. Select Tempo datasource
|
|
3. Use TraceQL or search by service name
|
|
|
|
**Verify Traefik is sending traces:**
|
|
```bash
|
|
# Check Traefik logs for OTLP export errors
|
|
kubectl logs -n traefik-system -l app.kubernetes.io/name=traefik | grep -i "traces export"
|
|
|
|
# Check Tempo is receiving data
|
|
kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep "receiver"
|
|
```
|
|
|
|
**Trace-to-log correlation:**
|
|
- Click a trace span in Grafana → linked Loki logs appear (by namespace, pod, container)
|
|
- Trace-to-metrics links to Prometheus by service name
|
|
|
|
### Fluent-Bit Log Shipping
|
|
|
|
Verify Fluent-Bit is shipping logs:
|
|
|
|
```bash
|
|
# Check Fluent-Bit pods
|
|
kubectl get pods -n monitoring | grep fluent-bit
|
|
|
|
# Check logs
|
|
kubectl logs -n monitoring daemonset/fluent-bit
|
|
|
|
# Verify Loki is receiving logs
|
|
kubectl logs -n monitoring deployment/loki | grep "POST /loki/api/v1/push"
|
|
```
|
|
|
|
### Trivy Vulnerability Scanning
|
|
|
|
```bash
|
|
# Check Trivy scan results
|
|
kubectl get vulnerabilityreports --all-namespaces
|
|
|
|
# View report for specific pod
|
|
kubectl describe vulnerabilityreport -n myapp <report-name>
|
|
```
|
|
|
|
### Slack Notifications
|
|
|
|
All applications have Slack notifications enabled:
|
|
|
|
```yaml
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: ""
    notifications.argoproj.io/subscribe.on-sync-failed.slack: ""
    notifications.argoproj.io/subscribe.on-degraded.slack: ""
```
|
|
|
|
**Test Notification:**
|
|
```bash
# Hard-refresh the application; a notification is sent if this leads to a sync
kubectl patch application myapp -n argocd \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
```
|
|
|
|
---
|
|
|
|
## Troubleshooting
|
|
|
|
### Application Won't Sync
|
|
|
|
#### Check Application Status
|
|
|
|
```bash
|
|
kubectl describe application myapp -n argocd
|
|
```
|
|
|
|
Look for errors in:
|
|
- `Status.Conditions`
|
|
- `Status.OperationState`
|
|
|
|
#### Common Issues
|
|
|
|
**Issue 1: Image Pull Error**
|
|
```bash
|
|
# Error: ErrImagePull, ImagePullBackOff
|
|
|
|
# Check if image exists
|
|
docker pull ghcr.io/fortedigital/myapp:v1.0.0
|
|
|
|
# Check image pull secrets
|
|
kubectl get secrets -n myapp | grep regcred
|
|
|
|
# Check pod events
|
|
kubectl describe pod -n myapp <pod-name>
|
|
```
|
|
|
|
**Issue 2: Invalid YAML**
|
|
```bash
# Error: unable to decode manifest

# Validate YAML locally
kubectl apply --dry-run=client -f apps/myapp.yaml

# Check ArgoCD application controller logs
# (a StatefulSet in current ArgoCD releases; use deployment/... if your install differs)
kubectl logs -n argocd statefulset/argocd-application-controller | grep myapp
```
|
|
|
|
**Issue 3: Resource Quota Exceeded**
|
|
```bash
|
|
# Error: exceeded quota
|
|
|
|
# Check namespace quotas
|
|
kubectl get resourcequota -n myapp
|
|
kubectl describe resourcequota -n myapp
|
|
|
|
# Increase quota or reduce resource requests
|
|
```
|
|
|
|
### Pod Crashes
|
|
|
|
#### CrashLoopBackOff
|
|
|
|
```bash
|
|
# Check pod status
|
|
kubectl get pods -n myapp
|
|
|
|
# View logs
|
|
kubectl logs -n myapp <pod-name>
|
|
kubectl logs -n myapp <pod-name> --previous # Previous container
|
|
|
|
# Check events
|
|
kubectl describe pod -n myapp <pod-name>
|
|
```
|
|
|
|
**Common Causes:**
|
|
- Application error (check logs)
|
|
- Missing environment variables
|
|
- Wrong port configuration
|
|
- Missing secrets
|
|
- Insufficient memory/CPU
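
To narrow down which pods need the checks above, the `kubectl get pods` table can be filtered by its STATUS column. This is a minimal sketch, assuming the default column order (NAME, READY, STATUS, RESTARTS, AGE); the helper name `problem_pods` is hypothetical.

```bash
# Hypothetical helper: read `kubectl get pods` output on stdin and print the
# name and status of every pod that is neither Running nor Completed.
problem_pods() {
  awk 'NR > 1 && NF >= 3 && $3 != "Running" && $3 != "Completed" {
    printf "%s\t%s\n", $1, $3
  }'
}
```

Use it as `kubectl get pods -n myapp | problem_pods`. A pod that is `Running` but not ready (for example `0/1` in the READY column) is not reported by this filter, so check READY separately.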
|
|
|
|
#### ImagePullBackOff
|
|
|
|
```bash
|
|
# Check image name
|
|
kubectl get deployment myapp -n myapp -o yaml | grep image
|
|
|
|
# Verify credentials
|
|
kubectl get secret -n myapp
|
|
```
|
|
|
|
#### Pending
|
|
|
|
```bash
|
|
# Check why pod is pending
|
|
kubectl describe pod -n myapp <pod-name>
|
|
|
|
# Common reasons:
|
|
# - Insufficient resources on nodes
|
|
# - PVC not bound
|
|
# - Node selector doesn't match
|
|
```
|
|
|
|
### Ingress / TLS Issues
|
|
|
|
#### Application Not Accessible
|
|
|
|
```bash
|
|
# Check IngressRoute
|
|
kubectl get ingressroute -n myapp
|
|
kubectl describe ingressroute myapp -n myapp
|
|
|
|
# Check Traefik
|
|
kubectl get pods -n traefik
|
|
kubectl logs -n traefik deployment/traefik
|
|
|
|
# Test with port-forward
|
|
kubectl port-forward -n myapp service/myapp 8080:3000
|
|
curl http://localhost:8080
|
|
```
|
|
|
|
#### Certificate Issues
|
|
|
|
```bash
|
|
# Check certificates
|
|
kubectl get certificate -n myapp
|
|
kubectl describe certificate myapp-tls -n myapp
|
|
|
|
# Check cert-manager
|
|
kubectl get clusterissuer
|
|
kubectl logs -n cert-manager deployment/cert-manager
|
|
|
|
# Check Let's Encrypt challenges
|
|
kubectl get challenges --all-namespaces
|
|
```
|
|
|
|
**Manual Certificate Renewal:**
|
|
```bash
|
|
# Delete and recreate certificate
|
|
kubectl delete certificate myapp-tls -n myapp
|
|
|
|
# Certificate will be automatically recreated
|
|
```
|
|
|
|
### Database Issues

#### PostgreSQL Won't Start

```bash
# Check StatefulSet
kubectl get statefulset -n myapp
kubectl describe statefulset postgres -n myapp

# Check PVC
kubectl get pvc -n myapp
kubectl describe pvc -n myapp

# Check logs
kubectl logs -n myapp postgres-0
```

#### Data Persistence

```bash
# Verify PVC is bound
kubectl get pvc -n myapp

# Check storage class
kubectl get storageclass

# Resize PVC (if supported)
kubectl edit pvc postgres-data-postgres-0 -n myapp
# Change: storage: 10Gi (from 5Gi)
```

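The resize step can also be done without an interactive editor. The sketch below is an illustration under assumptions: `pvc_resize_patch` is a hypothetical helper that only builds the merge patch, and the resize succeeds only if the storage class has `allowVolumeExpansion` set to true.

```bash
#!/usr/bin/env bash
# pvc_resize_patch (hypothetical helper): prints the merge patch that
# requests a new PVC size such as "10Gi". Anything that is not
# <integer><Ki|Mi|Gi|Ti> is rejected so typos are caught before patching.
pvc_resize_patch() {
  local size="$1"
  if ! printf '%s' "$size" | grep -Eq '^[0-9]+(Ki|Mi|Gi|Ti)$'; then
    echo "invalid size: $size" >&2
    return 1
  fi
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}\n' "$size"
}

# Usage (requires cluster access):
#   # Check whether the storage class allows expansion
#   kubectl get storageclass \
#     -o custom-columns=NAME:.metadata.name,EXPANSION:.allowVolumeExpansion
#   # Apply the new size
#   kubectl patch pvc postgres-data-postgres-0 -n myapp --type merge \
#     -p "$(pvc_resize_patch 10Gi)"
```

If the PVC is managed through Git, the same size should also be changed there so that ArgoCD does not report drift.
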
### Kyverno Policy Issues

#### Policy Violations

```bash
# List policies
kubectl get clusterpolicy

# Check policy reports
kubectl get policyreport --all-namespaces

# View specific policy
kubectl describe clusterpolicy secret-cloner
```

#### Secret Not Cloned

```bash
# Check if secret has label
kubectl get secret -n secrets --show-labels

# Check Kyverno logs
kubectl logs -n kyverno deployment/kyverno

# Manually trigger by recreating namespace
# WARNING: deleting a namespace deletes everything in it; only do this for an empty or disposable namespace
kubectl delete ns test-ns
kubectl create ns test-ns
```

### ArgoCD Issues

#### ArgoCD UI Not Accessible

```bash
# Check ArgoCD pods
kubectl get pods -n argocd

# Restart ArgoCD server
kubectl rollout restart deployment argocd-server -n argocd

# Port forward
kubectl port-forward svc/argocd-server -n argocd 8080:443
```

#### Sync Takes Too Long

```bash
# Check application controller logs
# (the controller runs as a StatefulSet in current ArgoCD releases;
# use deployment/argocd-application-controller on older installs)
kubectl logs -n argocd statefulset/argocd-application-controller

# Increase timeout (in apps/myapp.yaml)
spec:
  syncPolicy:
    retry:
      backoff:
        maxDuration: 5m  # Increase from 3m
```

---

## Disaster Recovery

### Backup Strategy

**Current State**: No automated backups

**What Needs Backup**:
- ❌ Cluster state (not backed up - recreate via GitOps)
- ❌ Persistent volumes (currently not critical)
- ✅ Git repositories (GitHub provides backup)
- ⚠️ Secrets (sealed secrets in Git, unseal keys need safekeeping)

### Cluster Rebuild

**Scenario**: Complete cluster failure

```bash
# 1. Provision new Kubernetes cluster

# 2. Configure kubectl
kubectl config use-context new-cluster
kubectl cluster-info

# 3. Bootstrap cluster
cd ~/dev/k8s/launchpad
./bootstrap.sh

# 4. Wait for ArgoCD to sync all applications
kubectl get applications -n argocd -w

# 5. Recreate any unsealed secrets (from password manager)
# 6. Configure DNS for new cluster IPs
# 7. Verify all applications are healthy
```

**Time Estimate**: 30-60 minutes

**Data Loss**:
- Ephemeral data: Lost
- Database data: Lost (no backups currently)
- Configuration: No loss (in Git)

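Steps 4 and 7 of the rebuild can be checked without watching the application list by hand. The following is a minimal sketch, not an existing script in the repository: `apps_not_ready` is a hypothetical helper that expects three whitespace-separated columns (name, sync status, health status) on stdin and prints every application that is not both `Synced` and `Healthy`.

```bash
#!/usr/bin/env bash
# apps_not_ready (hypothetical helper): reads "NAME SYNC HEALTH" lines on
# stdin and prints each application that is not Synced and Healthy.
# Prints nothing when every application is ready.
apps_not_ready() {
  awk '$2 != "Synced" || $3 != "Healthy" { print $1 " (" $2 "/" $3 ")" }'
}

# Usage (requires cluster access):
#   kubectl get applications -n argocd --no-headers \
#     -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status \
#     | apps_not_ready
```

An empty result means every application reports `Synced` and `Healthy`, which is a reasonable signal to continue with the secret and DNS steps.
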
### Future Backup Plan

**Recommended**:

1. **Velero** for cluster backups
   ```bash
   helm install velero vmware-tanzu/velero \
     --namespace velero \
     --create-namespace \
     --set configuration.provider=aws \
     --set 'configuration.backupStorageLocation[0].bucket=cluster-backups'
   ```

2. **PostgreSQL backups** via CronJob
   ```yaml
   # pg-backup-cronjob.yaml
   apiVersion: batch/v1
   kind: CronJob
   metadata:
     name: pg-backup  # example name
   spec:
     schedule: "0 2 * * *"  # Daily at 2am
     jobTemplate:
       spec:
         template:
           spec:
             restartPolicy: OnFailure  # Job pods must not use the default (Always)
             containers:
               - name: pg-dump
                 image: postgres:16-alpine
                 # Still to be added: DB_USER/DB_NAME env vars, host and
                 # credentials, and a volume mounted at /backup
                 command:
                   - /bin/sh
                   - -c
                   - pg_dump -U $DB_USER -d $DB_NAME > /backup/dump-$(date +%Y%m%d).sql
   ```

3. **Sealed Secrets private key backup**
   ```bash
   # Backup sealed-secrets controller private key(s)
   # Selected by label, because key secret names carry a generated suffix
   kubectl get secret -n kube-system \
     -l sealedsecrets.bitnami.com/sealed-secrets-key \
     -o yaml > sealed-secrets-key-backup.yaml

   # Store in secure location (password manager, vault)
   ```

---

## Maintenance Procedures

### Upgrading ArgoCD

```bash
# Check current version
kubectl get deployment argocd-server -n argocd \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Update version in values
vim infra/values/base/argocd-values.yaml

# Or upgrade via Helm directly
helm upgrade argocd argo-cd \
  --repo https://argoproj.github.io/argo-helm \
  --namespace argocd \
  --values infra/values/base/argocd-values.yaml \
  --version 6.0.0  # New version

# Verify
kubectl get pods -n argocd
```

### Upgrading Kubernetes Version

```bash
# UpCloud: Upgrade via control panel or CLI

# After upgrade, verify cluster
kubectl version
kubectl get nodes

# Check for deprecated APIs
# (api-resources lists what the cluster still serves; compare it with the apiVersions used in Git)
kubectl api-resources

# Update any deprecated resources in Git
```

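Because `kubectl api-resources` only shows what the cluster serves, the manifests in Git still have to be searched for API versions that a newer Kubernetes release no longer accepts. The sketch below is one way to do that with `grep`. `find_removed_apis` is a hypothetical helper, and the list of API groups is a small, non-exhaustive sample (Ingress under `extensions/v1beta1` and `networking.k8s.io/v1beta1` was removed in 1.22; `batch/v1beta1` and `policy/v1beta1` were removed in 1.25). Check the Kubernetes deprecation guide for the target version.

```bash
#!/usr/bin/env bash
# find_removed_apis (hypothetical helper): searches YAML files under a
# directory for apiVersion values that newer Kubernetes releases removed.
# The pattern covers only a sample of removed API groups, not all of them.
# Prints "file:line:match" for each hit; exits non-zero when nothing matches.
find_removed_apis() {
  local dir="${1:-.}"
  grep -rnE --include='*.yaml' --include='*.yml' \
    '^[[:space:]]*apiVersion:[[:space:]]*(extensions/v1beta1|networking\.k8s\.io/v1beta1|batch/v1beta1|policy/v1beta1)[[:space:]]*$' \
    "$dir"
}

# Usage, from the root of a cloned config repository:
#   find_removed_apis . || echo "no removed API versions found"
```

Running this before the upgrade rather than after gives time to fix the manifests while the old API versions are still served.
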
### Rotating TLS Certificates

Let's Encrypt certificates auto-renew, but if manual rotation is needed:

```bash
# Delete certificate to force renewal
kubectl delete certificate myapp-tls -n myapp

# Cert-manager will automatically recreate
kubectl get certificate -n myapp -w
```

### Cleaning Up Old Resources

```bash
# List all namespaces
kubectl get namespaces

# Remove unused namespaces
kubectl delete namespace old-app

# Clean up ArgoCD applications
kubectl get applications -n argocd
kubectl delete application old-app -n argocd

# Clean up old container images (on nodes)
# SSH to nodes and run (Docker runtime):
docker image prune -a --filter "until=720h"  # 30 days
# On containerd-based nodes docker is usually not installed; use instead:
#   crictl rmi --prune
```

### DNS Management

**Adding New Subdomain**:

1. Add DNS A record pointing to Traefik LoadBalancer IP
   ```bash
   # Get LoadBalancer IP
   kubectl get svc -n traefik traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
   ```

2. Add to DNS provider:
   ```
   myapp.forteapps.net  A  <LoadBalancer-IP>
   ```

3. Verify DNS propagation:
   ```bash
   nslookup myapp.forteapps.net
   dig myapp.forteapps.net
   ```

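The propagation check can be made explicit by comparing the DNS answer with the LoadBalancer IP instead of reading both by eye. This is a small sketch under assumptions: `dns_points_to` is a hypothetical helper that reads resolved addresses (one per line, as printed by `dig +short`) on stdin and succeeds only if the expected IP is among them.

```bash
#!/usr/bin/env bash
# dns_points_to (hypothetical helper): reads resolved IP addresses on stdin,
# one per line, and returns 0 only if the expected IP is an exact match
# for one of them.
dns_points_to() {
  local expected_ip="$1"
  grep -Fxq -- "$expected_ip"
}

# Usage (requires cluster access and working DNS):
#   lb_ip=$(kubectl get svc -n traefik traefik \
#     -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
#   dig +short myapp.forteapps.net | dns_points_to "$lb_ip" \
#     && echo "DNS OK" || echo "DNS does not point to $lb_ip yet"
```

The exact-line match avoids false positives such as `203.0.113.100` matching an expected `203.0.113.10`.
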
### Monitoring Resource Usage

```bash
# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods --all-namespaces

# Identify resource hogs
kubectl top pods --all-namespaces --sort-by=memory
kubectl top pods --all-namespaces --sort-by=cpu
```

---

## Advanced Operations

### Adding a New Infrastructure Component

Example: Adding Redis

```bash
# 1. Create application manifest in base/
cat > infra/base/redis-application.yaml <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: redis
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  project: default
  sources:
    - repoURL: https://charts.bitnami.com/bitnami
      chart: redis
      targetRevision: 18.0.0
      helm:
        releaseName: redis
        valueFiles:
          - \$values/infra/values/base/redis-values.yaml
    - repoURL: ssh://git@git.forteapps.net:2222/Forte/launchpad.git
      targetRevision: HEAD
      ref: values
  destination:
    server: https://kubernetes.default.svc
    namespace: redis
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
EOF

# 2. Add to base kustomization
# Edit infra/base/kustomization.yaml and add: - redis-application.yaml

# 3. Create base values file
cat > infra/values/base/redis-values.yaml <<EOF
auth:
  enabled: true
EOF

# 4. Commit and push
git add infra/base/redis-application.yaml infra/values/base/redis-values.yaml infra/base/kustomization.yaml
git commit -m "Add Redis infrastructure component"
git push

# 5. ArgoCD will auto-sync within 60 seconds
```

### Multi-Cluster Setup

The repository supports multiple clusters via Kustomize overlays:

- **upc-dev** (default): `infra/overlays/upc-dev/`, which uses the base Applications as-is
- **upc-prod**: `infra/overlays/upc-prod/`, which patches value file paths from `upc-dev` to `upc-prod`

Each cluster has its own:
- Root app-of-apps file: `_app-of-apps-upc-dev.yaml` / `_app-of-apps-upc-prod.yaml`
- Cluster-specific Helm values: `infra/values/upc-dev/` / `infra/values/upc-prod/`
- Sealed secrets: `secrets/upc-dev/` (others as needed)
- Apps overlay: `apps/overlays/upc-dev/` / `apps/overlays/upc-prod/`

To add a new cluster, create a new overlay directory (e.g., `infra/overlays/upc-staging/`) with patches that swap the value file paths.

### Blue-Green Deployments

```bash
# Deploy blue version
helm install myapp-blue forteapp \
  --set app.image.tag=v1.0.0

# Deploy green version
helm install myapp-green forteapp \
  --set app.image.tag=v2.0.0

# Switch traffic via IngressRoute
# A JSON patch changes only the service name; a merge patch would replace
# the whole routes list and drop each route's match rule
kubectl patch ingressroute myapp -n myapp --type json \
  -p '[{"op":"replace","path":"/spec/routes/0/services/0/name","value":"myapp-green"}]'

# Remove blue deployment after validation
helm uninstall myapp-blue
```

---

## Emergency Procedures

### Emergency Rollback

```bash
# Immediate rollback
# Note: if the Application has automated sync with selfHeal, ArgoCD reverts
# manual changes, so follow up with the Git revert right away
kubectl rollout undo deployment myapp -n myapp

# Update Git to make permanent
cd ~/dev/k8s/helm-prod-values
git revert HEAD
git push
```

### Emergency Scale Down

```bash
# Scale to zero (maintenance mode)
# Note: with selfHeal enabled, the replica count is restored from Git until
# the change below is pushed
kubectl scale deployment myapp -n myapp --replicas=0

# Update Git
vim helm-values/myapp/values.yaml
# Set replicaCount: 0
git commit -am "Scale down myapp for maintenance"
git push
```

### Emergency Application Removal

```bash
# Remove application but keep data
kubectl patch application myapp -n argocd \
  -p '{"metadata":{"finalizers":[]}}' --type merge
kubectl delete application myapp -n argocd

# Resources remain in cluster
```

---

## Useful Scripts

### Sync All Applications

```bash
#!/bin/bash
# sync-all.sh
# Requests a hard refresh of every Application; applications with
# automated sync then reconcile against Git
for app in $(kubectl get applications -n argocd -o name); do
  kubectl patch "$app" -n argocd \
    --type merge \
    -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
done
```

### Check All Applications Health

```bash
#!/bin/bash
# health-check.sh
# The column spec is one quoted string so line breaks cannot split it
kubectl get applications -n argocd \
  -o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status,MESSAGE:.status.health.message'
```

### Seal Secret Helper

```bash
#!/bin/bash
# seal-secret.sh
# Usage: seal-secret.sh [namespace] [secret-file] [output-file]
set -euo pipefail  # stop on errors instead of writing a broken output file

NAMESPACE=${1:-default}
SECRET_FILE=${2:-private/secret.yaml}
OUTPUT_FILE=${3:-secrets/secret-sealed.yaml}

kubeseal --format=yaml \
  --cert=pub-cert.pem \
  --namespace="$NAMESPACE" \
  < "$SECRET_FILE" \
  > "$OUTPUT_FILE"

echo "Sealed secret created: $OUTPUT_FILE"
echo "Remember to delete: $SECRET_FILE"
```

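Since the helper only reminds the operator to delete the plain secret, a quick scan before committing can catch one that was left behind. The following is a minimal sketch, not an existing script: `find_plain_secrets` is a hypothetical helper that assumes manifests declare `kind:` at the start of a line, as `kubeseal` output and typical hand-written manifests do.

```bash
#!/usr/bin/env bash
# find_plain_secrets (hypothetical helper): lists YAML files under a
# directory that declare a top-level "kind: Secret". Files with
# "kind: SealedSecret" do not match. Exits non-zero when nothing is found.
find_plain_secrets() {
  local dir="${1:-.}"
  grep -rlE --include='*.yaml' --include='*.yml' \
    '^kind:[[:space:]]*Secret[[:space:]]*$' "$dir"
}

# Usage, before `git add`:
#   if find_plain_secrets secrets/; then
#     echo "Plain Secret manifests found; seal or remove them before committing" >&2
#   fi
```

The check is deliberately simple and would miss secrets embedded in multi-document files with indented or templated `kind` fields.
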
---

## Checklist Templates

### New Application Deployment Checklist

- [ ] Application code repository created
- [ ] Dockerfile created and tested
- [ ] GitHub Actions workflow configured
- [ ] Helm values created in `helm-prod-values/`
- [ ] ArgoCD application manifest created in `apps/`
- [ ] Secrets created and sealed
- [ ] DNS record added for domain
- [ ] Application synced successfully
- [ ] Health check passed
- [ ] Slack notification received
- [ ] Application accessible via domain
- [ ] Monitoring configured
- [ ] Documentation updated

### Incident Response Checklist

- [ ] Incident identified (Slack alert, monitoring)
- [ ] Severity assessed
- [ ] Incident channel created
- [ ] Initial investigation (logs, metrics, events)
- [ ] Root cause identified
- [ ] Mitigation applied
- [ ] Verification of fix
- [ ] Post-mortem scheduled
- [ ] Documentation updated

---

**Last Updated**: 2026-03-16
**Maintained By**: Platform Team
**Emergency Contact**: #platform-support on Slack