From d02da33700d3519896a97be73e368a555e754613 Mon Sep 17 00:00:00 2001 From: Danijel Simeunovic Date: Mon, 16 Mar 2026 11:00:42 +0100 Subject: [PATCH] docs --- README.md | 532 +++++++++++---- docs/DEVELOPER-GUIDE.md | 1089 +++++++++++++++++++++++++++++++ docs/GITOPS-ARCHITECTURE.md | 640 ++++++++++++++++++ docs/OPERATIONS-RUNBOOK.md | 1217 +++++++++++++++++++++++++++++++++++ docs/README.md | 327 ++++++++++ docs/REFERENCE.md | 1070 ++++++++++++++++++++++++++++++ 6 files changed, 4766 insertions(+), 109 deletions(-) create mode 100644 docs/DEVELOPER-GUIDE.md create mode 100644 docs/GITOPS-ARCHITECTURE.md create mode 100644 docs/OPERATIONS-RUNBOOK.md create mode 100644 docs/README.md create mode 100644 docs/REFERENCE.md diff --git a/README.md b/README.md index 429dae3..e88d271 100644 --- a/README.md +++ b/README.md @@ -1,149 +1,463 @@ -## Overview +# Kubernetes Cluster - GitOps Configuration -This is a **Kubernetes cluster bootstrapping and GitOps configuration repository** using ArgoCD. It defines the infrastructure-as-code for deploying and managing applications, services, and policies on Kubernetes clusters. 
+> **Kubernetes cluster bootstrapping and GitOps configuration repository** using ArgoCD for UpCloud Managed Kubernetes -## Repository Structure +[![GitOps](https://img.shields.io/badge/GitOps-ArgoCD-blue)](https://argoproj.github.io/cd/) +[![Kubernetes](https://img.shields.io/badge/Kubernetes-UpCloud-orange)](https://upcloud.com/) + +--- + +## 📚 Complete Documentation + +**New developers and operators**: Please refer to our comprehensive documentation for detailed guides and references: + +### 🎯 [**START HERE: Documentation Index**](docs/README.md) + +| Document | Description | Audience | +|----------|-------------|----------| +| **[GitOps Architecture](docs/GITOPS-ARCHITECTURE.md)** | System architecture, repository structure, GitOps workflows, security model | Everyone (start here) | +| **[Developer Guide](docs/DEVELOPER-GUIDE.md)** | Local setup, deploying apps, managing secrets, troubleshooting | Developers | +| **[Operations Runbook](docs/OPERATIONS-RUNBOOK.md)** | Cluster bootstrap, day-to-day operations, incident response, maintenance | Platform Engineers, SREs | +| **[Technical Reference](docs/REFERENCE.md)** | Component specs, Helm charts, ArgoCD config, Kyverno policies, API docs | Everyone (reference) | + +--- + +## 🚀 Quick Start + +### For New Developers +```bash +# 1. Clone repositories +git clone https://github.com/snothub/sturdy-adventure.git +git clone git@github.com:fortedigital/helm-values.git + +# 2. Read the guides +# - Start: docs/GITOPS-ARCHITECTURE.md +# - Follow: docs/DEVELOPER-GUIDE.md + +# 3. Deploy your first app (see Developer Guide) +``` + +### For Operators +```bash +# 1. Bootstrap new cluster +./bootstrap.sh + +# 2. Verify deployment +kubectl get applications -n argocd +kubectl get pods --all-namespaces + +# 3. Read Operations Runbook for day-to-day tasks +``` + +--- + +## 📋 Overview + +This repository contains the complete GitOps configuration for our Kubernetes cluster, using the **App-of-Apps pattern** with ArgoCD. 
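As a rough illustration of the App-of-Apps pattern named above, a root Application could look something like the sketch below. The repository URL and the `infra/` path come from this README; `metadata.name`, `project`, and `targetRevision` are assumptions, and the real `_app-of-apps.yaml` may differ.

```yaml
# Hypothetical sketch of a root "App-of-Apps" Application.
# name, project and targetRevision are assumed, not taken from the repo.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-of-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/snothub/sturdy-adventure.git
    targetRevision: HEAD
    path: infra            # each Application manifest found here becomes a child
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd      # child Application objects live in the argocd namespace
  syncPolicy:
    automated:
      prune: true          # remove children that were deleted from Git
      selfHeal: true       # revert manual changes made in the cluster
```

The design point is that only this one manifest is applied by hand (or by `bootstrap.sh`); everything else is discovered from Git.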
+ +### What's Inside + +- **Infrastructure Applications**: Traefik, Cert-Manager, Kyverno, Prometheus, Grafana, Loki, Sealed Secrets +- **Business Applications**: MCP10X, MusicMan, Dot-AI Stack, ArgoCD MCP +- **Policies**: Kyverno security policies for secret management, namespace controls, pod verification +- **Monitoring**: Full observability stack with metrics, logs, and alerting +- **Secrets**: Sealed Secrets for secure Git storage + +### Key Features + +βœ… **GitOps-Native**: Git is the single source of truth +βœ… **Auto-Sync**: Changes automatically deployed (60s reconciliation) +βœ… **Self-Healing**: Manual cluster changes are reverted +βœ… **Multi-Source**: Separate chart templates from configuration +βœ… **Policy Enforcement**: Kyverno ensures security and compliance +βœ… **TLS Everywhere**: Automatic Let's Encrypt certificates +βœ… **Full Observability**: Prometheus, Grafana, Loki integration + +--- + +## πŸ—‚οΈ Repository Structure ``` . -β”œβ”€β”€ bootstrap.sh # Main bootstrap script to initialize ArgoCD and cluster -β”œβ”€β”€ _app-of-apps.yaml # App-of-apps pattern: main ArgoCD Application that manages all other apps -β”œβ”€β”€ apps/ # Business application resources -β”‚ β”œβ”€β”€ feedback-hub.yaml # Feedback Hub test app -β”‚ β”œβ”€β”€ musicman.yaml # Music Man hackathon app -β”‚ └── dot-ai-stack.yaml # dot-ai AI assistant stack -β”œβ”€β”€ infra/ # Individual ArgoCD Application resources for infrastructure -β”‚ β”œβ”€β”€ enterprise-apps.yaml # Enterprise apps: parent Application that syncs everything in "apps" folder -β”‚ β”œβ”€β”€ traefik-application.yaml # Ingress controller (Traefik) -β”‚ β”œβ”€β”€ cert-manager-application.yaml # TLS certificate management -β”‚ β”œβ”€β”€ kyverno.yaml # Policy engine for security -β”‚ β”œβ”€β”€ kyverno-policies.yaml # Kyverno policy definitions -β”‚ β”œβ”€β”€ prometheus.yaml # Metrics & monitoring -β”‚ β”œβ”€β”€ grafana.yaml # Monitoring visualization -β”‚ β”œβ”€β”€ loki.yaml # Log aggregation -β”‚ β”œβ”€β”€ 
fluent-bit.yaml # Log shipping -β”‚ β”œβ”€β”€ trivy.yaml # Container scanning -β”‚ β”œβ”€β”€ sealedsecrets.yaml # Secret encryption -β”‚ β”œβ”€β”€ cluster-resources-application.yaml # Cluster-wide resources -β”‚ └── values/ # Helm value overrides for ArgoCD and services -β”‚ β”œβ”€β”€ argocd-values.yaml # ArgoCD server configuration -β”‚ β”œβ”€β”€ prometheus-values.yaml -β”‚ β”œβ”€β”€ grafana-values.yaml -β”‚ β”œβ”€β”€ loki-values.yaml -β”‚ └── fluent-bit-values.yaml -└── cluster-resources/ # Cluster-level configurations managed by cluster-resources-application.yaml - β”œβ”€β”€ cert-manager-namespace.yaml - β”œβ”€β”€ secrets-namespace.yaml # Namespace for secrets - β”œβ”€β”€ letsencrypt-issuer.yaml # TLS certificate issuer - β”œβ”€β”€ kyverno-config.yaml # Security policies and secret syncing - β”œβ”€β”€ argocd-notifications-secret-sealed.yaml # Sealed secret for ArgoCD notifications - └── policies/ # Kyverno policy definitions - β”œβ”€β”€ deployment-verifier.yaml # Policy to verify pods have controllers - β”œβ”€β”€ label-checker.yaml # Policy to check labels - β”œβ”€β”€ bare-pod-cleaner.yaml # Policy to clean up pods without controllers - β”œβ”€β”€ replicaset-cleaner.yaml # Policy to clean up orphaned replica sets - β”œβ”€β”€ default-ns-blocker.yaml # Policy to block use of default namespace - └── secret-cloner.yaml # Policy to clone secrets across namespaces +β”œβ”€β”€ bootstrap.sh # Cluster initialization script +β”œβ”€β”€ _app-of-apps.yaml # Root ArgoCD Application (App-of-Apps pattern) +β”‚ +β”œβ”€β”€ infra/ # Infrastructure ArgoCD Applications +β”‚ β”œβ”€β”€ enterprise-apps.yaml # Manages all apps in apps/ folder +β”‚ β”œβ”€β”€ traefik-application.yaml +β”‚ β”œβ”€β”€ cert-manager-application.yaml +β”‚ β”œβ”€β”€ kyverno.yaml +β”‚ β”œβ”€β”€ prometheus.yaml +β”‚ β”œβ”€β”€ grafana.yaml +β”‚ β”œβ”€β”€ loki.yaml +β”‚ β”œβ”€β”€ fluent-bit.yaml +β”‚ β”œβ”€β”€ trivy.yaml +β”‚ β”œβ”€β”€ sealedsecrets.yaml +β”‚ └── values/ # Helm value overrides +β”‚ +β”œβ”€β”€ apps/ # 
Business Applications +β”‚ β”œβ”€β”€ mcp10x.yaml +β”‚ β”œβ”€β”€ musicman.yaml +β”‚ β”œβ”€β”€ dot-ai-stack.yaml +β”‚ └── argo-mcp.yaml +β”‚ +β”œβ”€β”€ cluster-resources/ # Cluster-wide Kubernetes resources +β”‚ β”œβ”€β”€ letsencrypt-issuer.yaml +β”‚ β”œβ”€β”€ kyverno-config.yaml +β”‚ β”œβ”€β”€ *-sealed.yaml # Sealed secrets +β”‚ └── policies/ # Kyverno policies +β”‚ β”œβ”€β”€ secret-cloner.yaml +β”‚ β”œβ”€β”€ default-ns-blocker.yaml +β”‚ β”œβ”€β”€ bare-pod-cleaner.yaml +β”‚ └── auth-sidecar-injector.yaml +β”‚ +β”œβ”€β”€ secrets/ # Application secrets (sealed) +β”‚ └── *-credentials-sealed.yaml +β”‚ +β”œβ”€β”€ private/ # Local-only files (Git-ignored) +β”‚ └── *.yaml # Unsealed secrets (never committed) +β”‚ +└── docs/ # πŸ“š Comprehensive documentation + β”œβ”€β”€ README.md # Documentation index + β”œβ”€β”€ GITOPS-ARCHITECTURE.md # Architecture guide + β”œβ”€β”€ DEVELOPER-GUIDE.md # Developer onboarding + β”œβ”€β”€ OPERATIONS-RUNBOOK.md # Operations procedures + └── REFERENCE.md # Technical reference ``` -## Architecture & Key Concepts +**See [GitOps Architecture - Repository Structure](docs/GITOPS-ARCHITECTURE.md#repository-structure) for detailed explanation.** -### GitOps Model -- **App-of-Apps Pattern**: `_app-of-apps.yaml` is the root Application that manages all infrastructure applications -- **App-of-Apps Pattern**: `infra/enterprise-apps.yaml` is the main Application that manages all custom applications -- **Source of Truth**: GitHub repository (`https://github.com/snothub/sturdy-adventure.git`) is the single source of truth -- **Auto-sync**: All Applications have automated sync enabled with auto-pruning and self-healing -- **Namespace Creation**: `CreateNamespace=true` allows ArgoCD to create namespaces as needed +--- -### Key Components +## πŸ—οΈ Architecture -1. **Traefik** - Kubernetes Ingress controller for routing external traffic with HTTP/HTTPS redirect -2. 
**Cert-Manager** - Automates TLS certificate management with Let's Encrypt (see `letsencrypt-issuer.yaml`) -3. **Kyverno** - Policy engine that enforces security rules and syncs secrets across namespaces (via `sync-secret-with-multi-clone` policy) -4. **Monitoring Stack** - Prometheus (metrics) + Grafana (visualization) + Loki (logs) + Fluent-Bit (log shipping) -5. **Trivy** - Container vulnerability scanning -6. **Sealed Secrets** - Encrypts secrets for safe storage in Git +### Three-Repository Pattern -### Secret Management -- **Kyverno ClusterPolicy**: Automatically clones secrets from the `secrets` namespace to new namespaces when they're created -- Only secrets labeled `allowedToBeCloned: "true"` are cloned -- Syncing happens automatically via `synchronize: true` in the policy +| Repository | Purpose | You Edit | +|------------|---------|----------| +| **[sturdy-adventure](https://github.com/snothub/sturdy-adventure.git)** (this repo) | ArgoCD Applications, cluster resources | ✅ Often | +| **[forte-helm](https://github.com/snothub/forte-helm)** | Generic Helm chart templates | ❌ Rarely | +| **[helm-values](https://github.com/fortedigital/helm-values)** | App-specific configuration & versions | ✅ Sometimes | -### Network Configuration -- ArgoCD UI: `argocd.127.0.0.1.nip.io` (local development) -- Server runs in insecure mode (`--insecure`, `--disable-auth`) - suitable for local/dev clusters -- Traefik routes to multiple services via Kubernetes Ingress +### GitOps Workflow -## Common Commands +``` +Developer commits code → CI/CD builds image → Updates helm-values → ArgoCD syncs → Deployed to cluster +``` + +**Learn more**: [GitOps Architecture - GitOps Workflow](docs/GITOPS-ARCHITECTURE.md#gitops-workflow) + +--- + +## 🔧 Common Tasks + +### Deploy a New Application + +**See detailed guide**: [Developer Guide - Deploying Your First Application](docs/DEVELOPER-GUIDE.md#deploying-your-first-application) + +**Quick version**: +1. 
Create `apps/myapp.yaml` (ArgoCD Application manifest) +2. Create `helm-values/myapp/values.yaml` (configuration) +3. Create sealed secrets if needed +4. Commit and push - ArgoCD auto-syncs! + +### Update an Existing Application + +**See detailed guide**: [Developer Guide - Updating an Existing Application](docs/DEVELOPER-GUIDE.md#updating-an-existing-application) + +**Quick version**: +- **Update code**: Push to app repo β†’ CI/CD updates image tag in helm-values +- **Update config**: Edit `helm-values/myapp/values.yaml` β†’ commit β†’ push + +### Manage Secrets + +**See detailed guide**: [Developer Guide - Working with Secrets](docs/DEVELOPER-GUIDE.md#working-with-secrets) -### Bootstrap the Cluster ```bash +# Create plain secret +kubectl create secret generic myapp-creds \ + --from-literal=KEY=value \ + --dry-run=client -o yaml > private/myapp-creds.yaml + +# Seal it +kubeseal --format=yaml --cert=pub-cert.pem \ + < private/myapp-creds.yaml > secrets/myapp-creds-sealed.yaml + +# Commit sealed version +git add secrets/myapp-creds-sealed.yaml +git commit -m "Add myapp credentials" +git push +``` + +### Bootstrap Cluster + +**See detailed guide**: [Operations Runbook - Cluster Bootstrap](docs/OPERATIONS-RUNBOOK.md#cluster-bootstrap) + +```bash +# Initialize new cluster ./bootstrap.sh -``` -This runs the `Bootstrap()` function which calls `ArgoCd()` to install ArgoCD using Helm. 
-### Monitor ArgoCD Applications +# Verify +kubectl get applications -n argocd +kubectl get pods --all-namespaces +``` + +--- + +## 🛠️ Quick Reference + +### Monitor Applications + ```bash -# View all ArgoCD applications +# List all ArgoCD applications kubectl get applications -n argocd # Watch sync status kubectl get applications -n argocd -w -# Describe a specific application -kubectl describe app -n argocd +# Check specific application +kubectl describe application myapp -n argocd + +# View application logs +kubectl logs <pod-name> -n myapp ``` -### Manage ArgoCD +### Access UIs + ```bash -# Port forward to access UI +# ArgoCD UI kubectl port-forward svc/argocd-server -n argocd 8080:443 +# Access: https://localhost:8080 (no auth required) -# Access at: https://localhost:8080 (admin auth disabled in dev) +# Grafana +kubectl port-forward -n monitoring svc/grafana 3000:80 +# Access: http://localhost:3000 + +# Prometheus +kubectl port-forward -n monitoring svc/prometheus-server 9090:80 +# Access: http://localhost:9090 ``` -### Check Secret Syncing +### Troubleshooting + ```bash -# Verify Kyverno policy is applied -kubectl get clusterpolicy sync-secret-with-multi-clone +# Check pod status +kubectl get pods -n myapp -# Check if secrets are synced to a namespace -kubectl get secrets -n +# View pod logs +kubectl logs <pod-name> -n myapp + +# Check pod events +kubectl describe pod <pod-name> -n myapp + +# Check ArgoCD sync errors +kubectl describe application myapp -n argocd + +# Force a hard refresh (auto-sync then applies any drift it finds) +kubectl patch application myapp -n argocd \ + --type merge -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}' ``` -### Deploy Changes -- Changes to YAML files in `apps/`, `infra/`, `**/values/`, or `cluster-resources/` are automatically synced by ArgoCD -- Push changes to the GitHub repository for them to be reflected -- ArgoCD reconciliation happens every 60s (`timeout.reconciliation: 60s`) -- Each application has a 5-minute sync timeout to prevent stalled deployments +**Full 
troubleshooting guide**: [Developer Guide - Troubleshooting](docs/DEVELOPER-GUIDE.md#troubleshooting) -### Review Helm Values -Application-specific Helm value overrides are in `**/values/` and referenced within each Application's Helm configuration. Each application manifest uses both external value files and inline overrides where needed. +--- -### Application Organization & Sync Ordering -- Infrastructure applications use `argocd.argoproj.io/sync-wave` annotations for ordered deployment -- Kyverno (sync-wave: 0) deploys before cluster-resources (sync-wave: 1) to ensure policies are ready -- All applications have resource requests and limits configured to prevent resource starvation -- Applications are labeled with `app.kubernetes.io/part-of` to indicate their component type (platform, monitoring-stack, application) +## πŸ” Security -## Important Notes +### Secret Management +- βœ… Sealed Secrets for Git storage +- βœ… Kyverno auto-clones secrets to namespaces +- ❌ Never commit plain secrets -- **No admin auth in development**: ArgoCD has `admin.enabled: "false"` - suitable for local/dev only -- **Insecure server mode**: `--insecure` and `--disable-auth` flags are set - not for production -- **Folder organization**: - - `infra/` contains infrastructure/platform components (Traefik, Cert-Manager, Prometheus, Grafana, Loki, etc.) 
- - `apps/` is reserved for business applications (currently empty) -- **Replica counts**: Traefik runs 2 replicas; other services run 1 replica -- **Retry policy**: All applications retry up to 5 times with exponential backoff (max 3m timeout per application) -- **Ignore replica scaling**: Deployments ignore replica count differences to allow HPA/manual scaling -- **Sync validation**: All applications validate manifests before applying (`Validate=true`) -- **Server-side apply**: All applications use `ServerSideApply=true` for safer field ownership tracking +### Network Security +- βœ… All traffic TLS-encrypted (Let's Encrypt) +- βœ… HTTP β†’ HTTPS redirect +- βœ… Traefik IngressRoute per application -## Development Tips +### Policy Enforcement +- βœ… Kyverno policies for security +- βœ… Default namespace blocked +- βœ… Bare pods not allowed +- βœ… Optional authentication sidecar injection -- **Check ArgoCD logs**: `kubectl logs -n argocd deployment/argocd-application-controller` -- **Validate YAML**: Files are validated server-side (`Validate=true`) before applying -- **Resource tracking**: Uses annotation-based method (`application.resourceTrackingMethod: annotation`) -- **Modify applications**: Edit the corresponding YAML in `infra/` and push to trigger sync -- **Add new services**: Create a new Application YAML in `apps/` following the pattern of existing ones, then it will be auto-discovered by the app-of-apps -- **Application folder naming**: Infrastructure components are in `infra/`; `apps/` is reserved for business applications +**Learn more**: [GitOps Architecture - Security Model](docs/GITOPS-ARCHITECTURE.md#security-model) + +--- + +## πŸ“Š Infrastructure Components + +| Component | Purpose | Namespace | Replicas | +|-----------|---------|-----------|----------| +| **ArgoCD** | GitOps controller | `argocd` | 1 | +| **Traefik** | Ingress controller | `traefik` | 2 | +| **Cert-Manager** | TLS certificates | `cert-manager` | 1 | +| **Kyverno** | Policy 
engine | `kyverno` | 1 | +| **Sealed Secrets** | Secret encryption | `kube-system` | 1 | +| **Prometheus** | Metrics | `monitoring` | 1 | +| **Grafana** | Dashboards | `monitoring` | 1 | +| **Loki** | Logs | `monitoring` | 1 | +| **Fluent-Bit** | Log shipping | `monitoring` | DaemonSet | +| **Trivy** | Vulnerability scanning | `trivy-system` | 1 | + +**Full specs**: [Technical Reference - Infrastructure Components](docs/REFERENCE.md#infrastructure-components) + +--- + +## 🌐 Domains & Networking + +- **Local development**: `*.127.0.0.1.nip.io` +- **Production**: `*.forteapps.net` +- **DNS**: Manual configuration (contact platform team) +- **TLS**: Automatic via Let's Encrypt + +--- + +## πŸ“– Key Concepts + +### App-of-Apps Pattern +`_app-of-apps.yaml` is the root Application that manages all other Applications in `infra/`. Each YAML in `infra/` becomes a child Application managed by ArgoCD. + +### Multi-Source Pattern +Applications reference both: +1. **Helm charts** from `forte-helm` (templates) +2. **Values** from `helm-values` (configuration) + +This separates reusable templates from environment-specific config. 
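A minimal sketch of how the two sources fit together in one Application spec is shown below. The repository URLs, the `forteapp` path, and the `myapp/values.yaml` file follow the examples in the Developer Guide; `targetRevision` is an assumption.

```yaml
# Fragment of an Application spec (sketch); targetRevision values are assumed.
spec:
  sources:
    - repoURL: https://github.com/snothub/forte-helm
      targetRevision: HEAD
      path: forteapp                      # reusable chart templates
      helm:
        valueFiles:
          - $values/myapp/values.yaml     # $values resolves to the source with ref: values
    - repoURL: git@github.com:fortedigital/helm-values.git
      targetRevision: HEAD
      ref: values                         # names this source so the chart can reference it
```

Because the second source has a `ref` and no `path`, ArgoCD uses it only to supply files to the first source rather than rendering it as manifests of its own.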
+ +### Sync Waves +Applications deploy in order using `argocd.argoproj.io/sync-wave`: +- Wave `-1`: Namespaces +- Wave `0`: Kyverno (policies) +- Wave `1`: Infrastructure +- Wave `2+`: Applications + +### Auto-Sync & Self-Heal +- **Auto-Sync**: ArgoCD automatically deploys Git changes (60s polling) +- **Self-Heal**: Manual cluster changes are reverted to match Git +- **Prune**: Deleted resources in Git are removed from cluster + +**Learn more**: [GitOps Architecture - GitOps Workflow](docs/GITOPS-ARCHITECTURE.md#gitops-workflow) + +--- + +## βš™οΈ Configuration + +### ArgoCD Settings +- **Reconciliation**: Every 60 seconds +- **Sync timeout**: 5 minutes per application +- **Retry policy**: 5 attempts with exponential backoff +- **Authentication**: Disabled (internal use only) + +### Application Defaults +- **Auto-sync**: Enabled +- **Self-heal**: Enabled +- **Prune**: Enabled +- **Validation**: Server-side validation enabled +- **Server-side apply**: Enabled + +**Full configuration**: [Technical Reference - ArgoCD Configuration](docs/REFERENCE.md#argocd-configuration) + +--- + +## πŸ†˜ Getting Help + +### Documentation +1. **Start here**: [Documentation Index](docs/README.md) +2. **For development**: [Developer Guide](docs/DEVELOPER-GUIDE.md) +3. **For operations**: [Operations Runbook](docs/OPERATIONS-RUNBOOK.md) +4. **For reference**: [Technical Reference](docs/REFERENCE.md) + +### Support +- **Slack**: #platform-support +- **Issues**: Contact platform team +- **Emergencies**: Escalate via Slack + +### Common Questions + +| Question | Answer | +|----------|--------| +| How do I deploy an app? | [Developer Guide - Deploying Your First Application](docs/DEVELOPER-GUIDE.md#deploying-your-first-application) | +| How do I manage secrets? | [Developer Guide - Working with Secrets](docs/DEVELOPER-GUIDE.md#working-with-secrets) | +| App won't sync? | [Developer Guide - Troubleshooting](docs/DEVELOPER-GUIDE.md#troubleshooting) | +| How do I bootstrap a cluster? 
| [Operations Runbook - Cluster Bootstrap](docs/OPERATIONS-RUNBOOK.md#cluster-bootstrap) | +| Where are the logs? | [Operations Runbook - Monitoring & Alerting](docs/OPERATIONS-RUNBOOK.md#monitoring--alerting) | + +--- + +## 🀝 Contributing + +### Adding a New Application +1. Read [Developer Guide - Deploying Your First Application](docs/DEVELOPER-GUIDE.md#deploying-your-first-application) +2. Create ArgoCD Application manifest in `apps/` +3. Create Helm values in `helm-values/` +4. Create sealed secrets if needed +5. Commit and push - ArgoCD handles the rest! + +### Modifying Infrastructure +1. Read [Operations Runbook](docs/OPERATIONS-RUNBOOK.md) +2. Update relevant files in `infra/` or `cluster-resources/` +3. Test changes in isolated namespace if possible +4. Commit and push +5. Monitor sync status in Slack/ArgoCD UI + +### Updating Documentation +Documentation lives in `docs/`. To update: +1. Edit relevant markdown file +2. Update "Last Updated" date +3. Submit PR or push directly +4. 
Notify team of significant changes + +--- + +## 📝 Notes + +### Current Environment +- **Provider**: UpCloud Managed Kubernetes +- **Environment**: Production (internal use only) +- **Cluster**: Single cluster +- **Auth**: Disabled for ArgoCD (internal access) +- **Backup**: None (cluster rebuildable via GitOps) + +### Known Limitations +- No automated backups (yet) +- Secret rotation not automated +- Single cluster (no multi-cluster setup) +- DNS management is manual + +**Future improvements**: See [Operations Runbook - Disaster Recovery](docs/OPERATIONS-RUNBOOK.md#disaster-recovery) + +--- + +## 📚 Additional Resources + +### External Documentation +- [ArgoCD Documentation](https://argo-cd.readthedocs.io/) +- [Kyverno Documentation](https://kyverno.io/docs/) +- [Traefik Documentation](https://doc.traefik.io/traefik/) +- [Cert-Manager Documentation](https://cert-manager.io/docs/) +- [Sealed Secrets](https://github.com/bitnami-labs/sealed-secrets) + +### Related Repositories +- [forte-helm](https://github.com/snothub/forte-helm) - Helm chart templates +- [helm-values](https://github.com/fortedigital/helm-values) - Application values + +--- + +## 📄 License + +Internal use only. Not for public distribution. + +--- + +## 👥 Maintainers + +**Platform Team** +- Contact: #platform-support on Slack +- Issues: Create issue in repository or contact team directly + +--- + +**Last Updated**: 2026-03-16 +**Documentation Version**: 1.0.0 + +**🚀 Ready to get started? 
Check out the [Documentation Index](docs/README.md)!** diff --git a/docs/DEVELOPER-GUIDE.md b/docs/DEVELOPER-GUIDE.md new file mode 100644 index 0000000..15953e3 --- /dev/null +++ b/docs/DEVELOPER-GUIDE.md @@ -0,0 +1,1089 @@ +# Developer Onboarding Guide + +## Table of Contents +- [Getting Started](#getting-started) +- [Prerequisites](#prerequisites) +- [Local Development Setup](#local-development-setup) +- [Understanding the Workflow](#understanding-the-workflow) +- [Deploying Your First Application](#deploying-your-first-application) +- [Updating an Existing Application](#updating-an-existing-application) +- [Working with Secrets](#working-with-secrets) +- [Troubleshooting](#troubleshooting) +- [Best Practices](#best-practices) + +--- + +## Getting Started + +Welcome! This guide will help you understand how to develop and deploy applications on our Kubernetes cluster using GitOps principles powered by ArgoCD. + +### What You'll Learn +- How our GitOps architecture works +- How to deploy a new application +- How to update existing applications +- How to manage secrets securely +- Common troubleshooting techniques + +### Who This Guide Is For +- Developers deploying new applications +- Developers maintaining existing applications +- Team members who need to understand the deployment process + +--- + +## Prerequisites + +### Required Knowledge +- βœ… Basic Git workflow (clone, commit, push, pull) +- βœ… Docker basics (Dockerfile, building images) +- βœ… YAML syntax +- βœ… Basic understanding of Kubernetes concepts (pods, deployments, services) +- ⚠️ Helm knowledge (helpful but not required - templates are provided) + +### Required Tools + +Most developers **do NOT need kubectl access** to the cluster. You'll primarily work with Git repositories. + +If you do need cluster access, install: + +1. 
**kubectl** - Kubernetes CLI + ```bash + # macOS + brew install kubectl + + # Windows + choco install kubernetes-cli + + # Linux: download the latest stable binary, then install it onto PATH + curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" + sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl + ``` + +2. **kubeseal** - For sealing secrets + ```bash + # macOS + brew install kubeseal + + # Windows + choco install kubeseal + + # Linux + wget https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/kubeseal-0.24.0-linux-amd64.tar.gz + tar -xvzf kubeseal-0.24.0-linux-amd64.tar.gz + sudo mv kubeseal /usr/local/bin/ + ``` + +3. **Git** - Version control + ```bash + git --version # Should already be installed + ``` + +4. **Docker** - For local development + ```bash + # macOS/Windows: Install Docker Desktop + # Linux: Install Docker Engine + docker --version + ``` + +### Repository Access + +You'll need read/write access to these repositories: + +1. **sturdy-adventure** (Config repo) + ```bash + git clone https://github.com/snothub/sturdy-adventure.git + cd sturdy-adventure + ``` + +2. **helm-values** (Values repo) + ```bash + git clone git@github.com:fortedigital/helm-values.git + cd helm-values + ``` + +3. **forte-helm** (Chart repo - read-only for most developers) + ```bash + git clone https://github.com/snothub/forte-helm.git + cd forte-helm + ``` + +### Cluster Access (If Needed) + +If you need kubectl access, ask the platform team for: +- Kubeconfig file +- Cluster context setup instructions + +Save to `~/.kube/config` and verify: +```bash +kubectl cluster-info +kubectl get nodes +``` + +--- + +## Local Development Setup + +### 1. 
Clone the Repositories + +Set up a consistent folder structure: + +```bash +mkdir -p ~/dev/k8s +cd ~/dev/k8s + +# Clone repositories +git clone https://github.com/snothub/sturdy-adventure.git launchpad +git clone git@github.com:fortedigital/helm-values.git helm-prod-values +git clone https://github.com/snothub/forte-helm.git forte-helm + +# Your folder structure: +# ~/dev/k8s/ +# β”œβ”€β”€ launchpad/ (Config repo) +# β”œβ”€β”€ helm-prod-values/ (Values repo) +# └── forte-helm/ (Chart repo) +``` + +### 2. Local Development Environment + +Most applications use **Docker Compose** for local development: + +```bash +# In your application repository +docker-compose up + +# Or for frontend applications +npm install +npm run dev +``` + +**You DO NOT run applications locally on Kubernetes.** Use Docker Compose or native tooling (npm, python, etc.). + +### 3. Understanding the Deployment Flow + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Step 1: Develop Locally β”‚ +β”‚ - Write code in your application repository β”‚ +β”‚ - Test with Docker Compose or npm/python/etc. 
β”‚ +β”‚ - Build Docker image β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Step 2: CI/CD Pipeline (Automated) β”‚ +β”‚ - GitHub Actions builds image β”‚ +β”‚ - Pushes to container registry (GHCR, Docker Hub) β”‚ +β”‚ - Tags with version (e.g., v2.0.4) β”‚ +β”‚ - Updates helm-values repository with new tag β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Step 3: GitOps Sync (Automated) β”‚ +β”‚ - ArgoCD detects change in helm-values β”‚ +β”‚ - Pulls updated configuration β”‚ +β”‚ - Syncs to Kubernetes cluster β”‚ +β”‚ - Sends Slack notification on success/failure β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Key Insight**: You don't deploy directly. You push code, CI/CD builds it, and ArgoCD deploys it. 
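To make the hand-off concrete: assuming the `app.image.tag` layout used by the values files in this guide, the only thing the pipeline needs to change in `helm-values` is one line. The version shown is the example tag from the diagram above, not a real release.

```yaml
# helm-values/myapp/values.yaml after a CI run (illustrative values only)
app:
  image:
    repository: ghcr.io/fortedigital/myapp
    tag: v2.0.4   # bumped by the pipeline; this commit is what ArgoCD detects and syncs
```

Keeping the deployed version in Git this way means a rollback is a revert of that commit rather than a manual cluster operation.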
+ +--- + +## Understanding the Workflow + +### Three-Repository Pattern + +Our setup uses three repositories: + +| Repository | Purpose | You Edit | +|------------|---------|----------| +| **forte-helm** | Helm chart templates (generic, reusable) | ❌ Rarely | +| **helm-values** | Application configuration (image tag, env vars) | ✅ Sometimes | +| **sturdy-adventure** | ArgoCD Applications (what gets deployed) | ✅ Yes (for new apps) | + +### Example: Deploying "myapp" + +#### Repository: `forte-helm` (Chart Templates) +```yaml +# forteapp/templates/deployment.yaml +# Generic template used by ALL apps (abridged) +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ .Values.app.name }} +spec: + containers: + - name: app + image: "{{ .Values.app.image.repository }}:{{ .Values.app.image.tag }}" + env: + - name: PORT + value: {{ .Values.app.port | quote }} +``` + +#### Repository: `helm-values` (Your App Config) +```yaml +# myapp/values.yaml +# Your app's specific configuration +app: + image: + repository: ghcr.io/fortedigital/myapp + tag: v1.0.0 # CI/CD updates this + port: 3000 + extraEnv: + - name: API_URL + value: https://api.example.com +``` + +#### Repository: `sturdy-adventure` (ArgoCD Application) +```yaml +# apps/myapp.yaml +# Tells ArgoCD to deploy your app +apiVersion: argoproj.io/v1alpha1 +kind: Application +metadata: + name: myapp + namespace: argocd +spec: + sources: + - repoURL: https://github.com/snothub/forte-helm + path: forteapp + helm: + valueFiles: + - $values/myapp/values.yaml + + - repoURL: git@github.com:fortedigital/helm-values.git + ref: values + + destination: + server: https://kubernetes.default.svc + namespace: myapp + + syncPolicy: + automated: + prune: true + selfHeal: true + syncOptions: + - CreateNamespace=true +``` + +--- + +## Deploying Your First Application + +### Scenario: You've Built a New Application + +Let's deploy a new Node.js application called "hello-world". 
+ +### Step 1: Prepare Your Application Repository + +Ensure your app repository has: + +1. **Dockerfile** + ```dockerfile + FROM node:18-alpine + WORKDIR /app + COPY package*.json ./ + RUN npm ci --only=production + COPY . . + EXPOSE 3000 + CMD ["node", "server.js"] + ``` + +2. **GitHub Actions Workflow** (`.github/workflows/deploy.yml`) + ```yaml + name: Build and Deploy + + on: + push: + branches: [ main ] + + jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + + - name: Set version + id: version + run: echo "VERSION=v$(date +%Y%m%d-%H%M%S)" >> $GITHUB_OUTPUT + + - name: Build and push Docker image + run: | + echo ${{ secrets.GITHUB_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin + docker build -t ghcr.io/fortedigital/hello-world:${{ steps.version.outputs.VERSION }} . + docker push ghcr.io/fortedigital/hello-world:${{ steps.version.outputs.VERSION }} + + - name: Update helm-values + run: | + git clone git@github.com:fortedigital/helm-values.git + cd helm-values + mkdir -p hello-world + cat > hello-world/values.yaml < private/myapp-credentials.yaml +``` + +**DO NOT commit this file!** It's in `private/` which is Git-ignored. 
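For context on why the file must stay out of Git, this is roughly what `kubectl create secret generic ... --dry-run=client -o yaml` writes. The key names and literals are borrowed from the "Updating a Secret" example in this guide and are placeholders, not real credentials.

```yaml
# private/myapp-credentials.yaml (sketch of the generated manifest)
apiVersion: v1
kind: Secret
metadata:
  name: myapp-credentials
data:
  # Values are only base64-encoded, not encrypted: anyone can decode them.
  API_KEY: bmV3LWtleS1oZXJl          # base64 of "new-key-here"
  DB_PASSWORD: bmV3LXBhc3N3b3Jk      # base64 of "new-password"
```

Sealing replaces these readable values with ciphertext that only the controller in the cluster can decrypt, which is what makes the sealed file safe to commit.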
+ +#### Step 2: Seal the Secret + +Get the public certificate (one-time setup): + +```bash +# Fetch public cert from cluster +kubeseal --fetch-cert \ + --controller-name=sealed-secrets-controller \ + --controller-namespace=kube-system \ + > pub-cert.pem +``` + +Seal your secret: + +```bash +kubeseal --format=yaml \ + --cert=pub-cert.pem \ + < private/myapp-credentials.yaml \ + > secrets/myapp-credentials-sealed.yaml +``` + +#### Step 3: Commit Sealed Secret + +```bash +git add secrets/myapp-credentials-sealed.yaml +git commit -m "Add myapp credentials (sealed)" +git push +``` + +#### Step 4: Reference Secret in Application + +Update your `helm-values/myapp/values.yaml`: + +```yaml +app: + envSecretName: "myapp-credentials" # References the SealedSecret +``` + +Commit and push: +```bash +cd ~/dev/k8s/helm-prod-values +git add myapp/values.yaml +git commit -m "Reference myapp credentials" +git push +``` + +### Updating a Secret + +To update an existing secret: + +```bash +# 1. Create new version of secret +kubectl create secret generic myapp-credentials \ + --from-literal=API_KEY=new-key-here \ + --from-literal=DB_PASSWORD=new-password \ + --dry-run=client -o yaml > private/myapp-credentials.yaml + +# 2. Seal it +kubeseal --format=yaml \ + --cert=pub-cert.pem \ + < private/myapp-credentials.yaml \ + > secrets/myapp-credentials-sealed.yaml + +# 3. Commit sealed version +git add secrets/myapp-credentials-sealed.yaml +git commit -m "Update myapp credentials" +git push + +# 4. 
Restart pods to pick up new secret +kubectl rollout restart deployment myapp -n myapp +``` + +### Secret Best Practices + +✅ **DO**: +- Store secrets in `private/` folder locally +- Always seal secrets before committing +- Delete plain secrets after sealing +- Use meaningful secret names +- Document what each secret contains + +❌ **DON'T**: +- Commit plain secrets to Git +- Share secrets via Slack/email +- Hard-code secrets in code +- Use the same secret across multiple environments +- Store secrets in Docker images + +### Where Secrets Are Stored + +``` +┌────────────────────────────────────────────────────────────┐ +│ Location │ Content │ Committed? │ +├──────────────────────────┼────────────────────┼────────────┤ +│ private/ │ Plain secrets │ ❌ NO │ +│ secrets/ │ Sealed secrets │ ✅ YES │ +│ Kubernetes cluster │ Unsealed secrets │ N/A │ +└────────────────────────────────────────────────────────────┘ +``` + +**Sealed Secrets Controller** in the cluster decrypts sealed secrets automatically. + +--- + +## Troubleshooting + +### Application Not Deploying + +#### Problem: Application stuck in "Syncing" state + +**Check ArgoCD status:** +```bash +kubectl get application myapp -n argocd -o yaml +``` + +Look for errors in `status.conditions`.
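Reading the full YAML can be noisy. As a sketch, the helper below prints only the recorded conditions using kubectl's JSONPath output. The function name `show_app_conditions` is hypothetical, and it assumes Applications live in the `argocd` namespace as in the examples in this guide.

```shell
# Hypothetical helper: print each condition ArgoCD recorded for an
# Application as "<type>: <message>", one per line.
show_app_conditions() {
  local app="$1"
  local ns="${2:-argocd}"
  kubectl get application "$app" -n "$ns" \
    -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'
}
```

`show_app_conditions myapp` prints nothing when no conditions are recorded; otherwise each line names the condition type and its message.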
+ +**Common causes:** +- ❌ Image doesn't exist or is not accessible +- ❌ Invalid YAML syntax +- ❌ Resource quota exceeded +- ❌ Namespace conflicts +- ❌ Invalid Helm values + +**Solutions:** +```bash +# Check image exists +docker pull ghcr.io/fortedigital/myapp:v1.0.0 + +# Validate YAML syntax +kubectl apply --dry-run=client -f apps/myapp.yaml + +# Check ArgoCD logs +kubectl logs -n argocd deployment/argocd-application-controller | grep myapp +``` + +#### Problem: Pods crashing (CrashLoopBackOff) + +**Check pod logs:** +```bash +kubectl get pods -n myapp +kubectl logs -n myapp <pod-name> +kubectl describe pod <pod-name> -n myapp +``` + +**Common causes:** +- ❌ Application error (check logs) +- ❌ Missing environment variables +- ❌ Incorrect port configuration +- ❌ Missing secrets +- ❌ Insufficient resources + +**Solutions:** +```bash +# Check environment variables +kubectl exec -n myapp <pod-name> -- env + +# Check if secrets exist +kubectl get secrets -n myapp + +# Increase resources in helm-values +vim ~/dev/k8s/helm-prod-values/myapp/values.yaml +``` + +#### Problem: Application not accessible via domain + +**Check ingress:** +```bash +kubectl get ingressroute -n myapp +kubectl describe ingressroute myapp -n myapp +``` + +**Common causes:** +- ❌ DNS not configured +- ❌ TLS certificate not issued +- ❌ Incorrect domain in values.yaml +- ❌ Traefik not routing correctly + +**Solutions:** +```bash +# Check certificate +kubectl get certificate -n myapp + +# Check cert-manager logs +kubectl logs -n cert-manager deployment/cert-manager + +# Verify domain configuration +grep host ~/dev/k8s/helm-prod-values/myapp/values.yaml + +# Test with port-forward +kubectl port-forward -n myapp service/myapp 8080:3000 +curl http://localhost:8080 +``` + +### Secret Issues + +#### Problem: Secret not found + +**Check if SealedSecret exists:** +```bash +kubectl get sealedsecret -n myapp +kubectl get secret -n myapp +``` + +**Solutions:** +```bash +# Check if secret is in Git +ls -l
secrets/myapp-credentials-sealed.yaml + +# Re-apply sealed secret +kubectl apply -f secrets/myapp-credentials-sealed.yaml + +# Check sealed-secrets-controller logs +kubectl logs -n kube-system deployment/sealed-secrets-controller +``` + +#### Problem: Secret exists but pods can't access it + +**Check pod events:** +```bash +kubectl describe pod <pod-name> -n myapp +``` + +Look for: `Error: secret "myapp-credentials" not found` + +**Solutions:** +```bash +# Verify secret name in values.yaml matches actual secret +grep envSecretName ~/dev/k8s/helm-prod-values/myapp/values.yaml +kubectl get secrets -n myapp + +# Restart pods +kubectl rollout restart deployment myapp -n myapp +``` + +### Sync Failures + +#### Problem: ArgoCD shows "Out of Sync" + +**Manual sync:** +```bash +# Using kubectl +kubectl patch application myapp -n argocd --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{"syncStrategy":{"hook":{}}}}}' + +# Or via ArgoCD UI +# Click "Sync" button in UI +``` + +**Check what's different:** +```bash +kubectl get application myapp -n argocd -o yaml +``` + +Compare `status.sync.comparedTo` (the source and destination ArgoCD compared against) with the per-resource `status` fields under `status.resources` to see which objects are `OutOfSync`. + +#### Problem: Sync succeeds but application is "Degraded" + +**Check resource health:** +```bash +kubectl get application myapp -n argocd -o jsonpath='{.status.resources[*].health}' +``` + +**Common causes:** +- ❌ Pods not ready +- ❌ Deployments not at desired replica count +- ❌ Jobs failed + +**Solutions:** +```bash +# Check all resources in namespace +kubectl get all -n myapp + +# Check pod events +kubectl get events -n myapp --sort-by='.lastTimestamp' +``` + +### Getting Help + +If you're stuck: + +1. **Check Slack notifications** - Error details are often in sync failure messages +2. **Check ArgoCD UI** - Visual representation of what's wrong +3. **Ask platform team** - They have full cluster access and can debug further +4.
**Check documentation** - [Operations Runbook](OPERATIONS-RUNBOOK.md) has more troubleshooting + +--- + +## Best Practices + +### Development Workflow + +✅ **DO**: +- Develop and test locally with Docker Compose +- Use semantic versioning for releases +- Write descriptive commit messages +- Test changes in a separate namespace first (if possible) +- Monitor Slack for deployment notifications +- Document environment variables and configuration + +❌ **DON'T**: +- Push directly to production without testing +- Use `latest` tag for Docker images +- Bypass CI/CD for "quick fixes" +- Hard-code configuration values +- Ignore deployment failures + +### Configuration Management + +✅ **DO**: +- Keep configuration in `helm-values` repository +- Use environment variables for config +- Document what each value does +- Use reasonable resource limits +- Enable ingress and TLS for public services + +❌ **DON'T**: +- Hard-code config in application code +- Over-allocate resources (wastes money) +- Under-allocate resources (causes crashes) +- Use HTTP for production services + +### Secret Management + +✅ **DO**: +- Use kubeseal for all secrets +- Store plain secrets in password manager +- Rotate secrets regularly +- Use different secrets per environment +- Document what each secret contains + +❌ **DON'T**: +- Commit plain secrets +- Share secrets in Slack/email +- Reuse secrets across apps +- Log secrets in application code + +### Git Workflow + +✅ **DO**: +- Use feature branches for changes +- Write clear commit messages +- Use pull requests for review +- Keep commits atomic and focused +- Tag releases in application repos + +❌ **DON'T**: +- Push directly to `main` without review (for config repos) +- Make multiple unrelated changes in one commit +- Use vague commit messages ("fix", "update") +- Force-push to main branches + +--- + +## Quick Reference + +### Common Commands + +```bash +# Check application status +kubectl get application myapp -n argocd + +# View application
details +kubectl describe application myapp -n argocd + +# Check pods +kubectl get pods -n myapp + +# View pod logs +kubectl logs -n myapp <pod-name> + +# Restart deployment +kubectl rollout restart deployment myapp -n myapp + +# Port-forward to service +kubectl port-forward -n myapp service/myapp 8080:3000 + +# Create secret +kubectl create secret generic myapp-credentials \ + --from-literal=KEY=value \ + --dry-run=client -o yaml > private/myapp-credentials.yaml + +# Seal secret +kubeseal --format=yaml \ + --cert=pub-cert.pem \ + < private/myapp-credentials.yaml \ + > secrets/myapp-credentials-sealed.yaml +``` + +### Repository Locations + +```bash +# Config repository +cd ~/dev/k8s/launchpad + +# Helm values repository +cd ~/dev/k8s/helm-prod-values + +# Helm charts repository +cd ~/dev/k8s/forte-helm +``` + +### File Paths + +```bash +# New application manifest +~/dev/k8s/launchpad/apps/myapp.yaml + +# Application values +~/dev/k8s/helm-prod-values/myapp/values.yaml + +# Sealed secrets +~/dev/k8s/launchpad/secrets/myapp-credentials-sealed.yaml + +# Plain secrets (local only) +~/dev/k8s/launchpad/private/myapp-credentials.yaml +``` + +--- + +## Next Steps + +Now that you understand the basics: + +1. ✅ Deploy your first application (follow steps above) +2. 📖 Read the [Operations Runbook](OPERATIONS-RUNBOOK.md) for common tasks +3. 📖 Review [Technical Reference](REFERENCE.md) for detailed component docs +4. 📖 Understand [GitOps Architecture](GITOPS-ARCHITECTURE.md) for the big picture +5. 🚀 Start contributing!
+ +--- + +**Questions?** +- Slack: #platform-support +- Docs: [Full documentation index](README.md) +- Help: Contact platform team + +**Last Updated**: 2026-03-16 diff --git a/docs/GITOPS-ARCHITECTURE.md b/docs/GITOPS-ARCHITECTURE.md new file mode 100644 index 0000000..203dfc8 --- /dev/null +++ b/docs/GITOPS-ARCHITECTURE.md @@ -0,0 +1,640 @@ +# GitOps Architecture & Repository Guide + +## Table of Contents +- [Overview](#overview) +- [Architecture Diagram](#architecture-diagram) +- [Repository Structure](#repository-structure) +- [GitOps Workflow](#gitops-workflow) +- [CI/CD Pipeline](#cicd-pipeline) +- [Security Model](#security-model) + +--- + +## Overview + +This Kubernetes cluster uses a **GitOps approach** powered by **ArgoCD**, where Git repositories serve as the single source of truth for both infrastructure and application deployments. The cluster is running on **UpCloud Managed Kubernetes** but is designed to be cloud-agnostic. + +### Key Characteristics +- **Environment**: Production (internal use only) +- **Cluster Type**: Single cluster, single environment +- **GitOps Tool**: ArgoCD +- **Deployment Pattern**: App-of-Apps +- **Secret Management**: Sealed Secrets (kubeseal) +- **Ingress**: Traefik with Let's Encrypt TLS +- **Monitoring**: Prometheus + Grafana + Loki + Fluent-Bit +- **Policy Engine**: Kyverno +- **Notifications**: Slack integration for sync status + +--- + +## Architecture Diagram + +``` +┌──────────────────────────────────────────────────────────────────────┐ +│ Developer Workflow │ +└──────────────────────────────────────────────────────────────────────┘ + │ + ▼
+┌─────────────────────┐ ┌──────────────────┐ ┌─────────────────┐ +│ Application Code │ │ Helm Charts │ │ Helm Values │ +│ Repositories │──────│ Repository │──────│ Repository │ +│ (Source Code) │ │ (Templates) │ │ (Config/Env) │ +└─────────────────────┘ └──────────────────┘ └─────────────────┘ + │ │ │ + │ │ │ + GitHub Actions │ │ + Build & Push Image │ │ + │ │ │ + │ │ │ + └────────► Update image tag ─┴──────────────────────────┘ + in helm-values │ + │ + ▼ + ┌────────────────────────────────┐ + │ Config Repository │ + │ (ArgoCD Applications) │ + │ github.com/snothub/ │ + │ sturdy-adventure │ + └────────────────────────────────┘ + │ + │ + ArgoCD monitors & syncs + │ + ▼ + ┌────────────────────────────────┐ + │ Kubernetes Cluster │ + │ (UpCloud Managed) │ + │ │ + │ ┌──────────────────────────┐ │ + │ │ ArgoCD │ │ + │ │ (GitOps Controller) │ │ + │ └──────────────────────────┘ │ + │ │ + │ ┌──────────────────────────┐ │ + │ │ Infrastructure Layer │ │ + │ │ - Traefik (Ingress) │ │ + │ │ - Cert-Manager (TLS) │ │ + │ │ - Kyverno (Policies) │ │ + │ │ - Sealed Secrets │ │ + │
└──────────────────────────┘ │ + │ │ + │ ┌──────────────────────────┐ │ + │ │ Monitoring Stack │ │ + │ │ - Prometheus │ │ + │ │ - Grafana │ │ + │ │ - Loki │ │ + │ │ - Fluent-Bit │ │ + │ └──────────────────────────┘ │ + │ │ + │ ┌──────────────────────────┐ │ + │ │ Application Layer │ │ + │ │ - mcp10x │ │ + │ │ - musicman │ │ + │ │ - dot-ai-stack │ │ + │ │ - argo-mcp │ │ + │ └──────────────────────────┘ │ + └────────────────────────────────┘ + │ + │ + ▼ + ┌──────────────────┐ + │ Slack Channel │ + │ (Notifications) │ + └──────────────────┘ +``` + +--- + +## Repository Structure + +### 1.
**Config Repository** (Current Repo) +**Repository**: `https://github.com/snothub/sturdy-adventure.git` +**Purpose**: GitOps configuration - ArgoCD Applications and cluster resources +**Location**: `C:\dev\k8s\launchpad` + +``` +sturdy-adventure/ +├── bootstrap.sh # Cluster initialization script +├── _app-of-apps.yaml # Root ArgoCD Application (App-of-Apps pattern) +│ +├── infra/ # Infrastructure ArgoCD Applications +│ ├── enterprise-apps.yaml # Parent app managing all apps in apps/ +│ ├── cluster-resources-application.yaml +│ ├── traefik-application.yaml +│ ├── cert-manager-application.yaml +│ ├── kyverno.yaml +│ ├── kyverno-policies.yaml +│ ├── prometheus.yaml +│ ├── grafana.yaml +│ ├── loki.yaml +│ ├── fluent-bit.yaml +│ ├── trivy.yaml +│ ├── sealedsecrets.yaml +│ ├── secrets.yaml +│ └── values/ # Helm value overrides for infra +│ ├── argocd-values.yaml +│ ├── prometheus-values.yaml +│ ├── grafana-values.yaml +│ ├── loki-values.yaml +│ └── fluent-bit-values.yaml +│ +├── apps/ # Business Application ArgoCD manifests +│ ├── mcp10x.yaml # MCP 10X application +│ ├── musicman.yaml # Music Man application +│ ├── dot-ai-stack.yaml # Dot AI Stack +│ └── argo-mcp.yaml # ArgoCD MCP server +│ +├── cluster-resources/ # Cluster-wide Kubernetes resources +│ ├── cert-manager-namespace.yaml +│ ├── secrets-namespace.yaml +│ ├── letsencrypt-issuer.yaml # Let's Encrypt ClusterIssuer +│ ├── kyverno-config.yaml +│ ├── argocd-notifications-secret-sealed.yaml +│ ├── snothub-repo-credentials-sealed.yaml +│ ├── forte10x-repo-credentials-sealed.yaml +│ ├── mcp10x-repo-credentials-sealed.yaml +│ └── policies/ # Kyverno policies +│ ├── deployment-verifier.yaml +│ ├── label-checker.yaml +│ ├──
bare-pod-cleaner.yaml +│ ├── replicaset-cleaner.yaml +│ ├── default-ns-blocker.yaml +│ ├── secret-cloner.yaml +│ └── auth-sidecar-injector.yaml +│ +├── secrets/ # Application secrets (sealed) +│ ├── argocd-mcp-credentials.yaml +│ ├── dot-ai-secrets.yaml +│ ├── mcp10x-credentials-sealed.yaml +│ └── musicman-credentials.yaml +│ +├── private/ # Local-only files (NOT in Git) +│ ├── *.yaml # Unsealed secrets +│ └── *.sh # Helper scripts +│ +└── docs/ # Documentation + ├── GITOPS-ARCHITECTURE.md # This file + ├── DEVELOPER-GUIDE.md + ├── OPERATIONS-RUNBOOK.md + └── REFERENCE.md +``` + +**Key Points**: +- `_app-of-apps.yaml` is the root Application that ArgoCD monitors +- `infra/enterprise-apps.yaml` auto-discovers all apps in `apps/` folder +- Changes pushed to this repo trigger automatic syncs in ArgoCD +- `private/` folder contains local-only files (Git-ignored) + +--- + +### 2. **Helm Charts Repository** +**Repository**: `https://github.com/snothub/forte-helm` +**Purpose**: Reusable Helm chart templates for Forte applications +**Location**: `C:\dev\k8s\forte-helm` + +``` +forte-helm/ +└── forteapp/ # Generic Forte application chart + ├── Chart.yaml # Chart metadata (v0.1.0) + ├── values.yaml # Default values (base template) + ├── templates/ + │ ├── _helpers.tpl # Template helpers + │ ├── namespace.yaml + │ ├── deployment.yaml # Main app deployment + │ ├── service.yaml + │ ├── ingressroute.yaml # Traefik IngressRoute + │ ├── certificate.yaml # Cert-Manager Certificate + │ ├── configmap.yaml + │ ├── secret-auth-tokens.yaml + │ ├── hpa.yaml # Horizontal Pod Autoscaler + │ ├── database-statefulset.yaml # Optional PostgreSQL DB + │ └── database-service.yaml + └── README.md +``` + +**Key Points**: +- Single generic chart (`forteapp`) used by all Forte applications +- Supports
optional PostgreSQL database (StatefulSet) +- Configurable authentication (token-based or OIDC) +- Traefik IngressRoute with automatic TLS via Cert-Manager +- Designed for microservices with similar patterns + +--- + +### 3. **Helm Values Repository** +**Repository**: `git@github.com:fortedigital/helm-values.git` +**Purpose**: Environment-specific configuration for each application +**Location**: `C:\dev\k8s\helm-prod-values` + +``` +helm-prod-values/ +├── mcp10x/ +│ └── values.yaml # MCP 10X configuration +├── musicman/ +│ └── values.yaml # Music Man configuration +├── mcpcoder/ +│ └── values.yaml # MCP Coder configuration +└── argocd-mcp/ + └── values.yaml # ArgoCD MCP configuration +``` + +**Key Points**: +- Each app has its own folder with `values.yaml` +- Contains environment-specific settings (image tags, env vars, resources, etc.) +- Referenced by ArgoCD Applications using multi-source pattern +- Image tags are updated here by CI/CD pipelines +- Secrets are referenced by name (actual secrets stored as SealedSecrets) + +**Example** (`mcp10x/values.yaml`): +```yaml +app: + image: + repository: ghcr.io/fortedigital/10x + tag: 2.0.4 # Updated by CI/CD + extraEnv: + - name: PORT + value: "3000" + envSecretName: "app-credentials" # References SealedSecret + +ingress: + enabled: true + host: mcp10x.forteapps.net # Public domain +``` + +--- + +### 4. **Application Source Code Repositories** +**Purpose**: Application source code with CI/CD pipelines +**Examples**: Various private repositories + +**Typical Structure**: +``` +app-repository/ +├── src/ # Application source code +├── Dockerfile # Container build definition +├── .github/ +│ └── workflows/ +│ └── build-and-deploy.yml # GitHub Actions workflow +└── package.json / requirements.txt # Dependencies +``` + +**CI/CD Workflow** (GitHub Actions): +1. Trigger on push to `main` branch +2. Build Docker image +3. Tag with version (e.g., `v2.0.4`) +4.
Push to container registry (GHCR, Docker Hub, etc.) +5. Update image tag in `helm-values` repository +6. ArgoCD detects change and syncs automatically + +--- + +## GitOps Workflow + +### The App-of-Apps Pattern + +``` +_app-of-apps.yaml (Root) + │ + ├── infrastructure-apps (manages infra/) + │ ├── cluster-resources-application + │ ├── traefik-application + │ ├── cert-manager-application + │ ├── kyverno + │ ├── prometheus + │ ├── grafana + │ └── ... (other infra apps) + │ + └── enterprise-apps (manages apps/) + ├── mcp10x + ├── musicman + ├── dot-ai-stack + └── argo-mcp +``` + +**How It Works**: +1. Bootstrap script installs ArgoCD and applies `_app-of-apps.yaml` +2. ArgoCD creates the root Application which monitors `infra/` folder +3. Each YAML in `infra/` becomes a child Application +4. `enterprise-apps.yaml` monitors `apps/` folder and auto-discovers applications +5. ArgoCD continuously syncs (every 60s) and auto-heals drift + +### Sync Waves & Ordering + +Applications deploy in order using `argocd.argoproj.io/sync-wave` annotations: + +``` +Wave -1: Namespaces (created first) +Wave 0: Kyverno (policies ready before resources) +Wave 1: Cluster resources, infrastructure apps +Wave 2+: Business applications +``` + +Example: +```yaml +metadata: + annotations: + argocd.argoproj.io/sync-wave: "1" +``` + +### Multi-Source Pattern + +Applications like `mcp10x` and `musicman` use multiple sources: + +```yaml +spec: + sources: + - repoURL: https://github.com/snothub/forte-helm + path: forteapp # Helm chart templates + helm: + valueFiles: + - $values/mcp10x/values.yaml # Reference to second source + + - repoURL: git@github.com:fortedigital/helm-values.git + targetRevision: HEAD + ref: values # Named reference +``` + +**Benefits**: +- Chart templates separated from configuration +- Single chart reused across all apps +- Easy to update all apps by changing the chart +- Environment-specific values
isolated in separate repo + +--- + +## CI/CD Pipeline + +### Continuous Integration + +**Application Repositories** contain GitHub Actions workflows: + +```yaml +name: Build and Deploy + +on: + push: + branches: [ main ] + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + + - name: Build Docker image + run: docker build -t ghcr.io/fortedigital/app:$VERSION . + + - name: Push to registry + run: docker push ghcr.io/fortedigital/app:$VERSION + + - name: Update Helm values + run: | + git clone git@github.com:fortedigital/helm-values.git + cd helm-values/app + sed -i "s/tag: .*/tag: $VERSION/" values.yaml + git commit -am "Update app to $VERSION" + git push +``` + +### Continuous Deployment + +**ArgoCD** automatically syncs when changes are detected: + +1. **Config Repo Change**: + - Developer updates `apps/myapp.yaml` + - Pushes to `sturdy-adventure` repo + - ArgoCD detects change (60s reconciliation) + - Syncs application to cluster + +2. **Helm Values Change**: + - CI/CD updates `helm-values/myapp/values.yaml` + - ArgoCD detects change + - Pulls new Helm chart with updated values + - Applies to cluster + +3. 
**Sync Policy**: + ```yaml + syncPolicy: + automated: + prune: true # Remove deleted resources + selfHeal: true # Revert manual changes + retry: + limit: 5 # Retry up to 5 times + backoff: + duration: 5s + maxDuration: 3m + ``` + +### Deployment Validation + +Before a change is applied, it passes these checks (resource quotas and Kyverno policies are enforced by Kubernetes admission rather than by ArgoCD itself): +- ✅ Validates YAML syntax +- ✅ Checks Kubernetes schema +- ✅ Runs server-side dry-run +- ✅ Verifies resource quotas +- ✅ Applies Kyverno policies + +After applying: +- ✅ Waits for resources to become healthy +- ✅ Sends Slack notification (success/failure) +- ✅ Tracks sync status in UI + +--- + +## Security Model + +### Secret Management + +**Sealed Secrets** encrypt secrets for safe Git storage: + +```bash +# Developer creates plain secret locally +kubectl create secret generic app-creds \ + --from-literal=API_KEY=secret123 \ + --dry-run=client -o yaml > private/app-creds.yaml + +# Seal the secret using kubeseal +kubeseal --format=yaml \ + --cert=pub-cert.pem \ + < private/app-creds.yaml \ + > secrets/app-creds-sealed.yaml + +# Commit sealed secret to Git +git add secrets/app-creds-sealed.yaml +git commit -m "Add app credentials" +``` + +**Storage**: +- ✅ Sealed secrets committed to Git +- ❌ Plain secrets kept in `private/` (Git-ignored) or discarded +- ⚠️ Secret rotation process not yet established + +### Kyverno Policies + +**Policy Engine** enforces security rules: + +1. **Secret Cloning**: Automatically clones secrets to new namespaces + ```yaml + # cluster-resources/policies/secret-cloner.yaml + # Secrets labeled "allowedToBeCloned: true" are synced + ``` + +2. **Default Namespace Blocker**: Prevents use of `default` namespace +3. **Bare Pod Cleaner**: Removes pods without controllers (Deployments/StatefulSets) +4. **Deployment Verifier**: Ensures pods have proper controllers +5.
**Auth Sidecar Injector**: Injects authentication proxy based on annotations + +### Repository Access + +**Private Repository Credentials** stored as SealedSecrets: + +```yaml +# cluster-resources/snothub-repo-credentials-sealed.yaml +# cluster-resources/forte10x-repo-credentials-sealed.yaml +``` + +ArgoCD uses these to access private Helm values repositories. + +### Network Security + +**Traefik Ingress** with TLS: +- All HTTP traffic redirects to HTTPS +- Let's Encrypt automatic certificate renewal +- Cert-Manager manages certificate lifecycle +- Per-application IngressRoutes with dedicated certificates + +### Authentication + +**Application-Level Auth** (optional): +- Token-based authentication (static tokens) +- OIDC integration (Keycloak, Okta, etc.) +- Auth sidecar injected via Kyverno policy +- Tokens stored in SealedSecrets + +Example: +```yaml +# In deployment.yaml template +annotations: + policies.forteapps.io/auth: "true" + policies.forteapps.io/auth-token-secret-name: "app-tokens" +``` + +--- + +## Monitoring & Observability + +### Stack Components + +1. **Prometheus**: Metrics collection and storage +2. **Grafana**: Metrics visualization and dashboards +3. **Loki**: Log aggregation +4. **Fluent-Bit**: Log shipping from pods to Loki +5. **Trivy**: Container vulnerability scanning + +### Slack Notifications + +All ArgoCD applications send notifications to shared Slack channel: + +```yaml +metadata: + annotations: + notifications.argoproj.io/subscribe.on-sync-succeeded.slack: "" + notifications.argoproj.io/subscribe.on-sync-failed.slack: "" + notifications.argoproj.io/subscribe.on-degraded.slack: "" +``` + +Notifications include: +- ✅ Sync succeeded +- ❌ Sync failed +- ⚠️ Application degraded + +--- + +## Disaster Recovery + +### Cluster Rebuild + +**Current State**: No backup routines exist yet. Cluster can be rebuilt from Git. + +**Rebuild Process**: +1. Provision new Kubernetes cluster +2. Clone `sturdy-adventure` repository +3.
Run `./bootstrap.sh` +4. ArgoCD installs and syncs all applications +5. Manually recreate unsealed secrets and seal them + +**Data Loss**: +- Currently: Data loss is acceptable (internal use) +- Future: One stateful application may require backup strategy + +### GitOps Advantages for DR + +✅ **Infrastructure as Code**: Entire cluster defined in Git +✅ **Reproducible**: Cluster can be rebuilt identically +✅ **Auditable**: All changes tracked in Git history +✅ **Rollback**: Easy to revert to previous Git commit +✅ **Multi-Cluster**: Same config can deploy to multiple clusters + +--- + +## Best Practices + +### Repository Organization + +✅ **DO**: +- Separate infrastructure (`infra/`) from applications (`apps/`) +- Use sync waves to control deployment order +- Keep secrets in `private/` folder (Git-ignored) +- Commit only sealed secrets to Git +- Use multi-source pattern for chart/values separation + +❌ **DON'T**: +- Commit plain secrets to Git +- Mix infrastructure and application configs +- Hard-code environment-specific values in charts +- Manually modify resources in cluster (use Git) + +### GitOps Workflow + +✅ **DO**: +- All changes through Git (single source of truth) +- Use PR reviews for production changes +- Test changes in isolated namespaces first +- Monitor ArgoCD sync status +- Respond to Slack notifications + +❌ **DON'T**: +- Use `kubectl apply` directly (breaks GitOps) +- Ignore sync failures +- Bypass ArgoCD for "quick fixes" +- Edit resources in place (`kubectl edit`) + +### Application Development + +✅ **DO**: +- Follow the `forteapp` chart pattern +- Use semantic versioning for image tags +- Update helm-values via CI/CD +- Test locally with Docker Compose +- Document environment variables + +❌ **DON'T**: +- Use `latest` image tag +- Hard-code configuration in code +- Skip local testing +- Deploy untested images to production + +--- + +## Next Steps + +📖 Continue to: +- **[Developer Guide](DEVELOPER-GUIDE.md)** - Learn how to
deploy and manage applications +- **[Operations Runbook](OPERATIONS-RUNBOOK.md)** - Common operational tasks +- **[Technical Reference](REFERENCE.md)** - Detailed component documentation + +--- + +**Last Updated**: 2026-03-16 +**Maintained By**: Platform Team +**Questions?**: Contact #platform-support on Slack diff --git a/docs/OPERATIONS-RUNBOOK.md b/docs/OPERATIONS-RUNBOOK.md new file mode 100644 index 0000000..55b379b --- /dev/null +++ b/docs/OPERATIONS-RUNBOOK.md @@ -0,0 +1,1217 @@ +# Operations Runbook + +## Table of Contents +- [Overview](#overview) +- [Cluster Bootstrap](#cluster-bootstrap) +- [Day-to-Day Operations](#day-to-day-operations) +- [Application Management](#application-management) +- [Secret Management](#secret-management) +- [Monitoring & Alerting](#monitoring--alerting) +- [Troubleshooting](#troubleshooting) +- [Disaster Recovery](#disaster-recovery) +- [Maintenance Procedures](#maintenance-procedures) + +--- + +## Overview + +This runbook provides operational procedures for maintaining the Kubernetes cluster and managing applications. It's intended for platform engineers and operators with full cluster access. + +### Operator Prerequisites + +- ✅ Full kubectl access to cluster +- ✅ Write access to all Git repositories +- ✅ ArgoCD UI access +- ✅ Slack notifications configured +- ✅ Understanding of Kubernetes concepts + +--- + +## Cluster Bootstrap + +### Initial Cluster Setup + +Bootstrap a new cluster from scratch: + +#### Prerequisites + +1. **Kubernetes cluster running** (UpCloud or any K8s cluster) +2. **kubectl configured** with admin access +3. **Repositories cloned** locally + +```bash +# Verify cluster access +kubectl cluster-info +kubectl get nodes +``` + +#### Bootstrap Procedure + +```bash +# 1. Clone config repository +git clone https://github.com/snothub/sturdy-adventure.git +cd sturdy-adventure + +# 2. Set cluster name (optional) +export CLUSTER_NAME="prod-cluster-01" + +# 3.
Run bootstrap script
+./bootstrap.sh
+```
+
+**What Happens:**
+1. ✅ Installs ArgoCD via Helm
+2. ✅ Configures ArgoCD with custom values
+3. ✅ Applies root App-of-Apps manifest
+4. ✅ ArgoCD automatically syncs all applications
+5. ✅ Infrastructure and apps deploy in waves
+
+#### Verify Bootstrap
+
+```bash
+# Wait for ArgoCD to be ready
+kubectl wait --for=condition=available --timeout=300s \
+  deployment/argocd-server -n argocd
+
+# Check ArgoCD applications
+kubectl get applications -n argocd
+
+# Expected output: infrastructure-apps, enterprise-apps, and all child apps
+```
+
+#### Post-Bootstrap Steps
+
+1. **Configure DNS** for ingress domains:
+   - `argocd.127.0.0.1.nip.io` (local dev)
+   - `*.forteapps.net` (production)
+
+2. **Verify Let's Encrypt certificates**:
+   ```bash
+   kubectl get certificate --all-namespaces
+   kubectl get clusterissuer
+   ```
+
+3. **Check Kyverno policies**:
+   ```bash
+   kubectl get clusterpolicy
+   ```
+
+4. **Verify monitoring stack**:
+   ```bash
+   kubectl get pods -n monitoring
+   ```
+
+5. **Test Slack notifications** by triggering a sync
+
+---
+
+## Day-to-Day Operations
+
+### Monitoring ArgoCD Sync Status
+
+#### Via Slack
+
+All applications send notifications to a shared Slack channel:
+- ✅ `on-sync-succeeded` - Deployment succeeded
+- ❌ `on-sync-failed` - Deployment failed
+- ⚠️ `on-degraded` - Application unhealthy
+
+#### Via CLI
+
+```bash
+# List all applications
+kubectl get applications -n argocd
+
+# Watch application status
+kubectl get applications -n argocd -w
+
+# Get detailed status
+kubectl describe application myapp -n argocd
+```
+
+#### Via ArgoCD UI
+
+```bash
+# Port forward to UI
+kubectl port-forward svc/argocd-server -n argocd 8080:443
+
+# Access: https://localhost:8080
+# No login required (insecure mode for internal use)
+```
+
+### Checking Application Health
+
+```bash
+# Quick health check for all apps
+kubectl get applications -n argocd \
+  -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status
+
+# Expected output:
+# NAME                  SYNC     HEALTH
+# infrastructure-apps   Synced   Healthy
+# enterprise-apps       Synced   Healthy
+# mcp10x                Synced   Healthy
+# musicman              Synced   Healthy
+```
+
+### Manual Sync
+
+Force ArgoCD to re-read Git and sync an application:
+
+```bash
+# Trigger a hard refresh (re-fetches Git and re-renders manifests;
+# with auto-sync enabled, any drift found is then synced)
+kubectl patch application myapp -n argocd \
+  --type merge \
+  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
+
+# Or sync explicitly via ArgoCD CLI (if installed)
+argocd app sync myapp
+```
+
+### Pausing Auto-Sync
+
+Temporarily disable automatic syncing:
+
+```bash
+# Edit application
+kubectl edit application myapp -n argocd
+
+# In the editor, set automated to null
+spec:
+  syncPolicy:
+    automated: null  # Disable auto-sync
+
+# Re-enable later by restoring the block
+spec:
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+```
+
+---
+
+## Application Management
+
+### Deploying a New Application
+
+See [Developer Guide](DEVELOPER-GUIDE.md#deploying-your-first-application) for detailed steps.
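
After the manifests are pushed, one way to confirm the rollout from a terminal is to poll the Application's sync and health fields, the same two fields shown under "Checking Application Health". The following is a minimal sketch, not part of the repository: the function names `is_synced_and_healthy` and `wait_for_app`, the 5-second poll interval, and the 300-second default timeout are all assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with this repo): wait until an ArgoCD
# Application reports Synced/Healthy, or give up after a timeout.
set -euo pipefail

# Returns 0 only when sync status is "Synced" and health status is "Healthy".
is_synced_and_healthy() {
  [ "$1" = "Synced" ] && [ "$2" = "Healthy" ]
}

# Usage: wait_for_app <app-name> [timeout-seconds]
wait_for_app() {
  local app="$1" timeout="${2:-300}" waited=0 sync="" health=""
  while [ "$waited" -lt "$timeout" ]; do
    # Same status fields as the custom-columns health check in this runbook
    sync=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.status}')
    health=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.health.status}')
    if is_synced_and_healthy "$sync" "$health"; then
      echo "$app is Synced and Healthy"
      return 0
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "Timed out waiting for $app (sync=$sync health=$health)" >&2
  return 1
}
```

Sourcing the file and running `wait_for_app myapp 600` after `git push` would block until the app settles or ten minutes pass; Slack notifications remain the primary signal, so this is only a convenience for scripted checks.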
+
+**Quick checklist:**
+- [ ] Create `helm-values/myapp/values.yaml`
+- [ ] Create `apps/myapp.yaml` in config repo
+- [ ] Create SealedSecret if needed
+- [ ] Commit and push changes
+- [ ] Verify sync in Slack/ArgoCD
+- [ ] Configure DNS for domain
+- [ ] Test application accessibility
+
+### Removing an Application
+
+#### Safe Removal Procedure
+
+```bash
+# 1. Delete ArgoCD Application (with cascade)
+# Note: resources are only cascade-deleted if the Application carries the
+# resources-finalizer.argocd.argoproj.io finalizer
+kubectl delete application myapp -n argocd
+
+# This will:
+# - Remove application from ArgoCD
+# - Delete all Kubernetes resources (cascade)
+# - Remove namespace
+
+# 2. Clean up Git repositories
+cd ~/dev/k8s/launchpad
+git rm apps/myapp.yaml
+git commit -m "Remove myapp application"
+git push
+
+cd ~/dev/k8s/helm-prod-values
+git rm -r myapp/
+git commit -m "Remove myapp values"
+git push
+
+# 3. Remove sealed secrets (if any)
+cd ~/dev/k8s/launchpad
+git rm secrets/myapp-credentials-sealed.yaml
+git commit -m "Remove myapp secrets"
+git push
+```
+
+#### Removal Without Cascade
+
+To remove from ArgoCD but keep resources running:
+
+```bash
+# Strip the finalizers first, then delete the application (no cascade)
+kubectl patch application myapp -n argocd \
+  -p '{"metadata":{"finalizers":[]}}' --type merge
+kubectl delete application myapp -n argocd
+
+# Resources remain in cluster but are no longer managed
+```
+
+### Scaling Applications
+
+#### Manual Scaling
+
+```bash
+# Scale deployment directly
+kubectl scale deployment myapp -n myapp --replicas=3
+
+# Note: If selfHeal is enabled, this will be reverted
+```
+
+#### GitOps Scaling
+
+Update `helm-values/myapp/values.yaml`:
+
+```yaml
+app:
+  replicaCount: 3  # Change from 1 to 3
+```
+
+Commit and push - ArgoCD will sync.
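
To confirm that a GitOps scaling change actually landed, the replica count reported by the cluster can be compared with the value committed to Git. The sketch below is an illustration under assumptions: the helper names `replicas_ready` and `verify_scale` are hypothetical, and the 120-second rollout timeout is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): check that a Deployment reached
# the replica count committed in helm-values after ArgoCD synced.
set -euo pipefail

# Returns 0 when ready >= desired. An empty "ready" value (the field is absent
# while no pod is ready) is treated as 0.
replicas_ready() {
  local desired="$1" ready="${2:-0}"
  [ "$ready" -ge "$desired" ]
}

# Usage: verify_scale <deployment> <namespace> <desired-replicas>
verify_scale() {
  local app="$1" ns="$2" desired="$3" ready
  # Block until the rollout finishes or the timeout expires
  kubectl rollout status "deployment/$app" -n "$ns" --timeout=120s
  ready=$(kubectl get deployment "$app" -n "$ns" -o jsonpath='{.status.readyReplicas}')
  if replicas_ready "$desired" "$ready"; then
    echo "$app: ${ready}/${desired} replicas ready"
  else
    echo "$app: only ${ready:-0}/${desired} replicas ready" >&2
    return 1
  fi
}
```

For the example above, `verify_scale myapp myapp 3` would be run once the Slack `on-sync-succeeded` notification arrives.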
+
+#### Auto-Scaling (HPA)
+
+Enable Horizontal Pod Autoscaler:
+
+```yaml
+# In helm-values/myapp/values.yaml
+app:
+  hpa:
+    enabled: true
+    minReplicas: 2
+    maxReplicas: 10
+    targetCPUUtilizationPercentage: 70
+```
+
+**Note:** With HPA enabled, the autoscaler owns the replica count. Avoid pinning `replicaCount` in the values file, and keep `/spec/replicas` in the Application's ignore list so that ArgoCD does not report (and self-heal) the HPA's changes as drift:
+
+```yaml
+# In apps/myapp.yaml
+ignoreDifferences:
+- group: apps
+  kind: Deployment
+  jsonPointers:
+  - /spec/replicas  # Keep this line while HPA is enabled
+```
+
+### Rolling Back Deployments
+
+#### Option 1: Git Revert
+
+```bash
+# Find the commit that introduced the bad change
+cd ~/dev/k8s/helm-prod-values
+git log --oneline myapp/values.yaml
+
+# Revert that commit
+git revert <commit-hash>
+git push
+
+# ArgoCD will sync the rollback
+```
+
+#### Option 2: Manual Rollback
+
+```bash
+# Rollback to previous revision
+kubectl rollout undo deployment myapp -n myapp
+
+# Note: This will be reverted by ArgoCD selfHeal
+# Make permanent by updating Git
+```
+
+#### Option 3: Change Image Tag
+
+```bash
+# Edit helm-values
+cd ~/dev/k8s/helm-prod-values
+vim myapp/values.yaml
+
+# Change image tag to previous version
+app:
+  image:
+    tag: v1.0.0  # Roll back from v1.0.1
+
+# Commit and push
+git add myapp/values.yaml
+git commit -m "Rollback myapp to v1.0.0"
+git push
+```
+
+### Resource Updates
+
+#### Update Resource Limits
+
+```yaml
+# In helm-values/myapp/values.yaml
+app:
+  resources:
+    requests:
+      cpu: 200m      # Increased from 100m
+      memory: 512Mi  # Increased from 256Mi
+    limits:
+      cpu: 1000m
+      memory: 2Gi
+```
+
+#### Enable Database
+
+```yaml
+# In helm-values/myapp/values.yaml
+db:
+  enabled: true
+  persistence:
+    size: 10Gi  # Increase storage
+```
+
+---
+
+## Secret Management
+
+### Creating Secrets
+
+#### Step 1: Get Public Certificate
+
+```bash
+# Fetch sealed-secrets public cert (one-time)
+kubeseal --fetch-cert \
+  --controller-name=sealed-secrets-controller \
+  --controller-namespace=kube-system \
+  > pub-cert.pem
+
+# Save this certificate for future use
+```
+
+#### Step 2: Create Plain Secret
+
+```bash
+# Method 1: From literal values
+kubectl create secret generic myapp-credentials \
+  --from-literal=API_KEY=secret123 \
+  --from-literal=DB_PASSWORD=pass456 \
+  --namespace=myapp \
+  --dry-run=client -o yaml > private/myapp-credentials.yaml
+
+# Method 2: From file
+kubectl create secret generic myapp-credentials \
+  --from-file=.env \
+  --namespace=myapp \
+  --dry-run=client -o yaml > private/myapp-credentials.yaml
+
+# Method 3: From multiple files
+kubectl create secret generic myapp-credentials \
+  --from-file=api-key.txt \
+  --from-file=db-password.txt \
+  --namespace=myapp \
+  --dry-run=client -o yaml > private/myapp-credentials.yaml
+```
+
+#### Step 3: Seal Secret
+
+```bash
+kubeseal --format=yaml \
+  --cert=pub-cert.pem \
+  --namespace=myapp \
+  < private/myapp-credentials.yaml \
+  > secrets/myapp-credentials-sealed.yaml
+```
+
+#### Step 4: Commit Sealed Secret
+
+```bash
+git add secrets/myapp-credentials-sealed.yaml
+git commit -m "Add myapp credentials"
+git push
+
+# Delete plain secret
+rm private/myapp-credentials.yaml
+```
+
+### Updating Secrets
+
+```bash
+# 1. Create new version
+kubectl create secret generic myapp-credentials \
+  --from-literal=API_KEY=new-secret-key \
+  --from-literal=DB_PASSWORD=new-password \
+  --namespace=myapp \
+  --dry-run=client -o yaml > private/myapp-credentials.yaml
+
+# 2. Seal it
+kubeseal --format=yaml \
+  --cert=pub-cert.pem \
+  --namespace=myapp \
+  < private/myapp-credentials.yaml \
+  > secrets/myapp-credentials-sealed.yaml
+
+# 3. Commit
+git add secrets/myapp-credentials-sealed.yaml
+git commit -m "Update myapp credentials"
+git push
+
+# 4. Restart pods to pick up new secret
+kubectl rollout restart deployment myapp -n myapp
+
+# 5. 
Delete plain secret
+rm private/myapp-credentials.yaml
+```
+
+### Viewing Secrets (Unsealed)
+
+```bash
+# List secrets in namespace
+kubectl get secrets -n myapp
+
+# Describe secret (doesn't show values)
+kubectl describe secret myapp-credentials -n myapp
+
+# View secret values (base64 encoded)
+kubectl get secret myapp-credentials -n myapp -o yaml
+
+# Decode secret value
+kubectl get secret myapp-credentials -n myapp \
+  -o jsonpath='{.data.API_KEY}' | base64 -d
+```
+
+### Secret Cloning (Kyverno)
+
+Secrets labeled `allowedToBeCloned: "true"` in the `secrets` namespace are automatically cloned to new namespaces.
+
+```yaml
+# Example: secrets-namespace.yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: shared-credentials
+  namespace: secrets
+  labels:
+    allowedToBeCloned: "true"
+type: Opaque
+data:
+  API_KEY: <base64-encoded-value>
+```
+
+When a new namespace is created, Kyverno automatically copies this secret.
+
+---
+
+## Monitoring & Alerting
+
+### Prometheus Metrics
+
+```bash
+# Port forward to Prometheus
+kubectl port-forward -n monitoring svc/prometheus-server 9090:80
+
+# Access: http://localhost:9090
+```
+
+**Common Queries:**
+```promql
+# CPU usage per pod
+sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
+
+# Memory usage per pod
+sum(container_memory_usage_bytes) by (pod)
+
+# Request rate per service
+rate(http_requests_total[5m])
+```
+
+### Grafana Dashboards
+
+```bash
+# Port forward to Grafana
+kubectl port-forward -n monitoring svc/grafana 3000:80
+
+# Access: http://localhost:3000
+```
+
+### Loki Logs
+
+```bash
+# Port forward to Loki
+kubectl port-forward -n monitoring svc/loki 3100:3100
+
+# Query logs from the last hour
+# (`start` expects a timestamp, so a relative window is passed via `since`)
+curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
+  --data-urlencode 'query={namespace="myapp"}' \
+  --data-urlencode 'since=1h' | jq
+```
+
+### Fluent-Bit Log Shipping
+
+Verify Fluent-Bit is shipping logs:
+
+```bash
+# Check Fluent-Bit pods
+kubectl get pods -n monitoring | grep fluent-bit
+
+# Check logs
+kubectl logs -n monitoring daemonset/fluent-bit
+
+# Verify Loki is receiving logs
+kubectl logs -n monitoring deployment/loki | grep "POST /loki/api/v1/push"
+```
+
+### Trivy Vulnerability Scanning
+
+```bash
+# Check Trivy scan results
+kubectl get vulnerabilityreports --all-namespaces
+
+# View reports for a specific namespace
+kubectl describe vulnerabilityreport -n myapp
+```
+
+### Slack Notifications
+
+All applications have Slack notifications enabled:
+
+```yaml
+metadata:
+  annotations:
+    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: ""
+    notifications.argoproj.io/subscribe.on-sync-failed.slack: ""
+    notifications.argoproj.io/subscribe.on-degraded.slack: ""
+```
+
+**Test Notification:**
+```bash
+# Trigger a hard refresh to test (a notification is sent only if a sync runs)
+kubectl patch application myapp -n argocd \
+  --type merge \
+  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
+```
+
+---
+
+## Troubleshooting
+
+### Application Won't Sync
+
+#### Check Application Status
+
+```bash
+kubectl describe application myapp -n argocd
+```
+
+Look for errors in:
+- `Status.Conditions`
+- `Status.OperationState`
+
+#### Common Issues
+
+**Issue 1: Image Pull Error**
+```bash
+# Error: ErrImagePull, ImagePullBackOff
+
+# Check if image exists
+docker pull ghcr.io/fortedigital/myapp:v1.0.0
+
+# Check image pull secrets
+kubectl get secrets -n myapp | grep regcred
+
+# Check pod events
+kubectl describe pod -n myapp
+```
+
+**Issue 2: Invalid YAML**
+```bash
+# Error: unable to decode manifest
+
+# Validate YAML locally
+kubectl apply --dry-run=client -f apps/myapp.yaml
+
+# Check ArgoCD application controller logs
+kubectl logs -n argocd deployment/argocd-application-controller | grep myapp
+```
+
+**Issue 3: Resource Quota Exceeded**
+```bash
+# Error: exceeded quota
+
+# Check namespace quotas
+kubectl get resourcequota -n myapp
+kubectl describe resourcequota -n myapp
+
+# Increase quota or reduce resource requests
+```
+
+### Pod Crashes
+
+#### CrashLoopBackOff
+
+```bash
+# Check pod 
status
+kubectl get pods -n myapp
+
+# View logs
+kubectl logs <pod-name> -n myapp
+kubectl logs <pod-name> -n myapp --previous  # Previous container
+
+# Check events
+kubectl describe pod -n myapp
+```
+
+**Common Causes:**
+- Application error (check logs)
+- Missing environment variables
+- Wrong port configuration
+- Missing secrets
+- Insufficient memory/CPU
+
+#### ImagePullBackOff
+
+```bash
+# Check image name
+kubectl get deployment myapp -n myapp -o yaml | grep image
+
+# Verify credentials
+kubectl get secret -n myapp
+```
+
+#### Pending
+
+```bash
+# Check why pod is pending
+kubectl describe pod -n myapp
+
+# Common reasons:
+# - Insufficient resources on nodes
+# - PVC not bound
+# - Node selector doesn't match
+```
+
+### Ingress / TLS Issues
+
+#### Application Not Accessible
+
+```bash
+# Check IngressRoute
+kubectl get ingressroute -n myapp
+kubectl describe ingressroute myapp -n myapp
+
+# Check Traefik
+kubectl get pods -n traefik
+kubectl logs -n traefik deployment/traefik
+
+# Test with port-forward
+kubectl port-forward -n myapp service/myapp 8080:3000
+curl http://localhost:8080
+```
+
+#### Certificate Issues
+
+```bash
+# Check certificates
+kubectl get certificate -n myapp
+kubectl describe certificate myapp-tls -n myapp
+
+# Check cert-manager
+kubectl get clusterissuer
+kubectl logs -n cert-manager deployment/cert-manager
+
+# Check Let's Encrypt challenges
+kubectl get challenges --all-namespaces
+```
+
+**Manual Certificate Renewal:**
+```bash
+# Delete and recreate certificate
+kubectl delete certificate myapp-tls -n myapp
+
+# Certificate will be automatically recreated
+```
+
+### Database Issues
+
+#### PostgreSQL Won't Start
+
+```bash
+# Check StatefulSet
+kubectl get statefulset -n myapp
+kubectl describe statefulset postgres -n myapp
+
+# Check PVC
+kubectl get pvc -n myapp
+kubectl describe pvc -n myapp
+
+# Check logs
+kubectl logs -n myapp postgres-0
+```
+
+#### Data Persistence
+
+```bash
+# Verify PVC is bound
+kubectl get pvc -n myapp
+
+# Check storage class
+kubectl get storageclass
+
+# Resize PVC (if supported)
+kubectl edit pvc postgres-data-postgres-0 -n myapp
+# Change: storage: 10Gi (from 5Gi)
+```
+
+### Kyverno Policy Issues
+
+#### Policy Violations
+
+```bash
+# List policies
+kubectl get clusterpolicy
+
+# Check policy reports
+kubectl get policyreport --all-namespaces
+
+# View specific policy
+kubectl describe clusterpolicy secret-cloner
+```
+
+#### Secret Not Cloned
+
+```bash
+# Check if secret has label
+kubectl get secret -n secrets --show-labels
+
+# Check Kyverno logs
+kubectl logs -n kyverno deployment/kyverno
+
+# Manually trigger by recreating namespace
+kubectl delete ns test-ns
+kubectl create ns test-ns
+```
+
+### ArgoCD Issues
+
+#### ArgoCD UI Not Accessible
+
+```bash
+# Check ArgoCD pods
+kubectl get pods -n argocd
+
+# Restart ArgoCD server
+kubectl rollout restart deployment argocd-server -n argocd
+
+# Port forward
+kubectl port-forward svc/argocd-server -n argocd 8080:443
+```
+
+#### Sync Takes Too Long
+
+```bash
+# Check application controller logs
+kubectl logs -n argocd deployment/argocd-application-controller
+
+# Raise the retry backoff cap (in apps/myapp.yaml)
+spec:
+  syncPolicy:
+    retry:
+      backoff:
+        maxDuration: 5m  # Increase from 3m
+```
+
+---
+
+## Disaster Recovery
+
+### Backup Strategy
+
+**Current State**: No automated backups
+
+**What Needs Backup**:
+- ❌ Cluster state (not backed up - recreate via GitOps)
+- ❌ Persistent volumes (currently not critical)
+- ✅ Git repositories (GitHub provides backup)
+- ⚠️ Secrets (sealed secrets in Git, unseal keys need safekeeping)
+
+### Cluster Rebuild
+
+**Scenario**: Complete cluster failure
+
+```bash
+# 1. Provision new Kubernetes cluster
+
+# 2. Configure kubectl
+kubectl config use-context new-cluster
+kubectl cluster-info
+
+# 3. Bootstrap cluster
+cd ~/dev/k8s/launchpad
+./bootstrap.sh
+
+# 4. Wait for ArgoCD to sync all applications
+kubectl get applications -n argocd -w
+
+# 5. 
Recreate any unsealed secrets (from password manager)
+# 6. Configure DNS for new cluster IPs
+# 7. Verify all applications are healthy
+```
+
+**Time Estimate**: 30-60 minutes
+
+**Data Loss**:
+- Ephemeral data: Lost
+- Database data: Lost (no backups currently)
+- Configuration: No loss (in Git)
+
+### Future Backup Plan
+
+**Recommended**:
+
+1. **Velero** for cluster backups
+   ```bash
+   helm install velero vmware-tanzu/velero \
+     --namespace velero \
+     --create-namespace \
+     --set configuration.provider=aws \
+     --set configuration.backupStorageLocation[0].bucket=cluster-backups
+   ```
+
+2. **PostgreSQL backups** via CronJob
+   ```yaml
+   # pg-backup-cronjob.yaml
+   kind: CronJob
+   spec:
+     schedule: "0 2 * * *"  # Daily at 2am
+     jobTemplate:
+       spec:
+         template:
+           spec:
+             containers:
+             - name: pg-dump
+               image: postgres:16-alpine
+               command:
+               - /bin/sh
+               - -c
+               - pg_dump -U $DB_USER -d $DB_NAME > /backup/dump-$(date +%Y%m%d).sql
+   ```
+
+3. **Sealed Secrets private key backup**
+   ```bash
+   # Backup sealed-secrets controller private key(s);
+   # the key secrets have generated names, so select them by label
+   kubectl get secret -n kube-system \
+     -l sealedsecrets.bitnami.com/sealed-secrets-key \
+     -o yaml > sealed-secrets-key-backup.yaml
+
+   # Store in secure location (password manager, vault)
+   ```
+
+---
+
+## Maintenance Procedures
+
+### Upgrading ArgoCD
+
+```bash
+# Check current version
+kubectl get deployment argocd-server -n argocd \
+  -o jsonpath='{.spec.template.spec.containers[0].image}'
+
+# Update version in values
+vim infra/values/argocd-values.yaml
+
+# Or upgrade via Helm directly
+helm upgrade argocd argo-cd \
+  --repo https://argoproj.github.io/argo-helm \
+  --namespace argocd \
+  --values infra/values/argocd-values.yaml \
+  --version 6.0.0  # New version
+
+# Verify
+kubectl get pods -n argocd
+```
+
+### Upgrading Kubernetes Version
+
+```bash
+# UpCloud: Upgrade via control panel or CLI
+
+# After upgrade, verify cluster
+kubectl version
+kubectl get nodes
+
+# Check for deprecated APIs
+kubectl api-resources
+
+# Update any deprecated resources in Git
+```
+
+### Rotating TLS Certificates
+
+Let's Encrypt certificates auto-renew, but if manual rotation is needed:
+
+```bash
+# Delete certificate to force renewal
+kubectl delete certificate myapp-tls -n myapp
+
+# Cert-manager will automatically recreate
+kubectl get certificate -n myapp -w
+```
+
+### Cleaning Up Old Resources
+
+```bash
+# List all namespaces
+kubectl get namespaces
+
+# Remove unused namespaces
+kubectl delete namespace old-app
+
+# Clean up ArgoCD applications
+kubectl get applications -n argocd
+kubectl delete application old-app -n argocd
+
+# Clean up old Docker images (on nodes)
+# SSH to nodes and run (assumes a Docker runtime; on containerd nodes
+# use `crictl rmi --prune` instead):
+docker image prune -a --filter "until=720h"  # 30 days
+```
+
+### DNS Management
+
+**Adding New Subdomain**:
+
+1. Add DNS A record pointing to Traefik LoadBalancer IP
+   ```bash
+   # Get LoadBalancer IP
+   kubectl get svc -n traefik traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
+   ```
+
+2. Add to DNS provider:
+   ```
+   myapp.forteapps.net  A  <LoadBalancer-IP>
+   ```
+
+3. Verify DNS propagation:
+   ```bash
+   nslookup myapp.forteapps.net
+   dig myapp.forteapps.net
+   ```
+
+### Monitoring Resource Usage
+
+```bash
+# Node resource usage
+kubectl top nodes
+
+# Pod resource usage
+kubectl top pods --all-namespaces
+
+# Identify resource hogs
+kubectl top pods --all-namespaces --sort-by=memory
+kubectl top pods --all-namespaces --sort-by=cpu
+```
+
+---
+
+## Advanced Operations
+
+### Adding a New Infrastructure Component
+
+Example: Adding Redis
+
+```bash
+# 1. 
Create application manifest
+cat > infra/redis-application.yaml < $OUTPUT_FILE
+
+echo "Sealed secret created: $OUTPUT_FILE"
+echo "Remember to delete: $SECRET_FILE"
+```
+
+---
+
+## Checklist Templates
+
+### New Application Deployment Checklist
+
+- [ ] Application code repository created
+- [ ] Dockerfile created and tested
+- [ ] GitHub Actions workflow configured
+- [ ] Helm values created in `helm-prod-values/`
+- [ ] ArgoCD application manifest created in `apps/`
+- [ ] Secrets created and sealed
+- [ ] DNS record added for domain
+- [ ] Application synced successfully
+- [ ] Health check passed
+- [ ] Slack notification received
+- [ ] Application accessible via domain
+- [ ] Monitoring configured
+- [ ] Documentation updated
+
+### Incident Response Checklist
+
+- [ ] Incident identified (Slack alert, monitoring)
+- [ ] Severity assessed
+- [ ] Incident channel created
+- [ ] Initial investigation (logs, metrics, events)
+- [ ] Root cause identified
+- [ ] Mitigation applied
+- [ ] Verification of fix
+- [ ] Post-mortem scheduled
+- [ ] Documentation updated
+
+---
+
+**Last Updated**: 2026-03-16
+**Maintained By**: Platform Team
+**Emergency Contact**: #platform-support on Slack
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..9e374a9
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,327 @@
+# Kubernetes Cluster Documentation
+
+Welcome to the comprehensive documentation for our Kubernetes cluster GitOps setup. This documentation covers architecture, development workflows, operations, and technical references.
+
+## 📚 Documentation Index
+
+### 1. [GitOps Architecture & Repository Guide](GITOPS-ARCHITECTURE.md)
+**Start here to understand the system**
+
+Learn about:
+- Overall architecture and design decisions
+- Repository structure and relationships
+- GitOps workflow and deployment patterns
+- CI/CD pipeline integration
+- Security model and best practices
+
+**Best for**: Understanding how everything fits together, architectural decisions, and the big picture.
+
+---
+
+### 2. [Developer Onboarding Guide](DEVELOPER-GUIDE.md)
+**For developers deploying and maintaining applications**
+
+Learn how to:
+- Set up your local development environment
+- Deploy your first application
+- Update existing applications
+- Manage secrets securely
+- Troubleshoot common issues
+- Follow development best practices
+
+**Best for**: New developers joining the team, deploying applications, day-to-day development workflows.
+
+---
+
+### 3. [Operations Runbook](OPERATIONS-RUNBOOK.md)
+**For platform engineers and operators**
+
+Learn how to:
+- Bootstrap a new cluster
+- Monitor and maintain applications
+- Manage infrastructure components
+- Handle secrets and credentials
+- Troubleshoot production issues
+- Perform disaster recovery
+- Execute maintenance procedures
+
+**Best for**: Platform team members, SRE tasks, incident response, cluster maintenance.
+
+---
+
+### 4. [Technical Reference](REFERENCE.md)
+**Detailed technical specifications**
+
+Reference for:
+- Component specifications and versions
+- Helm chart templates and values
+- ArgoCD configuration options
+- Kyverno policy definitions
+- API endpoints and interfaces
+- Configuration schemas
+- Complete glossary
+
+**Best for**: Looking up specific configuration options, understanding component details, API references.
+
+---
+
+## 🚀 Quick Start
+
+### For New Developers
+1. Read [GitOps Architecture](GITOPS-ARCHITECTURE.md#overview) to understand the system
+2. 
Follow [Developer Guide - Prerequisites](DEVELOPER-GUIDE.md#prerequisites) to set up your environment
+3. Deploy your first application using [Deploying Your First Application](DEVELOPER-GUIDE.md#deploying-your-first-application)
+
+### For Platform Engineers
+1. Understand the architecture in [GitOps Architecture](GITOPS-ARCHITECTURE.md)
+2. Learn cluster bootstrap in [Operations Runbook - Cluster Bootstrap](OPERATIONS-RUNBOOK.md#cluster-bootstrap)
+3. Review [Day-to-Day Operations](OPERATIONS-RUNBOOK.md#day-to-day-operations) procedures
+
+### For Troubleshooting
+1. Check [Developer Guide - Troubleshooting](DEVELOPER-GUIDE.md#troubleshooting) for common developer issues
+2. Check [Operations Runbook - Troubleshooting](OPERATIONS-RUNBOOK.md#troubleshooting) for operational issues
+3. Consult [Technical Reference](REFERENCE.md) for configuration details
+
+---
+
+## 🗺️ Documentation Map
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                     GITOPS ARCHITECTURE                     │
+│    (System Overview, Repositories, Workflows, Security)     │
+└─────────────────────────────────────────────────────────────┘
+                               │
+                   ┌───────────┴───────────┐
+                   │                       │
+                   ▼                       ▼
+          ┌──────────────────┐  ┌──────────────────────┐
+          │ DEVELOPER GUIDE  │  │  OPERATIONS RUNBOOK  │
+          │  (Development)   │  │     (Operations)     │
+          └──────────────────┘  └──────────────────────┘
+                   │                       │
+                   └───────────┬───────────┘
+                               ▼
+                    ┌────────────────────┐
+                    │ TECHNICAL REFERENCE│
+                    │  (Specifications)  │
+                    └────────────────────┘
+```
+
+---
+
+## 📖 Reading Paths
+
+### Path 1: New Developer (No K8s Experience)
+1. [GitOps Architecture - Overview](GITOPS-ARCHITECTURE.md#overview)
+2. [GitOps Architecture - GitOps Workflow](GITOPS-ARCHITECTURE.md#gitops-workflow)
+3. [Developer Guide - Understanding the Workflow](DEVELOPER-GUIDE.md#understanding-the-workflow)
+4. [Developer Guide - Deploying Your First Application](DEVELOPER-GUIDE.md#deploying-your-first-application)
+5. [Developer Guide - Troubleshooting](DEVELOPER-GUIDE.md#troubleshooting)
+
+### Path 2: Experienced Developer (Has K8s Experience)
+1. [GitOps Architecture - Repository Structure](GITOPS-ARCHITECTURE.md#repository-structure)
+2. [Developer Guide - Local Development Setup](DEVELOPER-GUIDE.md#local-development-setup)
+3. [Developer Guide - Deploying Your First Application](DEVELOPER-GUIDE.md#deploying-your-first-application)
+4. [Technical Reference - Helm Chart Reference](REFERENCE.md#helm-chart-reference)
+
+### Path 3: Platform Engineer / SRE
+1. [GitOps Architecture](GITOPS-ARCHITECTURE.md) (entire document)
+2. [Operations Runbook - Cluster Bootstrap](OPERATIONS-RUNBOOK.md#cluster-bootstrap)
+3. [Operations Runbook - Day-to-Day Operations](OPERATIONS-RUNBOOK.md#day-to-day-operations)
+4. [Operations Runbook - Troubleshooting](OPERATIONS-RUNBOOK.md#troubleshooting)
+5. [Technical Reference](REFERENCE.md) (as needed)
+
+### Path 4: Quick Reference
+1. [Developer Guide - Quick Reference](DEVELOPER-GUIDE.md#quick-reference)
+2. [Technical Reference - Configuration Reference](REFERENCE.md#configuration-reference)
+3. [Technical Reference - Glossary](REFERENCE.md#glossary)
+
+---
+
+## 🔍 Finding Information
+
+### How do I...?
+
+| Task | Documentation |
+|------|---------------|
+| **Deploy a new application** | [Developer Guide - Deploying Your First Application](DEVELOPER-GUIDE.md#deploying-your-first-application) |
+| **Update an existing application** | [Developer Guide - Updating an Existing Application](DEVELOPER-GUIDE.md#updating-an-existing-application) |
+| **Create and seal secrets** | [Developer Guide - Working with Secrets](DEVELOPER-GUIDE.md#working-with-secrets) |
+| **Troubleshoot deployment issues** | [Developer Guide - Troubleshooting](DEVELOPER-GUIDE.md#troubleshooting) |
+| **Bootstrap a new cluster** | [Operations Runbook - Cluster Bootstrap](OPERATIONS-RUNBOOK.md#cluster-bootstrap) |
+| **Scale an application** | [Operations Runbook - Scaling Applications](OPERATIONS-RUNBOOK.md#scaling-applications) |
+| **Roll back a deployment** | [Operations Runbook - Rolling Back Deployments](OPERATIONS-RUNBOOK.md#rolling-back-deployments) |
+| **Manage monitoring** | [Operations Runbook - Monitoring & Alerting](OPERATIONS-RUNBOOK.md#monitoring--alerting) |
+| **Understand ArgoCD config** | [Technical Reference - ArgoCD Configuration](REFERENCE.md#argocd-configuration) |
+| **Look up Helm values** | [Technical Reference - Helm Chart Reference](REFERENCE.md#helm-chart-reference) |
+| **Find component versions** | [Technical Reference - Version Matrix](REFERENCE.md#version-matrix) |
+
+---
+
+## 📊 System Overview
+
+### Cluster Architecture
+
+```
+┌──────────────────────────────────────────────────────────────┐
+│                     GitHub Repositories                      │
+│  ┌────────────┐  ┌────────────┐  ┌────────────────────────┐  │
+│  │   Config   │  │   Charts   │  │         Values         │  │
+│  │  (ArgoCD)  │  │ (Templates)│  │  (Environment Config)  │  │
+│  └────────────┘  └────────────┘  └────────────────────────┘  │
+└──────────────────────────────────────────────────────────────┘
+                               │
+                               ▼
+┌──────────────────────────────────────────────────────────────┐
+│                    ArgoCD (GitOps Engine)                    │
+│                    Sync every 60 seconds                     │
+└──────────────────────────────────────────────────────────────┘
+                               │
+                               ▼
+┌──────────────────────────────────────────────────────────────┐
+│                 Kubernetes Cluster (UpCloud)                 │
+│  ┌────────────────────────────────────────────────────────┐  │
+│  │  Infrastructure: Traefik, Cert-Manager, Kyverno        │  │
+│  ├────────────────────────────────────────────────────────┤  │
+│  │  Monitoring: Prometheus, Grafana, Loki, Fluent-Bit     │  │
+│  ├────────────────────────────────────────────────────────┤  │
+│  │  Applications: mcp10x, musicman, dot-ai-stack          │  │
+│  └────────────────────────────────────────────────────────┘  │
+└──────────────────────────────────────────────────────────────┘
+```
+
+### Key Technologies
+
+- **GitOps**: ArgoCD
+- **Kubernetes**: UpCloud Managed Kubernetes
+- **Ingress**: Traefik v2
+- **Certificates**: Cert-Manager + Let's Encrypt
+- **Policies**: Kyverno
+- **Secrets**: Sealed Secrets
+- **Monitoring**: Prometheus + Grafana
+- **Logging**: Loki + Fluent-Bit
+
+---
+
+## 🛠️ Common Tasks
+
+### Development Tasks
+
+```bash
+# Deploy new application
+cd ~/dev/k8s/launchpad
+# Create apps/myapp.yaml and helm-prod-values/myapp/values.yaml
+git add apps/myapp.yaml
+git commit -m "Add myapp"
+git push
+
+# Update application
+cd ~/dev/k8s/helm-prod-values
+vim myapp/values.yaml
+git commit -am "Update myapp config"
+git push
+
+# Create secret
+kubeseal --format=yaml --cert=pub-cert.pem \
+  < private/secret.yaml > secrets/secret-sealed.yaml
+git add secrets/secret-sealed.yaml
+git commit -m "Add sealed secret"
+git push
+```
+
+### Operations Tasks
+
+```bash
+# Check application status
+kubectl get applications -n argocd
+
+# View application details
+kubectl describe application myapp -n argocd
+
+# Force a hard refresh (auto-sync then applies any drift)
+kubectl patch application myapp -n argocd \
+  --type merge -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
+
+# Check pod logs
+kubectl logs <pod-name> -n myapp
+
+# Restart deployment
+kubectl rollout restart deployment myapp -n myapp
+```
+
+---
+
+## 🆘 Getting Help
+
+### Documentation Search Order
+
+1. **Quick Reference**: [Developer Guide - Quick Reference](DEVELOPER-GUIDE.md#quick-reference)
+2. **Troubleshooting**: [Developer Guide - Troubleshooting](DEVELOPER-GUIDE.md#troubleshooting) or [Operations Runbook - Troubleshooting](OPERATIONS-RUNBOOK.md#troubleshooting)
+3. **Technical Details**: [Technical Reference](REFERENCE.md)
+4. 
**Architecture Context**: [GitOps Architecture](GITOPS-ARCHITECTURE.md) + +### Support Channels + +- **Slack**: #platform-support +- **Issues**: Platform team +- **Emergencies**: Escalate via Slack + +--- + +## πŸ“ Document Maintenance + +### Updating Documentation + +If you find: +- Outdated information +- Missing procedures +- Errors or typos +- Areas needing clarification + +Please: +1. Create an issue or PR in the repository +2. Notify the platform team +3. Update the relevant documentation file + +### Documentation Structure + +``` +docs/ +β”œβ”€β”€ README.md # This file (index) +β”œβ”€β”€ GITOPS-ARCHITECTURE.md # Architecture overview +β”œβ”€β”€ DEVELOPER-GUIDE.md # Developer workflows +β”œβ”€β”€ OPERATIONS-RUNBOOK.md # Operations procedures +└── REFERENCE.md # Technical specifications +``` + +--- + +## πŸ”„ Documentation Versions + +**Current Version**: 1.0.0 +**Last Updated**: 2026-03-16 +**Maintained By**: Platform Team + +### Changelog + +- **v1.0.0 (2026-03-16)**: Initial comprehensive documentation release + - GitOps Architecture guide + - Developer Onboarding guide + - Operations Runbook + - Technical Reference + - Documentation index + +--- + +## 🎯 Next Steps + +Choose your path: + +- πŸ‘¨β€πŸ’» **New Developer?** Start with [Developer Guide](DEVELOPER-GUIDE.md) +- πŸ”§ **Platform Engineer?** Read [Operations Runbook](OPERATIONS-RUNBOOK.md) +- πŸ—οΈ **Architect?** Explore [GitOps Architecture](GITOPS-ARCHITECTURE.md) +- πŸ” **Need Details?** Check [Technical Reference](REFERENCE.md) + +--- + +**Welcome to the team! 
πŸš€** diff --git a/docs/REFERENCE.md b/docs/REFERENCE.md new file mode 100644 index 0000000..daf0d9b --- /dev/null +++ b/docs/REFERENCE.md @@ -0,0 +1,1070 @@ +# Technical Reference + +## Table of Contents +- [Architecture Components](#architecture-components) +- [Repository Reference](#repository-reference) +- [Helm Chart Reference](#helm-chart-reference) +- [ArgoCD Configuration](#argocd-configuration) +- [Infrastructure Components](#infrastructure-components) +- [Kyverno Policies](#kyverno-policies) +- [Configuration Reference](#configuration-reference) +- [API Endpoints](#api-endpoints) +- [Glossary](#glossary) + +--- + +## Architecture Components + +### Cluster Specifications + +| Component | Value | +|-----------|-------| +| **Provider** | UpCloud Managed Kubernetes | +| **Environment** | Production (internal use) | +| **Cluster Count** | Single cluster | +| **GitOps Tool** | ArgoCD | +| **Ingress Controller** | Traefik v2 | +| **Certificate Management** | Cert-Manager + Let's Encrypt | +| **Policy Engine** | Kyverno | +| **Secret Management** | Sealed Secrets (Bitnami) | +| **Monitoring** | Prometheus + Grafana | +| **Logging** | Loki + Fluent-Bit | +| **Container Scanning** | Trivy | + +### Network Architecture + +``` +Internet + β”‚ + β–Ό +[DNS: *.forteapps.net] + β”‚ + β–Ό +[UpCloud LoadBalancer] + β”‚ + β–Ό +[Traefik Ingress Controller] + β”‚ + β”œβ”€β”€β–Ί IngressRoute (TLS termination via Cert-Manager) + β”‚ + β”œβ”€β”€β–Ί Service (ClusterIP) + β”‚ β”‚ + β”‚ └──► Pod (Application Container) + β”‚ + └──► Service (Database - ClusterIP) + β”‚ + └──► StatefulSet (PostgreSQL) +``` + +--- + +## Repository Reference + +### Config Repository: `sturdy-adventure` + +**URL**: `https://github.com/snothub/sturdy-adventure.git` + +#### Directory Structure + +``` +sturdy-adventure/ +β”œβ”€β”€ bootstrap.sh # Cluster initialization script +β”œβ”€β”€ _app-of-apps.yaml # Root ArgoCD Application +β”‚ +β”œβ”€β”€ infra/ # Infrastructure applications +β”‚ β”œβ”€β”€ 
cluster-resources-application.yaml +β”‚ β”œβ”€β”€ enterprise-apps.yaml +β”‚ β”œβ”€β”€ traefik-application.yaml +β”‚ β”œβ”€β”€ cert-manager-application.yaml +β”‚ β”œβ”€β”€ kyverno.yaml +β”‚ β”œβ”€β”€ kyverno-policies.yaml +β”‚ β”œβ”€β”€ prometheus.yaml +β”‚ β”œβ”€β”€ grafana.yaml +β”‚ β”œβ”€β”€ loki.yaml +β”‚ β”œβ”€β”€ fluent-bit.yaml +β”‚ β”œβ”€β”€ trivy.yaml +β”‚ β”œβ”€β”€ sealedsecrets.yaml +β”‚ β”œβ”€β”€ secrets.yaml +β”‚ └── values/ +β”‚ β”œβ”€β”€ argocd-values.yaml +β”‚ β”œβ”€β”€ prometheus-values.yaml +β”‚ β”œβ”€β”€ grafana-values.yaml +β”‚ β”œβ”€β”€ loki-values.yaml +β”‚ └── fluent-bit-values.yaml +β”‚ +β”œβ”€β”€ apps/ # Business applications +β”‚ β”œβ”€β”€ mcp10x.yaml +β”‚ β”œβ”€β”€ musicman.yaml +β”‚ β”œβ”€β”€ dot-ai-stack.yaml +β”‚ └── argo-mcp.yaml +β”‚ +β”œβ”€β”€ cluster-resources/ # Cluster-level resources +β”‚ β”œβ”€β”€ cert-manager-namespace.yaml +β”‚ β”œβ”€β”€ secrets-namespace.yaml +β”‚ β”œβ”€β”€ letsencrypt-issuer.yaml +β”‚ β”œβ”€β”€ kyverno-config.yaml +β”‚ β”œβ”€β”€ argocd-notifications-secret-sealed.yaml +β”‚ β”œβ”€β”€ snothub-repo-credentials-sealed.yaml +β”‚ β”œβ”€β”€ forte10x-repo-credentials-sealed.yaml +β”‚ β”œβ”€β”€ mcp10x-repo-credentials-sealed.yaml +β”‚ └── policies/ +β”‚ β”œβ”€β”€ deployment-verifier.yaml +β”‚ β”œβ”€β”€ label-checker.yaml +β”‚ β”œβ”€β”€ bare-pod-cleaner.yaml +β”‚ β”œβ”€β”€ replicaset-cleaner.yaml +β”‚ β”œβ”€β”€ default-ns-blocker.yaml +β”‚ β”œβ”€β”€ secret-cloner.yaml +β”‚ └── auth-sidecar-injector.yaml +β”‚ +β”œβ”€β”€ secrets/ # Application secrets (sealed) +β”‚ β”œβ”€β”€ argocd-mcp-credentials.yaml +β”‚ β”œβ”€β”€ dot-ai-secrets.yaml +β”‚ β”œβ”€β”€ mcp10x-credentials-sealed.yaml +β”‚ └── musicman-credentials.yaml +β”‚ +β”œβ”€β”€ private/ # Local-only (Git-ignored) +β”‚ β”œβ”€β”€ *.yaml +β”‚ └── *.sh +β”‚ +└── docs/ # Documentation + β”œβ”€β”€ GITOPS-ARCHITECTURE.md + β”œβ”€β”€ DEVELOPER-GUIDE.md + β”œβ”€β”€ OPERATIONS-RUNBOOK.md + └── REFERENCE.md +``` + +#### Key Files + +**`bootstrap.sh`** +```bash +#!/bin/zsh +# 
Initializes cluster with ArgoCD + +ArgoCd() { + helm upgrade --install argocd argo-cd \ + --repo https://argoproj.github.io/argo-helm \ + --namespace argocd --create-namespace \ + --values infra/values/argocd-values.yaml \ + --set notifications.context.clusterName="$CLUSTER_NAME" \ + --timeout 60s --atomic + + kubectl apply -f _app-of-apps.yaml -n argocd +} +``` + +**`_app-of-apps.yaml`** +```yaml +apiVersion: argoproj.io/v1alpha1 +kind: Application +metadata: + name: infrastructure-apps + namespace: argocd +spec: + project: default + source: + repoURL: https://github.com/snothub/sturdy-adventure.git + path: infra + destination: + server: https://kubernetes.default.svc + namespace: default + syncPolicy: + automated: + prune: true + selfHeal: true +``` + +--- + +### Helm Charts Repository: `forte-helm` + +**URL**: `https://github.com/snothub/forte-helm` + +#### Chart: `forteapp` + +**Version**: 0.1.0 +**App Version**: 1.0.0 +**Type**: application + +##### Templates + +| Template | Purpose | +|----------|---------| +| `_helpers.tpl` | Template helper functions | +| `namespace.yaml` | Namespace resource | +| `deployment.yaml` | Main application Deployment | +| `service.yaml` | ClusterIP Service | +| `ingressroute.yaml` | Traefik IngressRoute | +| `certificate.yaml` | Cert-Manager Certificate | +| `configmap.yaml` | Application ConfigMap | +| `secret-auth-tokens.yaml` | Authentication tokens | +| `hpa.yaml` | Horizontal Pod Autoscaler | +| `database-statefulset.yaml` | Optional PostgreSQL StatefulSet | +| `database-service.yaml` | PostgreSQL Service | + +##### Default Values Schema + +```yaml +app: + image: + repository: "" # Required + tag: "" # Required + pullPolicy: IfNotPresent + containerPort: 3000 + + replicaCount: 1 + + resources: + requests: + cpu: 100m + memory: 128Mi + limits: + cpu: 500m + memory: 512Mi + + hpa: + enabled: false + minReplicas: 2 + maxReplicas: 10 + targetCPUUtilizationPercentage: 70 + + extraEnv: [] + # - name: KEY + # value: "value" + + 
envSecretName: "" # Reference to Secret + nodeEnv: production + +db: + enabled: false + name: postgres + image: + repository: postgres + tag: "16-alpine" + + service: + type: ClusterIP + port: 5432 + targetPort: 5432 + + persistence: + enabled: true + storageClass: "" + accessMode: ReadWriteOnce + size: 5Gi + + resources: + requests: + memory: "256Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "1000m" + + extraEnv: [] + envSecretName: "" + + livenessProbe: + exec: + command: + - pg_isready + - -U + - db_user + - -d + - db_name + initialDelaySeconds: 30 + periodSeconds: 10 + + readinessProbe: + exec: + command: + - pg_isready + - -U + - db_user + - -d + - db_name + initialDelaySeconds: 5 + periodSeconds: 5 + +service: + type: ClusterIP + port: 3000 + +ingress: + enabled: false + host: "" + entrypoint: websecure + tls: + enabled: true + secretName: "" + clusterIssuer: letsencrypt-prod + +auth: + enabled: false + type: token # Options: "token", "oidc" + oidc: + authority: "" + clientId: "" + scopes: "" + callbackPath: /auth/callback + tokens: [] + # - token1 + # - token2 + +configmap: [] +# KEY: value +``` + +--- + +### Helm Values Repository: `helm-values` + +**URL**: `git@github.com:fortedigital/helm-values.git` + +#### Structure + +``` +helm-values/ +β”œβ”€β”€ mcp10x/ +β”‚ └── values.yaml +β”œβ”€β”€ musicman/ +β”‚ └── values.yaml +β”œβ”€β”€ mcpcoder/ +β”‚ └── values.yaml +└── argocd-mcp/ + └── values.yaml +``` + +#### Example: `mcp10x/values.yaml` + +```yaml +app: + image: + repository: ghcr.io/fortedigital/10x + tag: 2.0.4 # Updated by CI/CD + + extraEnv: + - name: PORT + value: "3000" + - name: SKILLS_DIR + value: "/app/skills" + - name: FLOWCASE_ENDPOINT + value: "https://forte.cvpartner.com/api/" + + envSecretName: "app-credentials" + +auth: + enabled: false + tokens: + - d4f88f6d9292c10cc3e21c4aad56d2be485db532b54fe961d738e1137d247823 + +ingress: + enabled: true + host: mcp10x.forteapps.net +``` + +--- + +## Helm Chart Reference + +### Template Functions + 
+#### `forteapp.fullname`
+```yaml
+{{ include "forteapp.fullname" . }}
+# Output:
+```
+
+#### `forteapp.labels`
+```yaml
+{{ include "forteapp.labels" . }}
+# Output:
+# app.kubernetes.io/name: forteapp
+# app.kubernetes.io/instance: <release-name>
+# app.kubernetes.io/version: <app-version>
+# app.kubernetes.io/managed-by: Helm
+```
+
+#### `forteapp.selectorLabels`
+```yaml
+{{ include "forteapp.selectorLabels" . }}
+# Output:
+# app.kubernetes.io/name: forteapp
+# app.kubernetes.io/instance: <release-name>
+```
+
+### Deployment Specification
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: {{ include "forteapp.fullname" . }}
+  labels:
+    {{- include "forteapp.labels" . | nindent 4 }}
+spec:
+  replicas: {{ .Values.app.replicaCount }}
+  selector:
+    matchLabels:
+      {{- include "forteapp.selectorLabels" . | nindent 6 }}
+  template:
+    metadata:
+      annotations:
+        policies.forteapps.io/auth: {{ .Values.auth.enabled | quote }}
+      labels:
+        {{- include "forteapp.selectorLabels" . | nindent 8 }}
+    spec:
+      containers:
+      - name: app
+        image: "{{ .Values.app.image.repository }}:{{ .Values.app.image.tag }}"
+        imagePullPolicy: {{ .Values.app.image.pullPolicy }}
+        ports:
+        - name: http
+          containerPort: {{ .Values.app.image.containerPort }}
+        env:
+        - name: NODE_ENV
+          value: {{ .Values.app.nodeEnv | quote }}
+        {{- with .Values.app.extraEnv }}
+        {{- toYaml . | nindent 8 }}
+        {{- end }}
+        {{- if .Values.app.envSecretName }}
+        envFrom:
+        - secretRef:
+            name: {{ .Values.app.envSecretName }}
+        {{- end }}
+        resources:
+          {{- toYaml .Values.app.resources | nindent 10 }}
+        securityContext:
+          readOnlyRootFilesystem: true
+          allowPrivilegeEscalation: false
+```
+
+### IngressRoute Specification
+
+```yaml
+apiVersion: traefik.io/v1alpha1
+kind: IngressRoute
+metadata:
+  name: {{ include "forteapp.fullname" . }}
+spec:
+  entryPoints:
+    - {{ .Values.ingress.entrypoint }}
+  routes:
+    - match: Host(`{{ .Values.ingress.host }}`)
+      kind: Rule
+      services:
+        - name: {{ include "forteapp.fullname" . }}
+          port: {{ .Values.service.port }}
+  {{- if .Values.ingress.tls.enabled }}
+  tls:
+    secretName: {{ default .Release.Name .Values.ingress.tls.secretName }}-tls
+  {{- end }}
+```
+
+### Certificate Specification
+
+```yaml
+apiVersion: cert-manager.io/v1
+kind: Certificate
+metadata:
+  name: {{ include "forteapp.fullname" . }}-tls
+spec:
+  secretName: {{ default .Release.Name .Values.ingress.tls.secretName }}-tls
+  issuerRef:
+    name: {{ .Values.ingress.tls.clusterIssuer }}
+    kind: ClusterIssuer
+  dnsNames:
+    - {{ .Values.ingress.host }}
+```
+
+---
+
+## ArgoCD Configuration
+
+### Application Manifest Schema
+
+```yaml
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: <app-name>
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "1"
+    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: ""
+    notifications.argoproj.io/subscribe.on-sync-failed.slack: ""
+    notifications.argoproj.io/subscribe.on-degraded.slack: ""
+  labels:
+    app.kubernetes.io/name: <app-name>
+    app.kubernetes.io/part-of: apps
+    app.kubernetes.io/managed-by: argocd
+  finalizers:
+    - resources-finalizer.argocd.argoproj.io
+
+spec:
+  project: default
+
+  # Multi-source configuration
+  sources:
+    - repoURL: https://github.com/snothub/forte-helm
+      path: forteapp
+      targetRevision: HEAD
+      helm:
+        valueFiles:
+          - $values/<app-name>/values.yaml
+
+    - repoURL: git@github.com:fortedigital/helm-values.git
+      targetRevision: HEAD
+      ref: values
+
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: <app-namespace>
+
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+      allowEmpty: false
+
+    syncOptions:
+      - CreateNamespace=true
+      - Validate=true
+      - ServerSideApply=true
+      - Replace=false
+
+    retry:
+      limit: 5
+      backoff:
+        duration: 5s
+        factor: 2
+        maxDuration: 3m
+
+  ignoreDifferences:
+    - group: apps
+      kind: Deployment
+      jsonPointers:
+        - /spec/replicas
+```
+
+### Sync Waves
+
+| Wave | Components | Purpose |
+|------|------------|---------|
+| `-1` | Namespaces | Create namespaces first |
+| `0` | Kyverno | Install policy engine |
+| `1` | Cluster resources, infrastructure | Base infrastructure |
+| `2+` | Applications | Business applications |
+
+### Sync Options
+
+| Option | Description |
+|--------|-------------|
+| `CreateNamespace=true` | Automatically create target namespace |
+| `Validate=true` | Validate resources before applying |
+| `ServerSideApply=true` | Use server-side apply (safer) |
+| `Replace=false` | Don't use kubectl replace |
+| `prune: true` | Delete resources not in Git (set under `syncPolicy.automated`, not `syncOptions`) |
+
+### Retry Policy
+
+```yaml
+retry:
+  limit: 5           # Max retry attempts
+  backoff:
+    duration: 5s     # Initial backoff
+    factor: 2        # Exponential factor
+    maxDuration: 3m  # Max backoff time
+```
+
+**Retry Schedule**:
+1. 5 seconds
+2. 10 seconds
+3. 20 seconds
+4. 40 seconds
+5. 80 seconds (still below the 3-minute `maxDuration` cap)
+
+---
+
+## Infrastructure Components
+
+### Traefik
+
+**Chart**: `traefik/traefik`
+**Version**: Latest
+**Namespace**: `traefik`
+
+**Configuration**:
+```yaml
+# infra/traefik-application.yaml
+replicas: 2
+
+service:
+  type: LoadBalancer
+
+ingressRoute:
+  dashboard:
+    enabled: false
+
+ports:
+  web:
+    redirectTo: websecure  # HTTP → HTTPS redirect
+  websecure:
+    tls:
+      enabled: true
+```
+
+**Endpoints**:
+- HTTP: `:80` → Redirects to HTTPS
+- HTTPS: `:443`
+
+### Cert-Manager
+
+**Chart**: `jetstack/cert-manager`
+**Namespace**: `cert-manager`
+
+**ClusterIssuer**:
+```yaml
+apiVersion: cert-manager.io/v1
+kind: ClusterIssuer
+metadata:
+  name: letsencrypt-prod
+spec:
+  acme:
+    server: https://acme-v02.api.letsencrypt.org/directory
+    email: admin@forteapps.net
+    privateKeySecretRef:
+      name: letsencrypt-prod-key
+    solvers:
+    - http01:
+        ingress:
+          class: traefik
+```
+
+### Kyverno
+
+**Chart**: `kyverno/kyverno`
+**Namespace**: `kyverno`
+
+**Policies**:
+- Secret cloner
+- Default namespace blocker
+- Bare pod cleaner
+- ReplicaSet cleaner
+- Deployment verifier
+- Auth sidecar injector
+
+### Sealed Secrets
+
+**Chart**: `sealed-secrets/sealed-secrets-controller`
+**Namespace**: `kube-system`
+
+**Public Certificate**:
+```bash
+kubeseal --fetch-cert \
+  --controller-name=sealed-secrets-controller \
+  --controller-namespace=kube-system \
+  > pub-cert.pem
+```
+
+### Prometheus
+
+**Chart**: `prometheus-community/prometheus`
+**Namespace**: `monitoring`
+
+**Configuration**:
+```yaml
+server:
+  persistentVolume:
+    enabled: true
+    size: 10Gi
+
+alertmanager:
+  enabled: false
+
+nodeExporter:
+  enabled: true
+
+kubeStateMetrics:
+  enabled: true
+```
+
+### Grafana
+
+**Chart**: `grafana/grafana`
+**Namespace**: `monitoring`
+
+**Datasources**:
+- Prometheus
+- Loki
+
+### Loki
+
+**Chart**: `grafana/loki-stack`
+**Namespace**: `monitoring`
+
+**Configuration**:
+```yaml
+loki:
+  persistence:
+    enabled: true
+    size: 10Gi
+
+promtail:
+  enabled: false  # Using Fluent-Bit instead
+```
+
+### Fluent-Bit
+
+**Chart**: `fluent/fluent-bit`
+**Namespace**: `monitoring`
+
+**Output**: Loki
+
+---
+
+## Kyverno Policies
+
+### Secret Cloner
+
+**File**: `cluster-resources/policies/secret-cloner.yaml`
+
+**Purpose**: Automatically clone secrets from `secrets` namespace to new namespaces
+
+```yaml
+apiVersion: kyverno.io/v1
+kind: ClusterPolicy
+metadata:
+  name: sync-secret-with-multi-clone
+spec:
+  rules:
+  - name: clone-secret
+    match:
+      any:
+      - resources:
+          kinds:
+          - Namespace
+    generate:
+      apiVersion: v1
+      kind: Secret
+      name: "{{ request.object.metadata.name }}"
+      namespace: "{{ request.object.metadata.name }}"
+      synchronize: true
+      clone:
+        namespace: secrets
+        name: shared-credentials
+```
+
+**Label Requirement**: Secrets must have `allowedToBeCloned: "true"`
+
+### Default Namespace Blocker
+
+**File**: `cluster-resources/policies/default-ns-blocker.yaml`
+
+**Purpose**: Prevent resources from being created in `default` namespace
+
+```yaml
+apiVersion: kyverno.io/v1
+kind: ClusterPolicy
+metadata:
+  name: disallow-default-namespace
+spec:
+  validationFailureAction: enforce
+  rules:
+  - name: validate-namespace
+    match:
+      any:
+      - resources:
+          kinds:
+          - Pod
+          - Deployment
+          - Service
+    validate:
+      message: "Using 'default' namespace is not allowed"
+      pattern:
+        metadata:
+          namespace: "!default"
+```
+
+### Bare Pod Cleaner
+
+**File**: `cluster-resources/policies/bare-pod-cleaner.yaml`
+
+**Purpose**: Delete pods without ownerReferences (not managed by Deployment/StatefulSet)
+
+```yaml
+apiVersion: kyverno.io/v1
+kind: ClusterPolicy
+metadata:
+  name: cleanup-bare-pods
+spec:
+  rules:
+  - name: delete-bare-pod
+    match:
+      any:
+      - resources:
+          kinds:
+          - Pod
+    preconditions:
+      all:
+      - key: "{{ request.object.metadata.ownerReferences[] || '' }}"
+        operator: Equals
+        value: ""
+    validate:
+      message: "Bare pods (without controllers) are not allowed"
+      deny: {}
+```
+
+### Auth Sidecar Injector
+
+**File**: `cluster-resources/policies/auth-sidecar-injector.yaml`
+
+**Purpose**: Inject authentication sidecar based on pod annotations
+
+```yaml
+apiVersion: kyverno.io/v1
+kind: ClusterPolicy
+metadata:
+  name: inject-auth-sidecar
+spec:
+  rules:
+  - name: inject-sidecar
+    match:
+      any:
+      - resources:
+          kinds:
+          - Pod
+    preconditions:
+      all:
+      - key: "{{ request.object.metadata.annotations.\"policies.forteapps.io/auth\" || '' }}"
+        operator: Equals
+        value: "true"
+    mutate:
+      patchStrategicMerge:
+        spec:
+          containers:
+          - name: auth-proxy
+            image: oauth2-proxy/oauth2-proxy:latest
+            # ... additional configuration
+```
+
+---
+
+## Configuration Reference
+
+### Environment Variables
+
+Common environment variables used across applications:
+
+| Variable | Purpose | Example |
+|----------|---------|---------|
+| `NODE_ENV` | Node.js environment | `production` |
+| `PORT` | Application port | `3000` |
+| `DB_HOST` | Database host | `postgres` |
+| `DB_PORT` | Database port | `5432` |
+| `DB_USER` | Database user | `app_user` |
+| `DB_NAME` | Database name | `app_db` |
+| `DB_PASSWORD` | Database password | From secret |
+| `API_KEY` | External API key | From secret |
+
+### Resource Limits
+
+Recommended resource allocation:
+
+| Application Type | CPU Request | Memory Request | CPU Limit | Memory Limit |
+|------------------|-------------|----------------|-----------|--------------|
+| **Lightweight API** | 100m | 128Mi | 500m | 512Mi |
+| **Standard Web App** | 200m | 256Mi | 1000m | 1Gi |
+| **Heavy Processing** | 500m | 512Mi | 2000m | 2Gi |
+| **Database** | 250m | 256Mi | 1000m | 1Gi |
+
+### Storage Classes
+
+Default storage class used: **UpCloud default** (varies by provider)
+
+```yaml
+persistence:
+  enabled: true
+  storageClass: ""  # Uses default
+  accessMode: ReadWriteOnce
+  size: 5Gi
+```
+
+---
+
+## API Endpoints
+
+### ArgoCD API
+
+```
+# Server
+https://argocd.127.0.0.1.nip.io
+
+# Applications endpoint
+GET /api/v1/applications
+
+# Application details
+GET /api/v1/applications/{name}
+
+# Sync application
+POST /api/v1/applications/{name}/sync
+```
+
+### Prometheus API
+
+```
+# Query endpoint
+GET /api/v1/query?query={promql}
+
+# Query range
+GET /api/v1/query_range?query={promql}&start={time}&end={time}&step={duration}
+
+# Metrics
+GET /api/v1/label/__name__/values
+```
+
+### Loki API
+
+```
+# Query logs
+GET /loki/api/v1/query?query={logql}
+
+# Query range
+GET /loki/api/v1/query_range?query={logql}&start={time}&end={time}
+
+# Push logs
+POST /loki/api/v1/push
+```
+
+---
+
+## Glossary
+
+### Terms
+
+**App-of-Apps**: ArgoCD pattern where a parent Application manages child Applications
+
+**GitOps**: Operations approach where Git is the single source of truth
+
+**IngressRoute**: Traefik CRD for routing external traffic to services
+
+**Multi-Source**: ArgoCD feature allowing multiple Git sources per Application
+
+**SealedSecret**: Encrypted secret that can be safely stored in Git
+
+**Sync Wave**: Ordered deployment using annotations
+
+**Self-Heal**: ArgoCD automatically reverts manual cluster changes
+
+**Prune**: Automatically delete resources removed from Git
+
+---
+
+## Annotations Reference
+
+### ArgoCD Annotations
+
+```yaml
+# Sync wave (deployment order)
+argocd.argoproj.io/sync-wave: "1"
+
+# Refresh application
+argocd.argoproj.io/refresh: "hard"
+
+# Compare options
+argocd.argoproj.io/compare-options: IgnoreExtraneous
+
+# Sync options per resource
+argocd.argoproj.io/sync-options: Prune=false
+```
+
+### Kyverno Annotations
+
+```yaml
+# Exclude from policy
+policies.kyverno.io/exclude: "true"
+
+# Severity
+policies.kyverno.io/severity: high
+```
+
+### Custom Annotations
+
+```yaml
+# Authentication enabled
+policies.forteapps.io/auth: "true"
+
+# OIDC configuration
+policies.forteapps.io/auth-oidc-authority: "https://..."
+policies.forteapps.io/auth-oidc-client-id: "client-id"
+```
+
+---
+
+## Labels Reference
+
+### Standard Labels
+
+```yaml
+# Application name
+app.kubernetes.io/name: myapp
+
+# Application instance
+app.kubernetes.io/instance: myapp
+
+# Application version
+app.kubernetes.io/version: "1.0.0"
+
+# Component type
+app.kubernetes.io/component: frontend
+
+# Part of larger application
+app.kubernetes.io/part-of: ecommerce
+
+# Managed by
+app.kubernetes.io/managed-by: argocd
+```
+
+### Custom Labels
+
+```yaml
+# Allow secret cloning
+allowedToBeCloned: "true"
+
+# Environment
+environment: production
+
+# Team ownership
+team: platform
+```
+
+---
+
+## Version Matrix
+
+### Component Versions
+
+| Component | Version | Chart Version |
+|-----------|---------|---------------|
+| **ArgoCD** | 2.9.0+ | Latest |
+| **Traefik** | 2.10.0+ | Latest |
+| **Cert-Manager** | 1.13.0+ | Latest |
+| **Kyverno** | 1.10.0+ | Latest |
+| **Sealed Secrets** | 0.24.0+ | Latest |
+| **Prometheus** | 2.47.0+ | Latest |
+| **Grafana** | 10.0.0+ | Latest |
+| **Loki** | 2.9.0+ | Latest |
+| **Fluent-Bit** | 2.1.0+ | Latest |
+| **PostgreSQL** | 16-alpine | N/A |
+| **Trivy** | Latest | Latest |
+
+### Kubernetes Compatibility
+
+- **Minimum**: 1.24+
+- **Tested**: 1.28+
+- **Recommended**: Latest stable
+
+---
+
+**Last Updated**: 2026-03-16
+**Maintained By**: Platform Team
+**Version**: 1.0.0
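
The retry policy in the ArgoCD Configuration section (`duration: 5s`, `factor: 2`, `maxDuration: 3m`, `limit: 5`) lists a schedule of 5, 10, 20, 40 and 80 seconds. The sketch below recomputes that schedule, assuming plain exponential backoff with a cap. `backoff_schedule` is a hypothetical helper written for illustration; it is not part of ArgoCD or of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints one delay (in seconds) per retry attempt.
# Assumes delay = initial * factor^(attempt-1), capped at max_delay.
backoff_schedule() {
  local delay=$1 factor=$2 max_delay=$3 limit=$4
  local attempt
  for ((attempt = 1; attempt <= limit; attempt++)); do
    # Apply the cap before reporting the delay for this attempt.
    if ((delay > max_delay)); then
      delay=$max_delay
    fi
    echo "$delay"
    delay=$((delay * factor))
  done
}

# Values from the retry policy: 5s initial, factor 2, 3m (180s) cap, 5 attempts.
backoff_schedule 5 2 180 5
```

With these inputs the cap never applies, because the fifth delay (80 seconds) is still under 180 seconds. It would only matter with a higher `limit` or a larger initial `duration`.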