Danijel Simeunovic
2026-03-16 11:00:42 +01:00
parent 275c100af5
commit d02da33700
6 changed files with 4766 additions and 109 deletions

532
README.md

@@ -1,149 +1,463 @@
## Overview
# Kubernetes Cluster - GitOps Configuration
This is a **Kubernetes cluster bootstrapping and GitOps configuration repository** using ArgoCD. It defines the infrastructure-as-code for deploying and managing applications, services, and policies on Kubernetes clusters.
> **Kubernetes cluster bootstrapping and GitOps configuration repository** using ArgoCD for UpCloud Managed Kubernetes
## Repository Structure
[![GitOps](https://img.shields.io/badge/GitOps-ArgoCD-blue)](https://argoproj.github.io/cd/)
[![Kubernetes](https://img.shields.io/badge/Kubernetes-UpCloud-orange)](https://upcloud.com/)
---
## 📚 Complete Documentation
**New developers and operators**: Please refer to our comprehensive documentation for detailed guides and references:
### 🎯 [**START HERE: Documentation Index**](docs/README.md)
| Document | Description | Audience |
|----------|-------------|----------|
| **[GitOps Architecture](docs/GITOPS-ARCHITECTURE.md)** | System architecture, repository structure, GitOps workflows, security model | Everyone (start here) |
| **[Developer Guide](docs/DEVELOPER-GUIDE.md)** | Local setup, deploying apps, managing secrets, troubleshooting | Developers |
| **[Operations Runbook](docs/OPERATIONS-RUNBOOK.md)** | Cluster bootstrap, day-to-day operations, incident response, maintenance | Platform Engineers, SREs |
| **[Technical Reference](docs/REFERENCE.md)** | Component specs, Helm charts, ArgoCD config, Kyverno policies, API docs | Everyone (reference) |
---
## 🚀 Quick Start
### For New Developers
```bash
# 1. Clone repositories
git clone https://github.com/snothub/sturdy-adventure.git
git clone git@github.com:fortedigital/helm-values.git
# 2. Read the guides
# - Start: docs/GITOPS-ARCHITECTURE.md
# - Follow: docs/DEVELOPER-GUIDE.md
# 3. Deploy your first app (see Developer Guide)
```
### For Operators
```bash
# 1. Bootstrap new cluster
./bootstrap.sh
# 2. Verify deployment
kubectl get applications -n argocd
kubectl get pods --all-namespaces
# 3. Read Operations Runbook for day-to-day tasks
```
---
## 📋 Overview
This repository contains the complete GitOps configuration for our Kubernetes cluster, using the **App-of-Apps pattern** with ArgoCD.
### What's Inside
- **Infrastructure Applications**: Traefik, Cert-Manager, Kyverno, Prometheus, Grafana, Loki, Sealed Secrets
- **Business Applications**: MCP10X, MusicMan, Dot-AI Stack, ArgoCD MCP
- **Policies**: Kyverno security policies for secret management, namespace controls, pod verification
- **Monitoring**: Full observability stack with metrics, logs, and alerting
- **Secrets**: Sealed Secrets for secure Git storage
### Key Features
- **GitOps-Native**: Git is the single source of truth
- **Auto-Sync**: Changes automatically deployed (60s reconciliation)
- **Self-Healing**: Manual cluster changes are reverted
- **Multi-Source**: Separate chart templates from configuration
- **Policy Enforcement**: Kyverno ensures security and compliance
- **TLS Everywhere**: Automatic Let's Encrypt certificates
- **Full Observability**: Prometheus, Grafana, Loki integration
---
## 🗂️ Repository Structure
```
.
├── bootstrap.sh                              # Main bootstrap script to initialize ArgoCD and cluster
├── _app-of-apps.yaml                         # App-of-apps pattern: main ArgoCD Application that manages all other apps
├── apps/                                     # Business application resources
│   ├── feedback-hub.yaml                     # Feedback Hub test app
│   ├── musicman.yaml                         # Music Man hackathon app
│   └── dot-ai-stack.yaml                     # dot-ai AI assistant stack
├── infra/                                    # Individual ArgoCD Application resources for infrastructure
│   ├── enterprise-apps.yaml                  # Enterprise apps: parent Application that syncs everything in "apps" folder
│   ├── traefik-application.yaml              # Ingress controller (Traefik)
│   ├── cert-manager-application.yaml         # TLS certificate management
│   ├── kyverno.yaml                          # Policy engine for security
│   ├── kyverno-policies.yaml                 # Kyverno policy definitions
│   ├── prometheus.yaml                       # Metrics & monitoring
│   ├── grafana.yaml                          # Monitoring visualization
│   ├── loki.yaml                             # Log aggregation
│   ├── fluent-bit.yaml                       # Log shipping
│   ├── trivy.yaml                            # Container scanning
│   ├── sealedsecrets.yaml                    # Secret encryption
│   ├── cluster-resources-application.yaml    # Cluster-wide resources
│   └── values/                               # Helm value overrides for ArgoCD and services
│       ├── argocd-values.yaml                # ArgoCD server configuration
│       ├── prometheus-values.yaml
│       ├── grafana-values.yaml
│       ├── loki-values.yaml
│       └── fluent-bit-values.yaml
└── cluster-resources/                        # Cluster-level configurations managed by cluster-resources-application.yaml
    ├── cert-manager-namespace.yaml
    ├── secrets-namespace.yaml                # Namespace for secrets
    ├── letsencrypt-issuer.yaml               # TLS certificate issuer
    ├── kyverno-config.yaml                   # Security policies and secret syncing
    ├── argocd-notifications-secret-sealed.yaml  # Sealed secret for ArgoCD notifications
    └── policies/                             # Kyverno policy definitions
        ├── deployment-verifier.yaml          # Policy to verify pods have controllers
        ├── label-checker.yaml                # Policy to check labels
        ├── bare-pod-cleaner.yaml             # Policy to clean up pods without controllers
        ├── replicaset-cleaner.yaml           # Policy to clean up orphaned replica sets
        ├── default-ns-blocker.yaml           # Policy to block use of default namespace
        └── secret-cloner.yaml                # Policy to clone secrets across namespaces
├── bootstrap.sh                        # Cluster initialization script
├── _app-of-apps.yaml                   # Root ArgoCD Application (App-of-Apps pattern)
├── infra/                              # Infrastructure ArgoCD Applications
│   ├── enterprise-apps.yaml            # Manages all apps in apps/ folder
│   ├── traefik-application.yaml
│   ├── cert-manager-application.yaml
│   ├── kyverno.yaml
│   ├── prometheus.yaml
│   ├── grafana.yaml
│   ├── loki.yaml
│   ├── fluent-bit.yaml
│   ├── trivy.yaml
│   ├── sealedsecrets.yaml
│   └── values/                         # Helm value overrides
├── apps/                               # Business Applications
│   ├── mcp10x.yaml
│   ├── musicman.yaml
│   ├── dot-ai-stack.yaml
│   └── argo-mcp.yaml
├── cluster-resources/                  # Cluster-wide Kubernetes resources
│   ├── letsencrypt-issuer.yaml
│   ├── kyverno-config.yaml
│   ├── *-sealed.yaml                   # Sealed secrets
│   └── policies/                       # Kyverno policies
│       ├── secret-cloner.yaml
│       ├── default-ns-blocker.yaml
│       ├── bare-pod-cleaner.yaml
│       └── auth-sidecar-injector.yaml
├── secrets/                            # Application secrets (sealed)
│   └── *-credentials-sealed.yaml
├── private/                            # Local-only files (Git-ignored)
│   └── *.yaml                          # Unsealed secrets (never committed)
└── docs/                               # 📚 Comprehensive documentation
    ├── README.md                       # Documentation index
    ├── GITOPS-ARCHITECTURE.md          # Architecture guide
    ├── DEVELOPER-GUIDE.md              # Developer onboarding
    ├── OPERATIONS-RUNBOOK.md           # Operations procedures
    └── REFERENCE.md                    # Technical reference
```
## Architecture & Key Concepts
**See [GitOps Architecture - Repository Structure](docs/GITOPS-ARCHITECTURE.md#repository-structure) for detailed explanation.**
### GitOps Model
- **App-of-Apps Pattern**: `_app-of-apps.yaml` is the root Application that manages all infrastructure applications
- **App-of-Apps Pattern**: `infra/enterprise-apps.yaml` is the main Application that manages all custom applications
- **Source of Truth**: GitHub repository (`https://github.com/snothub/sturdy-adventure.git`) is the single source of truth
- **Auto-sync**: All Applications have automated sync enabled with auto-pruning and self-healing
- **Namespace Creation**: `CreateNamespace=true` allows ArgoCD to create namespaces as needed
---
### Key Components
## 🏗️ Architecture
1. **Traefik** - Kubernetes Ingress controller for routing external traffic with HTTP/HTTPS redirect
2. **Cert-Manager** - Automates TLS certificate management with Let's Encrypt (see `letsencrypt-issuer.yaml`)
3. **Kyverno** - Policy engine that enforces security rules and syncs secrets across namespaces (via `sync-secret-with-multi-clone` policy)
4. **Monitoring Stack** - Prometheus (metrics) + Grafana (visualization) + Loki (logs) + Fluent-Bit (log shipping)
5. **Trivy** - Container vulnerability scanning
6. **Sealed Secrets** - Encrypts secrets for safe storage in Git
### Three-Repository Pattern
### Secret Management
- **Kyverno ClusterPolicy**: Automatically clones secrets from the `secrets` namespace to new namespaces when they're created
- Only secrets labeled `allowedToBeCloned: "true"` are cloned
- Syncing happens automatically via `synchronize: true` in the policy
| Repository | Purpose | You Edit |
|------------|---------|----------|
| **[sturdy-adventure](https://github.com/snothub/sturdy-adventure.git)** (this repo) | ArgoCD Applications, cluster resources | ✅ Often |
| **[forte-helm](https://github.com/snothub/forte-helm)** | Generic Helm chart templates | ❌ Rarely |
| **helm-values** (`git@github.com:fortedigital/helm-values.git`) | App-specific configuration & versions | ✅ Sometimes |
### Network Configuration
- ArgoCD UI: `argocd.127.0.0.1.nip.io` (local development)
- Server runs in insecure mode (`--insecure`, `--disable-auth`) - suitable for local/dev clusters
- Traefik routes to multiple services via Kubernetes Ingress
### GitOps Workflow
## Common Commands
```
Developer commits code → CI/CD builds image → Updates helm-values → ArgoCD syncs → Deployed to cluster
```
**Learn more**: [GitOps Architecture - GitOps Workflow](docs/GITOPS-ARCHITECTURE.md#gitops-workflow)
---
## 🔧 Common Tasks
### Deploy a New Application
**See detailed guide**: [Developer Guide - Deploying Your First Application](docs/DEVELOPER-GUIDE.md#deploying-your-first-application)
**Quick version**:
1. Create `apps/myapp.yaml` (ArgoCD Application manifest)
2. Create `helm-values/myapp/values.yaml` (configuration)
3. Create sealed secrets if needed
4. Commit and push - ArgoCD auto-syncs!
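
As a rough sketch of step 1, the function below prints a multi-source Application manifest for one app. The two repository URLs are the ones named in this README; the chart path `charts/generic-app`, the `main` revisions, the `default` project, and the namespace-per-app layout are assumptions made for illustration, so compare against an existing file in `apps/` before relying on it.

```bash
#!/usr/bin/env bash
# Sketch: print an ArgoCD Application manifest for one app.
# Assumed (not taken from this repo): chart path, revisions, project name.
new_app_manifest() {
  local app="$1"
  cat <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ${app}
  namespace: argocd
spec:
  project: default
  sources:
    # Source 1: chart templates, rendered with values from source 2
    - repoURL: https://github.com/snothub/forte-helm
      targetRevision: main
      path: charts/generic-app
      helm:
        valueFiles:
          - \$values/${app}/values.yaml
    # Source 2: values repository, exposed to source 1 as \$values
    - repoURL: git@github.com:fortedigital/helm-values.git
      targetRevision: main
      ref: values
  destination:
    server: https://kubernetes.default.svc
    namespace: ${app}
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
EOF
}
```

For example, `new_app_manifest myapp > apps/myapp.yaml` writes the file that step 1 asks for; the matching `myapp/values.yaml` still has to exist in the values repository.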
### Update an Existing Application
**See detailed guide**: [Developer Guide - Updating an Existing Application](docs/DEVELOPER-GUIDE.md#updating-an-existing-application)
**Quick version**:
- **Update code**: Push to app repo → CI/CD updates image tag in helm-values
- **Update config**: Edit `helm-values/myapp/values.yaml` → commit → push
### Manage Secrets
**See detailed guide**: [Developer Guide - Working with Secrets](docs/DEVELOPER-GUIDE.md#working-with-secrets)
### Bootstrap the Cluster
```bash
# Create plain secret
kubectl create secret generic myapp-creds \
--from-literal=KEY=value \
--dry-run=client -o yaml > private/myapp-creds.yaml
# Seal it
kubeseal --format=yaml --cert=pub-cert.pem \
< private/myapp-creds.yaml > secrets/myapp-creds-sealed.yaml
# Commit sealed version
git add secrets/myapp-creds-sealed.yaml
git commit -m "Add myapp credentials"
git push
```
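
Because the unsealed file in `private/` must never reach Git, a small guard can help catch mistakes before a commit. The sketch below is a heuristic only: it flags files whose top-level `kind` is exactly `Secret` and assumes the unquoted form that `kubectl` and `kubeseal` emit. The function names are made up for this example.

```bash
#!/usr/bin/env bash
# Sketch: succeed (exit 0) if a YAML file declares a plain Kubernetes Secret.
# A sealed file has "kind: SealedSecret", so it does not match.
is_plain_secret() {
  grep -Eq '^kind:[[:space:]]*Secret[[:space:]]*$' "$1"
}

# Report every plain Secret among the given files; fail if any was found.
check_no_plain_secrets() {
  local found=0 f
  for f in "$@"; do
    if is_plain_secret "$f"; then
      echo "plain Secret found: $f" >&2
      found=1
    fi
  done
  return "$found"
}
```

Running `check_no_plain_secrets secrets/*.yaml cluster-resources/*.yaml` before `git add` is one way to use it; wiring it into a pre-commit hook is left to the reader.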
### Bootstrap Cluster
**See detailed guide**: [Operations Runbook - Cluster Bootstrap](docs/OPERATIONS-RUNBOOK.md#cluster-bootstrap)
```bash
# Initialize new cluster
./bootstrap.sh
```
This runs the `Bootstrap()` function which calls `ArgoCd()` to install ArgoCD using Helm.
### Monitor ArgoCD Applications
# Verify
kubectl get applications -n argocd
kubectl get pods --all-namespaces
```
---
## 🛠️ Quick Reference
### Monitor Applications
```bash
# View all ArgoCD applications
# List all ArgoCD applications
kubectl get applications -n argocd
# Watch sync status
kubectl get applications -n argocd -w
# Describe a specific application
kubectl describe app <app-name> -n argocd
# Check specific application
kubectl describe application myapp -n argocd
# View application logs
kubectl logs -n myapp <pod-name>
```
### Manage ArgoCD
### Access UIs
```bash
# Port forward to access UI
# ArgoCD UI
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Access: https://localhost:8080 (no auth required)
# Access at: https://localhost:8080 (admin auth disabled in dev)
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:80
# Access: http://localhost:3000
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Access: http://localhost:9090
```
### Check Secret Syncing
### Troubleshooting
```bash
# Verify Kyverno policy is applied
kubectl get clusterpolicy sync-secret-with-multi-clone
# Check pod status
kubectl get pods -n myapp
# Check if secrets are synced to a namespace
kubectl get secrets -n <namespace>
# View pod logs
kubectl logs -n myapp <pod-name>
# Check pod events
kubectl describe pod -n myapp <pod-name>
# Check ArgoCD sync errors
kubectl describe application myapp -n argocd
# Force a hard refresh (re-reads Git and re-renders manifests; auto-sync then applies any difference)
kubectl patch application myapp -n argocd \
--type merge -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
```
### Deploy Changes
- Changes to YAML files in `apps/`, `infra/`, `**/values/`, or `cluster-resources/` are automatically synced by ArgoCD
- Push changes to the GitHub repository for them to be reflected
- ArgoCD reconciliation happens every 60s (`timeout.reconciliation: 60s`)
- Each application has a 5-minute sync timeout to prevent stalled deployments
**Full troubleshooting guide**: [Developer Guide - Troubleshooting](docs/DEVELOPER-GUIDE.md#troubleshooting)
### Review Helm Values
Application-specific Helm value overrides are in `**/values/` and referenced within each Application's Helm configuration. Each application manifest uses both external value files and inline overrides where needed.
---
### Application Organization & Sync Ordering
- Infrastructure applications use `argocd.argoproj.io/sync-wave` annotations for ordered deployment
- Kyverno (sync-wave: 0) deploys before cluster-resources (sync-wave: 1) to ensure policies are ready
- All applications have resource requests and limits configured to prevent resource starvation
- Applications are labeled with `app.kubernetes.io/part-of` to indicate their component type (platform, monitoring-stack, application)
## 🔐 Security
## Important Notes
### Secret Management
- ✅ Sealed Secrets for Git storage
- ✅ Kyverno auto-clones secrets to namespaces
- ❌ Never commit plain secrets
- **No admin auth in development**: ArgoCD has `admin.enabled: "false"` - suitable for local/dev only
- **Insecure server mode**: `--insecure` and `--disable-auth` flags are set - not for production
- **Folder organization**:
- `infra/` contains infrastructure/platform components (Traefik, Cert-Manager, Prometheus, Grafana, Loki, etc.)
- `apps/` is reserved for business applications (currently empty)
- **Replica counts**: Traefik runs 2 replicas; other services run 1 replica
- **Retry policy**: All applications retry up to 5 times with exponential backoff (max 3m timeout per application)
- **Ignore replica scaling**: Deployments ignore replica count differences to allow HPA/manual scaling
- **Sync validation**: All applications validate manifests before applying (`Validate=true`)
- **Server-side apply**: All applications use `ServerSideApply=true` for safer field ownership tracking
### Network Security
- ✅ All traffic TLS-encrypted (Let's Encrypt)
- ✅ HTTP → HTTPS redirect
- ✅ Traefik IngressRoute per application
## Development Tips
### Policy Enforcement
- ✅ Kyverno policies for security
- ✅ Default namespace blocked
- ✅ Bare pods not allowed
- ✅ Optional authentication sidecar injection
- **Check ArgoCD logs**: `kubectl logs -n argocd deployment/argocd-application-controller`
- **Validate YAML**: Files are validated server-side (`Validate=true`) before applying
- **Resource tracking**: Uses annotation-based method (`application.resourceTrackingMethod: annotation`)
- **Modify applications**: Edit the corresponding YAML in `infra/` and push to trigger sync
- **Add new services**: Create a new Application YAML in `apps/`, following the pattern of the existing ones; the app-of-apps then discovers it automatically
- **Application folder naming**: Infrastructure components are in `infra/`; `apps/` is reserved for business applications
**Learn more**: [GitOps Architecture - Security Model](docs/GITOPS-ARCHITECTURE.md#security-model)
---
## 📊 Infrastructure Components
| Component | Purpose | Namespace | Replicas |
|-----------|---------|-----------|----------|
| **ArgoCD** | GitOps controller | `argocd` | 1 |
| **Traefik** | Ingress controller | `traefik` | 2 |
| **Cert-Manager** | TLS certificates | `cert-manager` | 1 |
| **Kyverno** | Policy engine | `kyverno` | 1 |
| **Sealed Secrets** | Secret encryption | `kube-system` | 1 |
| **Prometheus** | Metrics | `monitoring` | 1 |
| **Grafana** | Dashboards | `monitoring` | 1 |
| **Loki** | Logs | `monitoring` | 1 |
| **Fluent-Bit** | Log shipping | `monitoring` | DaemonSet |
| **Trivy** | Vulnerability scanning | `trivy-system` | 1 |
**Full specs**: [Technical Reference - Infrastructure Components](docs/REFERENCE.md#infrastructure-components)
---
## 🌐 Domains & Networking
- **Local development**: `*.127.0.0.1.nip.io`
- **Production**: `*.forteapps.net`
- **DNS**: Manual configuration (contact platform team)
- **TLS**: Automatic via Let's Encrypt
---
## 📖 Key Concepts
### App-of-Apps Pattern
`_app-of-apps.yaml` is the root Application that manages all other Applications in `infra/`. Each YAML in `infra/` becomes a child Application managed by ArgoCD. One of them, `infra/enterprise-apps.yaml`, in turn manages the Applications in `apps/`.
### Multi-Source Pattern
Applications reference both:
1. **Helm charts** from `forte-helm` (templates)
2. **Values** from `helm-values` (configuration)
This separates reusable templates from environment-specific config.
### Sync Waves
Applications deploy in order using `argocd.argoproj.io/sync-wave`:
- Wave `-1`: Namespaces
- Wave `0`: Kyverno (policies)
- Wave `1`: Infrastructure
- Wave `2+`: Applications
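
To see which wave each Application is actually assigned to, the annotation can be read back from the cluster. This is a minimal sketch, assuming `kubectl` access to the `argocd` namespace; the function name is hypothetical. Applications without the annotation print `<none>`, which ArgoCD treats as wave 0.

```bash
#!/usr/bin/env bash
# Sketch: list ArgoCD Applications with their sync-wave annotation, lowest wave first.
list_sync_waves() {
  kubectl get applications -n argocd --no-headers \
    -o custom-columns='WAVE:.metadata.annotations.argocd\.argoproj\.io/sync-wave,NAME:.metadata.name' \
    | sort -n
}
```

Call it as `list_sync_waves`; the output has one line per Application, wave first, name second.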
### Auto-Sync & Self-Heal
- **Auto-Sync**: ArgoCD automatically deploys Git changes (60s polling)
- **Self-Heal**: Manual cluster changes are reverted to match Git
- **Prune**: Deleted resources in Git are removed from cluster
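
Since a pushed change is only picked up on the next reconciliation, it can be handy to poll an Application until it settles. The helper below is a sketch under the assumption that `kubectl` can read Applications in the `argocd` namespace; the function name and its default of 30 attempts, 10 seconds apart, are illustrative choices.

```bash
#!/usr/bin/env bash
# Sketch: poll an Application until it reports Synced and Healthy, or give up.
# Usage: wait_for_app <name> [attempts] [seconds-between-attempts]
wait_for_app() {
  local app="$1" attempts="${2:-30}" delay="${3:-10}" state i
  for ((i = 1; i <= attempts; i++)); do
    state="$(kubectl get application "$app" -n argocd \
      -o jsonpath='{.status.sync.status}/{.status.health.status}')"
    if [ "$state" = "Synced/Healthy" ]; then
      echo "$app is $state"
      return 0
    fi
    echo "attempt $i/$attempts: $app is ${state:-unknown}" >&2
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_app myapp` after a `git push` returns 0 once the Application is both `Synced` and `Healthy`, and 1 if it never gets there within the allotted attempts.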
**Learn more**: [GitOps Architecture - GitOps Workflow](docs/GITOPS-ARCHITECTURE.md#gitops-workflow)
---
## ⚙️ Configuration
### ArgoCD Settings
- **Reconciliation**: Every 60 seconds
- **Sync timeout**: 5 minutes per application
- **Retry policy**: 5 attempts with exponential backoff
- **Authentication**: Disabled (internal use only)
### Application Defaults
- **Auto-sync**: Enabled
- **Self-heal**: Enabled
- **Prune**: Enabled
- **Validation**: Server-side validation enabled
- **Server-side apply**: Enabled
**Full configuration**: [Technical Reference - ArgoCD Configuration](docs/REFERENCE.md#argocd-configuration)
---
## 🆘 Getting Help
### Documentation
1. **Start here**: [Documentation Index](docs/README.md)
2. **For development**: [Developer Guide](docs/DEVELOPER-GUIDE.md)
3. **For operations**: [Operations Runbook](docs/OPERATIONS-RUNBOOK.md)
4. **For reference**: [Technical Reference](docs/REFERENCE.md)
### Support
- **Slack**: #platform-support
- **Issues**: Contact platform team
- **Emergencies**: Escalate via Slack
### Common Questions
| Question | Answer |
|----------|--------|
| How do I deploy an app? | [Developer Guide - Deploying Your First Application](docs/DEVELOPER-GUIDE.md#deploying-your-first-application) |
| How do I manage secrets? | [Developer Guide - Working with Secrets](docs/DEVELOPER-GUIDE.md#working-with-secrets) |
| App won't sync? | [Developer Guide - Troubleshooting](docs/DEVELOPER-GUIDE.md#troubleshooting) |
| How do I bootstrap a cluster? | [Operations Runbook - Cluster Bootstrap](docs/OPERATIONS-RUNBOOK.md#cluster-bootstrap) |
| Where are the logs? | [Operations Runbook - Monitoring & Alerting](docs/OPERATIONS-RUNBOOK.md#monitoring--alerting) |
---
## 🤝 Contributing
### Adding a New Application
1. Read [Developer Guide - Deploying Your First Application](docs/DEVELOPER-GUIDE.md#deploying-your-first-application)
2. Create ArgoCD Application manifest in `apps/`
3. Create Helm values in `helm-values/`
4. Create sealed secrets if needed
5. Commit and push - ArgoCD handles the rest!
### Modifying Infrastructure
1. Read [Operations Runbook](docs/OPERATIONS-RUNBOOK.md)
2. Update relevant files in `infra/` or `cluster-resources/`
3. Test changes in isolated namespace if possible
4. Commit and push
5. Monitor sync status in Slack/ArgoCD UI
### Updating Documentation
Documentation lives in `docs/`. To update:
1. Edit relevant markdown file
2. Update "Last Updated" date
3. Submit PR or push directly
4. Notify team of significant changes
---
## 📝 Notes
### Current Environment
- **Provider**: UpCloud Managed Kubernetes
- **Environment**: Production (internal use only)
- **Cluster**: Single cluster
- **Auth**: Disabled for ArgoCD (internal access)
- **Backup**: None (cluster rebuildable via GitOps)
### Known Limitations
- No automated backups (yet)
- Secret rotation not automated
- Single cluster (no multi-cluster setup)
- DNS management is manual
**Future improvements**: See [Operations Runbook - Disaster Recovery](docs/OPERATIONS-RUNBOOK.md#disaster-recovery)
---
## 📚 Additional Resources
### External Documentation
- [ArgoCD Documentation](https://argo-cd.readthedocs.io/)
- [Kyverno Documentation](https://kyverno.io/docs/)
- [Traefik Documentation](https://doc.traefik.io/traefik/)
- [Cert-Manager Documentation](https://cert-manager.io/docs/)
- [Sealed Secrets](https://github.com/bitnami-labs/sealed-secrets)
### Related Repositories
- [forte-helm](https://github.com/snothub/forte-helm) - Helm chart templates
- helm-values (`git@github.com:fortedigital/helm-values.git`) - Application values
---
## 📄 License
Internal use only. Not for public distribution.
---
## 👥 Maintainers
**Platform Team**
- Contact: #platform-support on Slack
- Issues: Create issue in repository or contact team directly
---
**Last Updated**: 2026-03-16
**Documentation Version**: 1.0.0
**🚀 Ready to get started? Check out the [Documentation Index](docs/README.md)!**