diff --git a/devbox.json b/devbox.json
index 0f146ce..ba7321a 100644
--- a/devbox.json
+++ b/devbox.json
@@ -19,7 +19,9 @@
     "dotnet-sdk@latest",
     "opentofu@1.11.6",
     "_1password@latest",
-    "github-cli@latest"
+    "github-cli@latest",
+    "upcloud-cli@3.29.0",
+    "awscli2@2.34.24"
   ],
   "shell": {
     "init_hook": [
diff --git a/docs/OPERATIONS-RUNBOOK.md b/docs/OPERATIONS-RUNBOOK.md
index 586a806..35f2e83 100644
--- a/docs/OPERATIONS-RUNBOOK.md
+++ b/docs/OPERATIONS-RUNBOOK.md
@@ -2,6 +2,12 @@
 
 ## Table of Contents
 - [Overview](#overview)
+- [Infrastructure Provisioning (OpenTofu)](#infrastructure-provisioning-opentofu)
+  - [Prerequisites](#provisioning-prerequisites)
+  - [Provisioning a Cluster](#provisioning-a-cluster)
+  - [Tearing Down a Cluster](#tearing-down-a-cluster)
+  - [Retrieving Kubeconfig](#retrieving-kubeconfig)
+  - [Platform Credentials](#platform-credentials)
 - [Cluster Bootstrap](#cluster-bootstrap)
   - [Initial Cluster Setup](#initial-cluster-setup)
   - [ArgoCD Repository Access Setup](#argocd-repository-access-setup)
@@ -29,6 +35,120 @@ This runbook provides operational procedures for maintaining the Kubernetes clus
 
 ---
 
+## Infrastructure Provisioning (OpenTofu)
+
+The `.tofu/` directory contains multi-cloud Kubernetes infrastructure-as-code using [OpenTofu](https://opentofu.org/). It provisions clusters on four cloud platforms (AKS, EKS, GKE, UpCloud), each with three environment tiers: **dev**, **prod**, and **workload**.
+
+### Provisioning Prerequisites {#provisioning-prerequisites}
+
+- **OpenTofu** (`tofu`) installed
+- **kubectl** installed
+- **helm** installed
+- **yq** (optional — loads cluster config from `clusters/.yaml`)
+- Platform CLI tools:
+  - **AKS**: `az` (Azure CLI)
+  - **EKS**: `aws` (AWS CLI)
+  - **GKE**: `gcloud` (Google Cloud SDK)
+  - **UPC**: `upctl` (UpCloud CLI)
+
+### Provisioning a Cluster
+
+```bash
+# Navigate to the scripts directory
+cd .tofu/scripts
+
+# 1. Copy and fill in credentials for your platform
+cp ../configs/aks.env.example ../configs/aks.env
+# Edit ../configs/aks.env with your credentials
+
+# 2. Provision cluster (interactive — prompts before applying)
+./setup-cluster.sh aks-dev
+
+# 3. Dry-run only (plan without applying)
+./setup-cluster.sh aks-dev --plan
+
+# 4. Non-interactive (skip confirmations)
+./setup-cluster.sh aks-dev --auto
+```
+
+**Cluster name format**: `-` — e.g., `aks-dev`, `eks-prod`, `gke-workload`, `upc-dev`
+
+**What `setup-cluster.sh` does**:
+1. Validates cluster name, extracts platform and environment
+2. Checks prerequisites (tofu, kubectl, helm)
+3. Loads credentials from `configs/.env`
+4. Optionally loads cluster config from `clusters/.yaml` (via yq)
+5. Runs `tofu init` → `tofu plan` → prompts → `tofu apply`
+6. Fetches and caches kubeconfig to `private//kubeconfig`
+7. Waits for all nodes to reach Ready state (300s timeout)
+8. Outputs next steps: `export KUBECONFIG` + `./bootstrap.sh`
+
+### Tearing Down a Cluster
+
+```bash
+# Destroy cluster infrastructure
+./teardown-cluster.sh aks-dev
+
+# Equivalent to:
+./setup-cluster.sh aks-dev --destroy
+```
+
+### Retrieving Kubeconfig
+
+```bash
+# Get kubeconfig for an existing cluster (uses cache or platform CLI)
+./get-kubeconfig.sh aks-dev
+
+# Cached kubeconfigs stored in: private//kubeconfig
+```
+
+Platform-specific retrieval fallbacks:
+- **AKS**: `az aks get-credentials`
+- **EKS**: `aws eks update-kubeconfig`
+- **GKE**: `gcloud container clusters get-credentials`
+- **UPC**: `upctl kubernetes config`
+
+### Platform Credentials
+
+Each platform has a `configs/.env.example` template. Copy to `.env` and populate:
+
+| Platform | Required Variables | Optional |
+|----------|--------------------|----------|
+| **AKS** | `AZURE_TENANT_ID`, `AZURE_SUBSCRIPTION_ID` | `ARM_RESOURCE_GROUP` (defaults to cluster name) |
+| **EKS** | `AWS_PROFILE` (default: "default"), `AWS_REGION` (default: "eu-west-1") | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` |
+| **GKE** | `GCP_PROJECT_ID`, `GCP_REGION` (default: "europe-west4") | `GOOGLE_APPLICATION_CREDENTIALS` (SA JSON path) |
+| **UPC** | `UPCLOUD_TOKEN` | `UPCLOUD_CLUSTER_ID` (set after creation) |
+
+> **Note**: `.env` files are git-ignored. Never commit credentials.
+
+### End-to-End Workflow
+
+Full cluster lifecycle: provision → bootstrap → operate → teardown:
+
+```bash
+# 1. Provision infrastructure
+cd .tofu/scripts
+./setup-cluster.sh aks-dev
+
+# 2. Export kubeconfig (printed by setup-cluster.sh)
+export KUBECONFIG=$(pwd)/../../private/aks-dev/kubeconfig
+
+# 3. Bootstrap GitOps (ArgoCD + App-of-Apps)
+cd ../..
+./bootstrap.sh aks-dev
+
+# 4. Verify
+kubectl get applications -n argocd
+
+# ... operate cluster ...
+
+# 5. Teardown when done
+cd .tofu/scripts
+./teardown-cluster.sh aks-dev
+```
+
+---
+
 ## Cluster Bootstrap
 
 ### Initial Cluster Setup
@@ -37,7 +157,7 @@ Bootstrap a new cluster from scratch:
 
 #### Prerequisites
 
-1. **Kubernetes cluster running** (UpCloud, AWS EKS, Azure AKS, GCP GKE, or any K8s cluster)
+1. **Kubernetes cluster running** (provisioned via `.tofu/scripts/setup-cluster.sh` or manually on UpCloud, AWS EKS, Azure AKS, GCP GKE)
 2. **kubectl configured** with admin access
 3. **Repositories cloned** locally
 
@@ -1286,14 +1406,17 @@ spec:
 
 ```bash
 # 1. Provision new Kubernetes cluster
+cd .tofu/scripts
+./setup-cluster.sh upc-dev # or aks-dev, eks-prod, etc.
+export KUBECONFIG=$(pwd)/../../private/upc-dev/kubeconfig
 
-# 2. Configure kubectl
-kubectl config use-context new-cluster
+# 2. Verify cluster access
 kubectl cluster-info
+kubectl get nodes
 
 # 3. Bootstrap cluster
-cd ~/dev/k8s/launchpad
-./bootstrap.sh
+cd ../..
+./bootstrap.sh upc-dev
 
 # 4. Wait for ArgoCD to sync all applications
 kubectl get applications -n argocd -w
diff --git a/docs/REFERENCE.md b/docs/REFERENCE.md
index 7bd9e3d..04750af 100644
--- a/docs/REFERENCE.md
+++ b/docs/REFERENCE.md
@@ -3,6 +3,7 @@
 ## Table of Contents
 - [Architecture Components](#architecture-components)
 - [Repository Reference](#repository-reference)
+- [OpenTofu Infrastructure Reference](#opentofu-infrastructure-reference)
 - [Helm Chart Reference](#helm-chart-reference)
 - [ArgoCD Configuration](#argocd-configuration)
 - [Infrastructure Components](#infrastructure-components)
@@ -207,6 +208,196 @@ launchpad/
 └── REFERENCE.md
 ```
 
+---
+
+## OpenTofu Infrastructure Reference
+
+The `.tofu/` directory provides multi-cloud Kubernetes cluster provisioning using OpenTofu.
+
+### Directory Structure
+
+```
+.tofu/
+├── configs/ # Platform credential templates (git-ignored .env files)
+│   ├── aks.env.example
+│   ├── eks.env.example
+│   ├── gke.env.example
+│   └── upc.env.example
+├── platforms/ # OpenTofu modules per cloud provider
+│   ├── aks/ # Azure AKS
+│   │   ├── modules/cluster/ # Reusable AKS module
+│   │   │   ├── main.tf # Resource group, VNet, subnet, AKS cluster
+│   │   │   ├── variables.tf
+│   │   │   ├── outputs.tf
+│   │   │   └── providers.tf
+│   │   ├── dev/ # Dev environment root
+│   │   ├── prod/ # Prod environment root
+│   │   └── workload/ # Workload cluster (+ external-dns identity)
+│   ├── eks/ # AWS EKS (same structure)
+│   ├── gke/ # GCP GKE
+│   └── upc/ # UpCloud Kubernetes
+└── scripts/
+    ├── setup-cluster.sh # Provision cluster
+    ├── teardown-cluster.sh # Destroy cluster
+    └── get-kubeconfig.sh # Retrieve/cache kubeconfig
+```
+
+### Three-Tier Cluster Strategy
+
+Each platform defines three environment tiers:
+
+| Tier | Purpose | Typical Sizing | Notes |
+|------|---------|---------------|-------|
+| **dev** | Development/testing | Small, economical nodes (2 nodes) | No delete locks, minimal HA |
+| **prod** | Production workloads | Larger nodes, multiple AZs (3 nodes) | Delete locks, HA networking |
+| **workload** | Application-only cluster | Medium nodes (2 nodes) | Includes external-DNS integration, no platform services |
+
+### Platform Specifications
+
+#### AKS (Azure Kubernetes Service)
+
+| Resource | Description |
+|----------|-------------|
+| `azurerm_resource_group` | Container for all Azure resources |
+| `azurerm_management_lock` | Optional CanNotDelete lock (prod) |
+| `azurerm_virtual_network` | VPC, default `10.100.0.0/16` |
+| `azurerm_subnet` | Node subnet, default `10.100.0.0/22` |
+| `azurerm_kubernetes_cluster` | AKS with Azure CNI, OIDC issuer, Workload Identity |
+
+**Dev**: Standard_B2s, 2 nodes, norwayeast, no delete lock
+**Prod**: Standard_D4s_v3, 3 nodes, westeurope, delete lock enabled
+**Workload**: Adds `azurerm_user_assigned_identity` + federated credential for external-dns with DNS Zone Contributor role
+
+**Variables** (`modules/cluster/variables.tf`):
+- `prefix` — resource name prefix
+- `location` — Azure region
+- `vnet_address_space` — default `10.100.0.0/16`
+- `aks_subnet_cidr` — default `10.100.0.0/22`
+- `aks_node_vm_size` — VM size (e.g., `Standard_B2s`)
+- `aks_node_count` — number of nodes
+- `aks_kubernetes_version` — `null` = latest
+- `enable_delete_lock` — default `false`
+
+#### EKS (Amazon Elastic Kubernetes Service)
+
+| Resource | Description |
+|----------|-------------|
+| `aws_vpc` | VPC with DNS enabled, default `10.100.0.0/16` |
+| `aws_subnet` (public) | Per-AZ, tagged `kubernetes.io/role/elb=1` |
+| `aws_subnet` (private) | Per-AZ, tagged `kubernetes.io/role/internal-elb=1` |
+| `aws_nat_gateway` | Single NAT (dev); prod should use one per AZ |
+| `aws_eks_cluster` | EKS with public+private endpoints, OIDC issuer |
+| `aws_iam_openid_connect_provider` | IRSA (IAM Roles for Service Accounts) |
+| `aws_eks_node_group` | Managed nodes with auto-scaling |
+
+**Dev**: t3.medium, 2 nodes (min 1, max 4), eu-west-1a/b, K8s 1.30
+**Prod**: m5.xlarge, 3 nodes (min 3, max 6), eu-west-1a/b/c
+**Workload**: Adds IRSA role for external-dns with Route53 permissions (ChangeResourceRecordSets, ListHostedZones, ListResourceRecordSets, ListTagsForResource)
+
+**Variables**:
+- `region` — AWS region
+- `vpc_cidr` — default `10.100.0.0/16`
+- `availability_zones` — list of AZs (2–3 recommended)
+- `node_instance_type`, `node_count`, `node_min_count`, `node_max_count`
+- `kubernetes_version` — default `1.30`
+
+#### GKE (Google Kubernetes Engine)
+
+| Resource | Description |
+|----------|-------------|
+| `google_project_service` | Enables compute and container APIs |
+| `google_compute_network` | Custom VPC (no auto subnets) |
+| `google_compute_subnetwork` | Primary `10.100.0.0/22`, pods `10.200.0.0/14`, services `10.204.0.0/20` |
+| `google_container_cluster` | Regional cluster, VPC-native, Workload Identity |
+| `google_container_node_pool` | Auto-repair, auto-upgrade, GKE_METADATA mode |
+
+**Dev**: e2-standard-2, 2 nodes/zone, no deletion protection
+**Prod**: e2-standard-4, 3 nodes/zone, deletion protection enabled
+**Workload**: Adds Google SA for external-dns with `dns.admin` role + Workload Identity binding
+
+**Variables**:
+- `project_id` — GCP project (required)
+- `region` — GCP region
+- `node_machine_type`, `node_count`
+- `kubernetes_version` — `null` = STABLE release channel
+- `deletion_protection` — default `false`
+
+#### UPC (UpCloud Kubernetes)
+
+| Resource | Description |
+|----------|-------------|
+| `upcloud_router` | Private router for cluster network |
+| `upcloud_gateway` | NAT gateway for outbound internet |
+| `upcloud_network` | Private network, DHCP, default `10.100.0.0/24` |
+| `upcloud_kubernetes_cluster` | Managed K8s, private node groups |
+| `upcloud_kubernetes_node_group` | Anti-affinity if node_count > 1 |
+
+**Dev**: DEV-1xCPU-2GB, 2 nodes, no-svg1
+**Prod**: 4xCPU-8GB, 3 nodes, de-fra1
+**Workload**: 2xCPU-4GB, 2 nodes, fi-hel1, CIDR `10.110.0.0/24`
+
+> **Note**: UpCloud has no native workload identity — external-DNS integration not available.
+
+### Workload Identity & External-DNS
+
+Workload clusters include keyless cloud access for external-DNS:
+
+| Platform | Identity Mechanism | DNS Permissions |
+|----------|--------------------|-----------------|
+| **AKS** | Azure Workload Identity (federated credential) | DNS Zone Contributor |
+| **EKS** | IRSA (OIDC federation) | Route53 ChangeResourceRecordSets, ListHostedZones |
+| **GKE** | Workload Identity (K8s SA → Google SA) | dns.admin role |
+| **UPC** | N/A | N/A |
+
+### Naming Conventions
+
+- Cluster: `-aks` / `-eks` / `-gke` (derived from platform)
+- Resource groups: `-rg` (Azure only)
+- VPCs/Networks: `-vpc`
+- Node groups: `-nodes`
+- Dev prefix: `clst-dev`, Prod prefix: `clst`, Workload prefix: `clst-workload`
+
+### Provider Authentication
+
+| Platform | Auth Method | Config Source |
+|----------|-------------|---------------|
+| **AKS** | Azure CLI or env vars (`ARM_SUBSCRIPTION_ID`, `ARM_TENANT_ID`) | `configs/aks.env` |
+| **EKS** | AWS CLI profile or explicit credentials | `configs/eks.env` |
+| **GKE** | Application Default Credentials or SA JSON | `configs/gke.env` |
+| **UPC** | API token (`UPCLOUD_TOKEN`) | `configs/upc.env` |
+
+### Scripts Reference
+
+#### `setup-cluster.sh`
+
+```bash
+./setup-cluster.sh - [--plan] [--destroy] [--auto]
+```
+
+| Flag | Effect |
+|------|--------|
+| (none) | Interactive: plan → prompt → apply |
+| `--plan` | Dry-run only (tofu plan) |
+| `--destroy` | Destroy infrastructure |
+| `--auto` | Skip confirmation prompts |
+
+#### `teardown-cluster.sh`
+
+```bash
+./teardown-cluster.sh -
+# Delegates to: setup-cluster.sh "$@" --destroy
+```
+
+#### `get-kubeconfig.sh`
+
+```bash
+./get-kubeconfig.sh -
+# Checks cache: private//kubeconfig
+# Falls back to platform CLI if no cache
+```
+
+---
+
 #### Key Files
 
 **`bootstrap.sh`**