# ArgoCD Applications Comprehensive Analysis Report ## Overview Analyzed 11 ArgoCD Application manifests in `/argocd/apps/`. This report details current configurations, risks, best practice violations, security concerns, and operational improvements. --- ## Critical Issues Summary ### 1. Hardcoded Secrets (CRITICAL) **Files:** grafana.yaml - **grafana.yaml:** Admin password "forte" in plaintext - **Impact:** Credentials exposed in Git history forever - **Fix:** Migrate to Sealed Secrets immediately ### 2. Floating Versions (CRITICAL) **Files:** cluster-resources-application.yaml - Using `HEAD` instead of tagged versions - No audit trail of deployments - Unpredictable application behavior - **Fix:** Pin to specific git tags or commit SHAs ### 3. Placeholder URLs (HIGH) **Files:** fluent-bit.yaml, grafana.yaml - Second source still has `https://github.com/YOUR_ORG/YOUR_GITOPS_REPO.git` - Applications fail to deploy - **Fix:** Update to actual repository URL ### 4. Undersized Resources (HIGH) **Files:** cert-manager, loki, prometheus, trivy - cert-manager: 100m CPU limit (too tight for control plane) - loki: 200m CPU, 512Mi memory (drops logs under load) - fluent-bit: 100m CPU for all-node log collection - **Impact:** Performance degradation, OOM kills, dropped logs - **Fix:** Increase resource limits across all monitoring stack ### 5. No Data Persistence (HIGH) **Files:** loki.yaml (filesystem storage), prometheus.yaml - Loki using filesystem storage (ephemeral, lost on restart) - Prometheus likely ephemeral (no PVC visible) - No backup strategy - **Fix:** Configure persistent volumes with cloud storage --- ## Application-by-Application Summary | Application | Issues | Priority | Key Recommendation | |-------------|--------|----------|---------------------| | **cert-manager** | Undersized (100m), single replica, tight webhook timeout | HIGH | Increase CPU to 500m, add replicas (2-3), longer timeout | | **cluster-resources** | Floating HEAD, RBAC missing | MEDIUM | Pin version, restrict with AppProject | | **fluent-bit** | Placeholder URL, tight CPU (100m), HTTP server wide open | HIGH | Update repo URL, 200m CPU, restrict HTTP to localhost | | **grafana** | Hardcoded password, placeholder URL, no persistence | CRITICAL | Sealed Secrets, update URL, add PVC | | **kyverno** | No policies configured, no resources, no failures policies | MEDIUM | Add security policies, define resource limits | | **loki** | Filesystem storage, no auth, single binary, tight resources | CRITICAL | S3/GCS storage, enable auth, distributed mode | | **prometheus** | No alertmanager, service port 80, no persistence, no ingress | HIGH | Enable alertmanager, port 9090, add PVC, secure ingress | | **sealed-secrets** | No backup procedure, single replica, no resources | MEDIUM | Document key backup, add PDB, increase replicas | | **traefik** | TLS incomplete, LoadBalancer cloud-specific, no resources | MEDIUM | Complete TLS config, add cert-manager integration, resources | | **trivy** | Alpha version (v0.0.7), ignoreUnfixed hides vulns, no resources | MEDIUM | Upgrade to stable (v0.3+), show all vulns, resources | --- ## Cross-Cutting Issues ### RBAC & Security (Critical) - All apps use default project (no boundaries) - No explicit AppProject configuration - Cluster resources not restricted - **Fix:** Create AppProject with granular permissions ### No Network Policies (All Namespaces) - Unlimited pod-to-pod communication - Monitoring stack accessible from all pods - **Fix:** Implement NetworkPolicy for each namespace ### No Pod Disruption Budgets - No HA guarantees during cluster operations - Critical services can be evicted/disrupted - **Fix:** Add PDB minAvailable: 1 for critical apps ### Incomplete TLS Configuration - Prometheus on HTTP port 80 - Traefik TLS uses defaults (unclear) - Fluent-bit to Loki unencrypted - **Fix:** Implement TLS end-to-end with cert-manager ### Missing Resource Requests - Prometheus, Traefik, Kyverno undefined - Scheduler can overallocate resources - **Fix:** Add requests/limits to all remaining apps --- ## Priority Remediation Roadmap ### Phase 1: CRITICAL (Immediate) - [ ] Migrate Grafana admin password to Sealed Secrets - [ ] Update placeholder repository URLs - [ ] Pin floating versions (HEAD → git tags) ### Phase 2: URGENT (Week 1-2) - [ ] Configure persistent storage for Loki - [ ] Configure persistent storage for Prometheus - [ ] Enable Prometheus Alertmanager - [ ] Increase resource limits for all apps ### Phase 3: IMPORTANT (Week 2-3) - [ ] Implement NetworkPolicies - [ ] Create AppProject with RBAC - [ ] Add PodDisruptionBudgets - [ ] Configure Kyverno security policies ### Phase 4: ENHANCEMENT (Week 3-4) - [ ] Complete TLS configuration - [ ] Implement cert-manager integration - [ ] Setup backup strategies - [ ] Add comprehensive monitoring --- ## Detailed Issues by Category ### Resource Configuration - **cert-manager:** 50m req, 100m limit (INCREASE to 250m/500m) - **prometheus:** 250m req, 500m limit (ADEQUATE, but add to values) - **grafana:** 100m req, 200m limit (INCREASE to 200m/400m) - **loki:** 100m req, 200m limit (INCREASE to 200m/500m for distributed) - **fluent-bit:** 50m req, 100m limit (INCREASE to 100m/200m) - **traefik:** Not specified (INCREASE to 250m/500m, 256Mi/512Mi) - **kyverno:** Not specified (ADD 100m/200m, 128Mi/256Mi) - **trivy:** Not specified (ADD 250m/500m, 256Mi/512Mi) - **sealedsecrets:** Not specified (ADD 100m/200m, 128Mi/256Mi) ### Storage & Persistence - **loki:** Filesystem (CRITICAL - switch to S3/GCS) - **prometheus:** Implicit ephemeral (ADD PVC 20-30GB) - **grafana:** No persistence specified (QUESTIONABLE - OK for dashboards if imported) - **sealed-secrets:** Key backup not documented (ADD backup procedure) ### High Availability - **cert-manager:** replicaCount: 1 (INCREASE to 2-3) - **sealed-secrets:** Implicit single replica (INCREASE to 2-3) - **traefik:** Replicas: 2 (ADEQUATE, but add PDB) - **monitoring stack:** Single instances (CONSIDER distributed) ### Security Gaps - **Secrets in Git:** Grafana - **No Authentication:** Loki (auth_enabled: false), Prometheus (open HTTP) - **Wide Permissions:** kubectl RBAC not restricted (ADD ClusterRole) - **No Network Policies:** All apps (ADD NetworkPolicy) - **TLS Incomplete:** Prometheus HTTP 80, Traefik TLS {}, Fluent→Loki HTTP --- ## Key Statistics | Metric | Count | |--------|-------| | Total Applications Analyzed | 11 | | Critical Issues | 5 | | High Priority Issues | 12 | | Medium Priority Issues | 20+ | | Best Practice Violations | 30+ | | Security Concerns | 25+ | | Apps Missing Resource Requests | 4 | | Apps Missing Resource Limits | 3 | | Apps Using Floating Versions | 2 | | Apps with Hardcoded Secrets | 2 | | Apps Requiring Persistence | 3 | | Apps with Single Replica Critical Services | 4 | --- ## Implementation Guidance ### Sealed Secrets Setup ```bash # Install sealed-secrets controller kubectl apply -f ./argocd/apps/sealedsecrets.yaml # Seal grafana password echo -n "new-secure-password" | kubectl create secret generic grafana-admin \ --dry-run=client --from-file=password=/dev/stdin -o yaml | \ kubeseal -f - > grafana-sealed-secret.yaml # Update application manifests to reference sealed secrets ``` ### Persistent Volume for Loki ```yaml # Add to loki values persistence: enabled: true storageClassName: "fast" size: 50Gi accessModes: - ReadWriteOnce ``` ### AppProject for RBAC ```yaml apiVersion: argoproj.io/v1alpha1 kind: AppProject metadata: name: platform spec: destinations: - namespace: '*' server: 'https://kubernetes.default.svc' sourceRepos: - 'https://github.com/snothub/*' roles: - name: admin policies: - p, proj:platform:admin, applications, *, */*, allow ``` ### NetworkPolicy for Monitoring ```yaml apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: monitoring-access namespace: monitoring spec: podSelector: matchLabels: app: prometheus policyTypes: - Ingress ingress: - from: - podSelector: matchLabels: app: grafana ports: - protocol: TCP port: 9090 ``` --- ## Next Steps 1. **Review this analysis** with your team 2. **Create tickets** for each critical/high issue 3. **Schedule remediation** according to roadmap 4. **Document changes** as they're made 5. **Test thoroughly** in dev/staging first 6. **Monitor impact** after production changes