launchpad/ARGOCD_COMPREHENSIVE_ANALYSIS.md
Danijel Simeunovic bec3b6310a reorg
2026-02-08 10:42:10 +01:00


# ArgoCD Applications Comprehensive Analysis Report
## Overview
Analyzed 11 ArgoCD Application manifests in `/argocd/apps/`. This report details current configurations, risks, best-practice violations, security concerns, and operational improvements.

---
## Critical Issues Summary
### 1. Hardcoded Secrets (CRITICAL)
**Files:** grafana.yaml
- **grafana.yaml:** Admin password "forte" in plaintext
- **Impact:** Credentials exposed in Git history forever
- **Fix:** Migrate to Sealed Secrets immediately
### 2. Floating Versions (CRITICAL)
**Files:** cluster-resources-application.yaml
- Using `HEAD` instead of tagged versions
- No audit trail of deployments
- Unpredictable application behavior
- **Fix:** Pin to specific git tags or commit SHAs
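
As an illustration of the fix, a pinned source in an ArgoCD Application could look like the sketch below. The repository URL, path, tag, and destination namespace are placeholders and are not taken from the analyzed manifests; the only point is that `targetRevision` names an immutable tag or commit SHA instead of `HEAD`.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-resources
  namespace: argocd                 # assumes ArgoCD runs in the argocd namespace
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/example-repo.git  # placeholder URL
    path: cluster-resources         # hypothetical path
    targetRevision: v1.0.0          # hypothetical tag; a full commit SHA also works
  destination:
    server: https://kubernetes.default.svc
    namespace: default              # placeholder namespace
```

With a tag or SHA in place, every deployment change becomes a reviewable Git commit that bumps `targetRevision`, which restores the audit trail.
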
### 3. Undersized Resources (HIGH)
**Files:** cert-manager, loki, fluent-bit
- cert-manager: 100m CPU limit (too tight for control plane)
- loki: 200m CPU, 512Mi memory (drops logs under load)
- fluent-bit: 100m CPU for all-node log collection
- **Impact:** Performance degradation, OOM kills, dropped logs
- **Fix:** Increase resource limits across the monitoring stack and cert-manager
### 4. No Data Persistence (HIGH)
**Files:** loki.yaml (filesystem storage), prometheus.yaml
- Loki using filesystem storage (ephemeral, lost on restart)
- Prometheus likely ephemeral (no PVC visible)
- No backup strategy
- **Fix:** Configure persistent volumes with cloud storage
---
## Application-by-Application Summary
| Application | Issues | Priority | Key Recommendation |
|-------------|--------|----------|---------------------|
| **cert-manager** | Undersized (100m), single replica, tight webhook timeout | HIGH | Increase CPU to 500m, add replicas (2-3), longer timeout |
| **cluster-resources** | Floating HEAD, RBAC missing | MEDIUM | Pin version, restrict with AppProject |
| **fluent-bit** | Placeholder URL, tight CPU (100m), HTTP server wide open | HIGH | Update repo URL, 200m CPU, restrict HTTP to localhost |
| **grafana** | Hardcoded password, placeholder URL, no persistence | CRITICAL | Sealed Secrets, update URL, add PVC |
| **kyverno** | No policies configured, no resources, no failure policies | MEDIUM | Add security policies, define resource limits |
| **loki** | Filesystem storage, no auth, single binary, tight resources | CRITICAL | S3/GCS storage, enable auth, distributed mode |
| **prometheus** | No alertmanager, service port 80, no persistence, no ingress | HIGH | Enable alertmanager, port 9090, add PVC, secure ingress |
| **sealed-secrets** | No backup procedure, single replica, no resources | MEDIUM | Document key backup, add PDB, increase replicas |
| **traefik** | TLS incomplete, LoadBalancer cloud-specific, no resources | MEDIUM | Complete TLS config, add cert-manager integration, resources |
| **trivy** | Alpha version (v0.0.7), ignoreUnfixed hides vulns, no resources | MEDIUM | Upgrade to stable (v0.3+), show all vulns, resources |
---
## Cross-Cutting Issues
### RBAC & Security (Critical)
- All apps use default project (no boundaries)
- No explicit AppProject configuration
- Cluster resources not restricted
- **Fix:** Create AppProject with granular permissions
### No Network Policies (All Namespaces)
- Unlimited pod-to-pod communication
- Monitoring stack accessible from all pods
- **Fix:** Implement NetworkPolicy for each namespace
### No Pod Disruption Budgets
- No HA guarantees during cluster operations
- Critical services can be evicted/disrupted
- **Fix:** Add PDB minAvailable: 1 for critical apps
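
A minimal sketch of such a budget, using Traefik as the example because it already runs two replicas. The namespace and the selector label are assumptions and must match the labels the chart actually applies to its pods.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: traefik
  namespace: traefik                      # assumed namespace
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: traefik     # assumed label; verify against the running pods
```

Note that `minAvailable: 1` on a single-replica workload blocks voluntary evictions entirely, so node drains will stall until the replica count is raised.
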
### Incomplete TLS Configuration
- Prometheus on HTTP port 80
- Traefik TLS is left at defaults (empty `tls: {}` block), so the effective settings are unclear
- Fluent-bit to Loki unencrypted
- **Fix:** Implement TLS end-to-end with cert-manager
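
As one possible starting point, cert-manager can issue a certificate into a Secret that an ingress or the workload then mounts. The sketch below assumes a `ClusterIssuer` already exists; its name, the hostname, and the namespace are hypothetical.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: prometheus-tls
  namespace: monitoring            # assumed namespace
spec:
  secretName: prometheus-tls       # Secret that cert-manager creates and renews
  dnsNames:
    - prometheus.example.com       # placeholder hostname
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-prod         # hypothetical issuer name
```
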
### Missing Resource Requests
- Traefik, Kyverno, Trivy, and sealed-secrets define no requests or limits
- Scheduler can overallocate resources
- **Fix:** Add requests/limits to all remaining apps
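
Since Kyverno is already deployed without policies, one way to surface workloads that lack requests and limits is a validation policy in audit mode. This is a sketch modeled on the common "require requests and limits" pattern, not a policy found in the analyzed manifests; the policy name is hypothetical.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits    # hypothetical name
spec:
  # Audit only reports violations; newer Kyverno releases prefer a per-rule
  # failure action, so check the installed version before applying.
  validationFailureAction: Audit
  background: true
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and a memory limit are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    memory: "?*"
```

Starting in audit mode avoids blocking existing workloads while the remaining apps are being fixed.
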
---
## Priority Remediation Roadmap
### Phase 1: CRITICAL (Immediate)
- [ ] Migrate Grafana admin password to Sealed Secrets
- [ ] Update placeholder repository URLs
- [ ] Pin floating versions (HEAD → git tags)
### Phase 2: URGENT (Week 1-2)
- [ ] Configure persistent storage for Loki
- [ ] Configure persistent storage for Prometheus
- [ ] Enable Prometheus Alertmanager
- [ ] Increase resource limits for all apps
### Phase 3: IMPORTANT (Week 2-3)
- [ ] Implement NetworkPolicies
- [ ] Create AppProject with RBAC
- [ ] Add PodDisruptionBudgets
- [ ] Configure Kyverno security policies
### Phase 4: ENHANCEMENT (Week 3-4)
- [ ] Complete TLS configuration
- [ ] Implement cert-manager integration
- [ ] Setup backup strategies
- [ ] Add comprehensive monitoring
---
## Detailed Issues by Category
### Resource Configuration
- **cert-manager:** 50m req, 100m limit (INCREASE to 250m/500m)
- **prometheus:** 250m req, 500m limit (ADEQUATE, but add to values)
- **grafana:** 100m req, 200m limit (INCREASE to 200m/400m)
- **loki:** 100m req, 200m limit (INCREASE to 200m/500m for distributed)
- **fluent-bit:** 50m req, 100m limit (INCREASE to 100m/200m)
- **traefik:** Not specified (ADD 250m/500m, 256Mi/512Mi)
- **kyverno:** Not specified (ADD 100m/200m, 128Mi/256Mi)
- **trivy:** Not specified (ADD 250m/500m, 256Mi/512Mi)
- **sealed-secrets:** Not specified (ADD 100m/200m, 128Mi/256Mi)
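
For reference, the Traefik recommendation above would translate into a block like the following, reading the pairs as request/limit. Where this block sits in each chart's values is chart-specific, so treat the top-level `resources` key as an assumption to verify per chart.

```yaml
# Sketch of Helm values for traefik; key placement may differ per chart
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
```
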
### Storage & Persistence
- **loki:** Filesystem (CRITICAL - switch to S3/GCS)
- **prometheus:** Implicit ephemeral (ADD PVC 20-30GB)
- **grafana:** No persistence specified (QUESTIONABLE - acceptable only if dashboards are provisioned from config rather than created in the UI)
- **sealed-secrets:** Key backup not documented (ADD backup procedure)
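
To make the Prometheus recommendation concrete, the claim it implies looks roughly like the sketch below. Helm charts usually create this claim from their own persistence values rather than from a hand-written manifest, so this only shows the target shape; the claim name, namespace, and storage class are assumptions.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data            # hypothetical name
  namespace: monitoring            # assumed namespace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast           # assumed storage class
  resources:
    requests:
      storage: 20Gi                # lower end of the 20-30GB recommendation
```
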
### High Availability
- **cert-manager:** replicaCount: 1 (INCREASE to 2-3)
- **sealed-secrets:** Implicit single replica (INCREASE to 2-3)
- **traefik:** Replicas: 2 (ADEQUATE, but add PDB)
- **monitoring stack:** Single instances (CONSIDER distributed)
### Security Gaps
- **Secrets in Git:** Grafana
- **No Authentication:** Loki (auth_enabled: false), Prometheus (open HTTP)
- **Wide Permissions:** kubectl RBAC not restricted (ADD ClusterRole)
- **No Network Policies:** All apps (ADD NetworkPolicy)
- **TLS Incomplete:** Prometheus HTTP 80, Traefik `tls: {}`, Fluent-bit→Loki HTTP
---
## Key Statistics
| Metric | Count |
|--------|-------|
| Total Applications Analyzed | 11 |
| Critical Issues | 5 |
| High Priority Issues | 12 |
| Medium Priority Issues | 20+ |
| Best Practice Violations | 30+ |
| Security Concerns | 25+ |
| Apps Missing Resource Requests | 4 |
| Apps Missing Resource Limits | 3 |
| Apps Using Floating Versions | 2 |
| Apps with Hardcoded Secrets | 2 |
| Apps Requiring Persistence | 3 |
| Apps with Single Replica Critical Services | 4 |
---
## Implementation Guidance
### Sealed Secrets Setup
```bash
# Install the sealed-secrets controller (via its ArgoCD Application)
kubectl apply -f ./argocd/apps/sealedsecrets.yaml

# Seal the Grafana admin password. Sealed secrets are namespace-scoped by
# default, so run this against the namespace Grafana is deployed in.
echo -n "new-secure-password" | kubectl create secret generic grafana-admin \
  --dry-run=client --from-file=password=/dev/stdin -o yaml | \
  kubeseal --format yaml -f - > grafana-sealed-secret.yaml

# Update application manifests to reference sealed secrets
```
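
Once the sealed secret is applied, the Grafana values can point at it instead of carrying the password inline. The sketch below assumes the upstream `grafana/grafana` Helm chart; the secret name and `password` key match the command above, while `admin-user` is a hypothetical key that would also have to be added to the secret, since the chart reads the username from the same secret.

```yaml
# Sketch of Grafana Helm values (assumes the grafana/grafana chart)
admin:
  existingSecret: grafana-admin
  userKey: admin-user        # hypothetical key; must exist in the secret
  passwordKey: password      # key created by the kubeseal command
```

Because the old password remains in Git history, it should be rotated rather than re-used when it is sealed.
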
### Persistent Volume for Loki
```yaml
# Add to loki values (exact key path depends on the chart and deployment mode)
persistence:
  enabled: true
  storageClassName: "fast"
  size: 50Gi
  accessModes:
    - ReadWriteOnce
```
### AppProject for RBAC
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: platform
spec:
  destinations:
    - namespace: '*'
      server: 'https://kubernetes.default.svc'
  sourceRepos:
    - 'https://github.com/snothub/*'
  roles:
    - name: admin
      policies:
        # Project role policies must target applications in this project
        - p, proj:platform:admin, applications, *, platform/*, allow
```
### NetworkPolicy for Monitoring
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: monitoring-access
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: grafana
      ports:
        - protocol: TCP
          port: 9090
```
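
An allow rule like the one above only restricts the pods it selects. A common complement is a namespace-wide default-deny policy, so that any pod without an explicit allow rule accepts no ingress. The sketch below uses the same `monitoring` namespace; the policy name is hypothetical, and enforcement requires a CNI plugin that implements NetworkPolicy.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress       # hypothetical name
  namespace: monitoring
spec:
  podSelector: {}                  # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
```

Applying the deny policy together with the allow rules avoids cutting off Grafana-to-Prometheus traffic during rollout.
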
---
## Next Steps
1. **Review this analysis** with your team
2. **Create tickets** for each critical/high issue
3. **Schedule remediation** according to roadmap
4. **Document changes** as they're made
5. **Test thoroughly** in dev/staging first
6. **Monitor impact** after production changes