Observability Stack: Prometheus + Grafana + Loki + OpenTelemetry
Date: 2026-03-23
Status: Accepted
Context: The platform needs monitoring and logging, including student-facing dashboards for self-service debugging.
Decision
Use an open-source, cloud-native observability stack: Prometheus for metrics, Loki for log storage and querying, the OpenTelemetry (OTel) Collector for log collection, and Grafana for visualization. The Blackbox Exporter adds external uptime probes. Grafana is configured with anonymous read-only access so students can view dashboards without accounts.
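As a minimal sketch of the anonymous read-only setting, the Python snippet below renders the relevant `grafana.ini` section. The `[auth.anonymous]` keys are Grafana's documented settings; the function name and the default organization name `Main Org.` are assumptions for illustration. The platform may equally set these values through Helm values or `GF_AUTH_ANONYMOUS_*` environment variables.

```python
import configparser
import io


def anonymous_viewer_config(org_name: str = "Main Org.") -> str:
    """Render a grafana.ini fragment that enables anonymous Viewer access."""
    config = configparser.ConfigParser()
    config["auth.anonymous"] = {
        "enabled": "true",
        # Assumption: dashboards live in Grafana's default organization.
        "org_name": org_name,
        # Viewer is the read-only role named in the decision.
        "org_role": "Viewer",
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(anonymous_viewer_config())
```

Generating the fragment from code keeps the role in one place, which may help if the setting ever has to be tightened.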
Architecture
graph LR
pods["Application pods"] -->|"scrape metrics"| prom["Prometheus"]
logs["Pod logs"] --> otel["OTel Collector<br/>(DaemonSet)"]
otel --> loki["Loki"]
bb["Blackbox Exporter"] -->|"probe results"| prom
prom --> grafana["Grafana<br/>(dashboards)"]
loki --> grafana
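The Blackbox Exporter edge in the diagram boils down to an HTTP check whose result Prometheus stores as the `probe_success` metric (1 or 0). The Python sketch below illustrates that check only; it is not the exporter's implementation, and `probe_http` and the example URL are hypothetical. It assumes a 2xx response counts as success, which matches the exporter's default for HTTP probes.

```python
import http.client
import urllib.error
import urllib.request


def probe_http(url: str, timeout_seconds: float = 5.0) -> int:
    """Return 1 if the URL answers with a 2xx status, otherwise 0."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return 1 if 200 <= response.status < 300 else 0
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # 4xx/5xx responses, timeouts, refused connections, and malformed
        # URLs are all treated as a failed probe.
        return 0


if __name__ == "__main__":
    # Hypothetical team endpoint; replace with a real application URL.
    print(probe_http("https://team-a.example.invalid/"))
```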
Key configuration choices
| Component | Configuration | Rationale |
|---|---|---|
| Prometheus | 7-day retention, 5 GB storage limit | Sufficient for debugging; longer history is not needed for teaching |
| Loki | SingleBinary mode, filesystem storage | Simplest deployment; no object store needed at our scale |
| OTel Collector | DaemonSet, filelog receiver | One collector per node tails the log files of every pod on that node; DEBUG and TRACE records are dropped |
| Grafana | Anonymous access (Viewer role) | Students can self-diagnose without accounts or credentials |
| Grafana | Custom home dashboard | Hides platform internals; students see only what's relevant to them |
| Blackbox Exporter | HTTP probes per team | Monitors team application uptime from outside |
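The OTel Collector row describes dropping DEBUG and TRACE records before they reach Loki. In the real deployment this happens in the collector's pipeline configuration; the Python sketch below only illustrates the filtering rule, assuming each log line carries a plain-text severity token. The function names and sample lines are hypothetical.

```python
import re

# Severities that never reach Loki, per the configuration table.
DROPPED_LEVELS = {"DEBUG", "TRACE"}

# Assumption: the severity appears as a standalone word in the line.
LEVEL_PATTERN = re.compile(
    r"\b(TRACE|DEBUG|INFO|WARN|WARNING|ERROR|FATAL)\b", re.IGNORECASE
)


def severity_of(line: str):
    """Return the first severity token in the line, upper-cased, or None."""
    match = LEVEL_PATTERN.search(line)
    return match.group(1).upper() if match else None


def filter_logs(lines):
    """Keep every line whose severity is not DEBUG or TRACE.

    Lines without a recognizable severity are kept, so unparsed output
    is not silently lost.
    """
    return [line for line in lines if severity_of(line) not in DROPPED_LEVELS]


if __name__ == "__main__":
    sample = [
        "2026-03-23T10:00:00Z INFO server started",
        "2026-03-23T10:00:01Z DEBUG cache miss",
        "2026-03-23T10:00:02Z ERROR database unreachable",
    ]
    for line in filter_logs(sample):
        print(line)
```

Keeping unparsed lines is a design choice in this sketch; a stricter filter could drop them instead.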
Rationale
- Industry-standard cloud-native tools: Prometheus and OpenTelemetry are CNCF projects, and Grafana and Loki are widely used alongside them. All four are well documented and have large communities, so the skills transfer to industry.
- Student-facing dashboards: Grafana gives students self-service visibility into their deployments (version history, uptime, logs) without needing kubectl access. This is a key feedback mechanism.
- Open-source and free: commercial alternatives (Datadog, New Relic, Elastic Cloud) are not feasible for an educational platform because of both cost and data-sovereignty concerns.
Alternatives Considered
ELK stack (Elasticsearch + Logstash + Kibana)
- ✅ Powerful full-text search
- ❌ Significantly heavier resource footprint (Elasticsearch is memory-hungry)
- ❌ More complex to operate
- Rejected: Overkill for our log volume and query needs
Datadog / New Relic / commercial APM
- ✅ Fully managed, feature-rich
- ❌ Licensing costs not feasible for education
- ❌ Data leaves our infrastructure
- Rejected: Cost and data sovereignty
No monitoring (students use logs only)
- ❌ Students lose self-service debugging capability
- ❌ Coaches lose visibility into team deployment health
- ❌ No uptime monitoring or alerting
- Rejected: Monitoring is a core part of the platform's feedback mechanism
Consequences
Positive
- Students self-diagnose deployment issues via Grafana (reduced support load)
- Deployment frequency and uptime are visible, which gives coaches useful signals
- Platform team can monitor cluster health alongside student workloads
- Consistent tooling across all tenants
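The uptime signal mentioned above can be derived from Blackbox probe results, where each sample of the `probe_success` metric is 1 for a successful probe and 0 for a failed one. The sketch below computes an uptime percentage from such samples, comparable to the PromQL expression `avg_over_time(probe_success[7d]) * 100`. Whether the dashboards use exactly this expression is an assumption, and `uptime_percent` is a hypothetical helper.

```python
def uptime_percent(probe_success_samples):
    """Return the share of successful probes as a percentage.

    Each sample is 1 (probe succeeded) or 0 (probe failed).
    """
    if not probe_success_samples:
        raise ValueError("no probe samples in the window")
    return 100.0 * sum(probe_success_samples) / len(probe_success_samples)


if __name__ == "__main__":
    # Three successful probes and one failure.
    print(uptime_percent([1, 1, 1, 0]))  # prints 75.0
```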
Negative
- Five components to maintain (Prometheus, Grafana, Loki, OTel Collector, Blackbox Exporter)
- Loki uses filesystem storage on a node-local volume (local-path-provisioner), so stored logs do not follow the Loki pod if it moves to another node
- Anonymous Grafana access means no per-user audit trail
- Resource consumption: the monitoring stack itself uses non-trivial cluster resources
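To put rough numbers on the Prometheus share of that consumption, the sketch below applies the sizing rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample) to the 7-day, 5 GB settings from the configuration table. The ingestion rates are made-up inputs, not measurements from this platform, and the helper names are hypothetical.

```python
SECONDS_PER_DAY = 86_400

# Assumption: the 5 GB limit is interpreted in binary units (1 GB = 1024**3 bytes).
SIZE_LIMIT_BYTES = 5 * 1024**3


def estimated_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Estimate TSDB disk usage with the documented rule of thumb."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


def fits_size_limit(retention_days, samples_per_second,
                    limit_bytes=SIZE_LIMIT_BYTES, bytes_per_sample=2.0):
    """Return True if the estimate stays within the size limit."""
    estimate = estimated_disk_bytes(retention_days, samples_per_second, bytes_per_sample)
    return estimate <= limit_bytes


if __name__ == "__main__":
    for rate in (2_000, 5_000):  # hypothetical ingestion rates in samples/s
        estimate_gib = estimated_disk_bytes(7, rate) / 1024**3
        print(f"{rate} samples/s: {estimate_gib:.2f} GiB, "
              f"fits={fits_size_limit(7, rate)}")
```

When the estimate exceeds the size limit, the size-based limit removes the oldest data first, so the effective history would be shorter than seven days.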