Observability Stack: Prometheus + Grafana + Loki + OpenTelemetry

Date: 2026-03-23
Status: Accepted
Context: Providing monitoring and logging for the platform, with student-facing dashboards for self-service debugging

Decision

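Use an open-source, cloud-native observability stack: Prometheus for metrics, Loki for log storage and querying, the OpenTelemetry Collector for log collection, and Grafana for visualization. Prometheus and OpenTelemetry are CNCF projects; Loki and Grafana are open-source projects maintained by Grafana Labs. Grafana is configured with anonymous read-only access so students can view dashboards without accounts.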

Architecture

graph LR
    pods["Application pods"] -->|"scrape metrics"| prom["Prometheus"]
    logs["Pod logs"] --> otel["OTel Collector<br/>(DaemonSet)"]
    otel --> loki["Loki"]
    bb["Blackbox Exporter"] -->|"probe results"| prom
    prom --> grafana["Grafana<br/>(dashboards)"]
    loki --> grafana
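
To make the probe path in the diagram concrete, here is a minimal Python sketch of how an uptime percentage could be derived from Blackbox-style probe results. It assumes each probe yields 1 (success) or 0 (failure), as the exporter's `probe_success` metric does, and that any 2xx status counts as success. The function names and the sample status codes are hypothetical and are not taken from the platform's configuration.

```python
from typing import Iterable


def probe_success(status_code: int) -> int:
    """Return 1 for a 2xx response and 0 otherwise (assumed success rule)."""
    return 1 if 200 <= status_code < 300 else 0


def uptime_percent(samples: Iterable[int]) -> float:
    """Average a series of 0/1 probe results into an uptime percentage."""
    values = list(samples)
    if not values:
        raise ValueError("no probe samples in the selected window")
    return 100.0 * sum(values) / len(values)


# Hypothetical status codes observed for one team's application.
observed = [200, 200, 503, 200]
print(uptime_percent(probe_success(code) for code in observed))
```

In Grafana, a comparable figure would typically come from a PromQL expression along the lines of `avg_over_time(probe_success[1h]) * 100`; the one-hour window is only an example.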

Key configuration choices

Component	Configuration	Rationale
Prometheus	7-day retention, 5 GB size limit	Sufficient for debugging; longer history is not needed in an educational setting
Loki	SingleBinary mode, filesystem storage	Simplest deployment; no object store needed at our scale
OTel Collector	DaemonSet, filelog receiver	Collects from all pods on each node; filters out DEBUG/TRACE
Grafana	Anonymous access (Viewer role)	Students can self-diagnose without accounts or credentials
Grafana	Custom home dashboard	Hides platform internals; students see only what's relevant to them
Blackbox Exporter	HTTP probes per team	Monitors team application uptime from outside
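
The DEBUG/TRACE filtering listed for the OTel Collector can be pictured with a short Python sketch. This illustrates the filtering rule only and is not the Collector's configuration. It assumes each log line carries its severity as a standalone upper-case word; the function name and the sample lines are hypothetical.

```python
import re
from typing import Iterable, Iterator

# Severities dropped before forwarding, per the table above.
DROPPED_LEVELS = {"DEBUG", "TRACE"}

# Assumed format: the severity appears as a standalone upper-case word.
LEVEL_PATTERN = re.compile(r"\b(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b")


def forwardable(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines whose first severity token is not dropped."""
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        # Lines without a recognisable severity are kept rather than lost.
        if match is None or match.group(1) not in DROPPED_LEVELS:
            yield line


sample = [
    "2026-03-23T10:00:00Z INFO server started",
    "2026-03-23T10:00:01Z DEBUG cache warmed",
    "2026-03-23T10:00:02Z ERROR database unreachable",
]
for line in forwardable(sample):
    print(line)
```

Keeping lines with no recognisable severity is a deliberate choice in this sketch: an unparsed line is more useful in Loki than a silently dropped one.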

Rationale

Industry-standard open-source tools: widely adopted, well-documented, and backed by large communities (Prometheus and OpenTelemetry under the CNCF, Grafana and Loki under Grafana Labs). The skills students build transfer directly to industry.
Student-facing dashboards: Grafana gives students self-service visibility into their deployments (version history, uptime, logs) without needing kubectl access. This is a key feedback mechanism.
Open-source and free: commercial alternatives (Datadog, New Relic, Elastic Cloud) are not feasible for an educational platform because of both cost and data sovereignty concerns.

Alternatives Considered

ELK stack (Elasticsearch + Logstash + Kibana)

✅ Powerful full-text search
❌ Significantly heavier resource footprint (Elasticsearch is memory-hungry)
❌ More complex to operate
Rejected: Overkill for our log volume and query needs

Datadog / New Relic / commercial APM

✅ Fully managed, feature-rich
❌ Licensing costs not feasible for education
❌ Data leaves our infrastructure
Rejected: Cost and data sovereignty

No monitoring (students use logs only)

❌ Students lose self-service debugging capability
❌ Coaches lose visibility into team deployment health
❌ No uptime monitoring or alerting
Rejected: Monitoring is a core part of the platform's feedback mechanism

Consequences

Positive

Students self-diagnose deployment issues via Grafana (reduced support load)
Deployment frequency and uptime are visible — useful coaching signals
Platform team can monitor cluster health alongside student workloads
Consistent tooling across all tenants

Negative

Five components to maintain (Prometheus, Grafana, Loki, OTel Collector, Blackbox Exporter)
Loki uses filesystem storage: logs are lost if the Loki pod moves to another node (a local-path-provisioner limitation)
Anonymous Grafana access means no per-user audit trail
Resource consumption: the monitoring stack itself uses non-trivial cluster resources