
Observability Stack: Prometheus + Grafana + Loki + OpenTelemetry

Date: 2026-03-23
Status: Accepted
Context: Providing monitoring and logging for the platform, with student-facing dashboards for self-service debugging

Decision

Use an open-source, cloud-native observability stack built around CNCF projects: Prometheus for metrics, Loki for log storage, the OpenTelemetry Collector for log collection, and Grafana for visualization. Grafana is configured with anonymous read-only access so students can view dashboards without accounts.

Architecture

graph LR
    pods["Application pods"] -->|"scrape metrics"| prom["Prometheus"]
    logs["Pod logs"] --> otel["OTel Collector<br/>(DaemonSet)"]
    otel --> loki["Loki"]
    bb["Blackbox Exporter"] -->|"probe results"| prom
    prom --> grafana["Grafana<br/>(dashboards)"]
    loki --> grafana
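
The "scrape metrics" edge means Prometheus periodically fetches a plain-text page from each application pod. The following is a minimal sketch of such an endpoint using only the Python standard library. The metric name `app_requests_total`, the in-memory `REQUEST_COUNTS` dictionary, and port 8000 are hypothetical choices made for illustration; a real application would more likely use a Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counters, keyed by HTTP method (illustration only).
REQUEST_COUNTS = {"GET": 0, "POST": 0}


def render_metrics(counts):
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Requests handled, by HTTP method.",
        "# TYPE app_requests_total counter",
    ]
    for method, value in sorted(counts.items()):
        lines.append(f'app_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics on /metrics and 404 everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 8000 is an arbitrary choice for this sketch.
    HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```

Prometheus would then be pointed at the pod's `/metrics` path; how targets are discovered depends on the cluster's scrape configuration, which this record does not specify.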

Key configuration choices

| Component | Configuration | Rationale |
| --- | --- | --- |
| Prometheus | 7-day retention, 5 GB limit | Sufficient for debugging; longer history not needed for education |
| Loki | SingleBinary mode, filesystem storage | Simplest deployment; no object store needed at our scale |
| OTel Collector | DaemonSet, filelog receiver | Collects from all pods on each node; filters out DEBUG/TRACE |
| Grafana | Anonymous access (Viewer role) | Students can self-diagnose without accounts or credentials |
| Grafana | Custom home dashboard | Hides platform internals; students see only what's relevant to them |
| Blackbox Exporter | HTTP probes per team | Monitors team application uptime from outside |
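
To illustrate what "filters out DEBUG/TRACE" amounts to, here is a small Python sketch of the filtering rule. It is not the collector's configuration, only the equivalent logic. The log line shape (`<timestamp> <LEVEL> <message>`) is an assumption, and the helper names `should_forward` and `filter_lines` are hypothetical.

```python
import re

# Assumed line shape: "<timestamp> <LEVEL> <message>". Real pod logs vary,
# so this pattern is illustrative only.
LEVEL_PATTERN = re.compile(r"^\S+\s+(?P<level>[A-Z]+)\s")
DROPPED_LEVELS = {"DEBUG", "TRACE"}


def should_forward(line):
    """Return True unless the line is recognisably DEBUG or TRACE."""
    match = LEVEL_PATTERN.match(line)
    if match is None:
        # Lines without a parseable level are kept rather than silently lost.
        return True
    return match.group("level") not in DROPPED_LEVELS


def filter_lines(lines):
    """Keep only the lines that would be forwarded to the log store."""
    return [line for line in lines if should_forward(line)]
```

Keeping unparseable lines is a deliberate choice in this sketch: stack traces and panics often lack a level prefix, and they are usually the lines students need most.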

Rationale

  • Industry-standard open-source tools (Prometheus and OpenTelemetry are CNCF projects): widely adopted, well documented, and backed by large communities. Skills transfer to industry.
  • Student-facing dashboards: Grafana gives students self-service visibility into their deployments (version history, uptime, logs) without needing kubectl access. This is a key feedback mechanism.
  • Open-source and free: commercial alternatives (Datadog, New Relic, Elastic Cloud) are not feasible for an educational platform because of both cost and data sovereignty concerns.

Alternatives Considered

ELK stack (Elasticsearch + Logstash + Kibana)

  • ✅ Powerful full-text search
  • ❌ Significantly heavier resource footprint (Elasticsearch is memory-hungry)
  • ❌ More complex to operate
  • Rejected: Overkill for our log volume and query needs

Datadog / New Relic / commercial APM

  • ✅ Fully managed, feature-rich
  • ❌ Licensing costs not feasible for education
  • ❌ Data leaves our infrastructure
  • Rejected: Cost and data sovereignty

No monitoring (students use logs only)

  • ❌ Students lose self-service debugging capability
  • ❌ Coaches lose visibility into team deployment health
  • ❌ No uptime monitoring or alerting
  • Rejected: Monitoring is a core part of the platform's feedback mechanism

Consequences

Positive

  • Students self-diagnose deployment issues via Grafana (reduced support load)
  • Deployment frequency and uptime are visible — useful coaching signals
  • Platform team can monitor cluster health alongside student workloads
  • Consistent tooling across all tenants
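
The uptime signal mentioned above comes from HTTP probes whose results are recorded as 1 (success) or 0 (failure). The sketch below shows, under simplifying assumptions, how such a probe and the resulting uptime percentage could be computed. The functions `probe` and `uptime_percent` are hypothetical helpers, not part of Blackbox Exporter, and treating only 2xx responses as success is an assumption of this sketch.

```python
import http.client
import urllib.request


def probe(url, timeout=5.0):
    """Return 1 if the URL answers with a 2xx status, otherwise 0."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 1 if 200 <= response.status < 300 else 0
    except (OSError, ValueError, http.client.HTTPException):
        # Connection errors, timeouts, HTTP error statuses, and malformed
        # URLs all count as a failed probe.
        return 0


def uptime_percent(samples):
    """Compute uptime as the share of successful probes, in percent."""
    if not samples:
        raise ValueError("no probe samples")
    return 100.0 * sum(samples) / len(samples)
```

For example, three successful probes out of four give `uptime_percent([1, 1, 0, 1])`, which evaluates to 75.0.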

Negative

  • 5 components to maintain (Prometheus, Grafana, Loki, OTel, Blackbox)
  • Loki on filesystem storage: logs are lost if the Loki pod is rescheduled to another node, because local-path-provisioner volumes are node-local
  • Anonymous Grafana access means no per-user audit trail
  • Resource consumption: the monitoring stack itself uses non-trivial cluster resources