Multi-Tenant Single Cluster Architecture

Date: 2026-01-23
Status: Accepted
Context: Adding the SPoHF research platform alongside the PRJ2 student platform

Decision

Run SPoHF and PRJ2 as tenants in a single shared Kubernetes cluster, using ResourceQuotas for isolation rather than separate clusters.

Architecture

┌─────────────────────────────────────────────────────────┐
│  Single Kubernetes Cluster (Educloud)                   │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Shared Infrastructure:                                 │
│  - Traefik (ingress)                                    │
│  - Harbor (container registry)                          │
│  - ArgoCD (GitOps)                                      │
│  - Monitoring (Prometheus, Grafana, Loki)               │
│                                                         │
│  ┌──────────────────┐    ┌──────────────────┐          │
│  │   prj2-system    │    │   spohf-system   │          │
│  │                  │    │                  │          │
│  │  No quota        │    │  ResourceQuota:  │          │
│  │  (unrestricted)  │    │  - 8 CPU req     │          │
│  │                  │    │  - 16Gi memory   │          │
│  │  *.prod.         │    │  *.spohf.        │          │
│  │  fontysvenlo.dev │    │  fontysvenlo.dev │          │
│  └──────────────────┘    └──────────────────┘          │
│                                                         │
└─────────────────────────────────────────────────────────┘

Isolation Mechanisms

| Mechanism | Purpose |
|---|---|
| Namespaces | Logical separation of workloads |
| ResourceQuota (SPoHF only) | Cap SPoHF resource usage to protect students |
| LimitRange (SPoHF) | Default pod resources; prevent unbounded requests |
| Separate subdomains | Clear URL separation (prod. vs spohf.) |
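A ResourceQuota and LimitRange pair matching the figures in the diagram might look like the following. This is a sketch, not the deployed manifests: the object names and the LimitRange default values are illustrative assumptions; only the 8 CPU / 16Gi hard limits come from the architecture above.

```yaml
# Sketch: cap SPoHF at the 8 CPU / 16Gi shown in the diagram.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spohf-quota          # illustrative name
  namespace: spohf-system
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
---
# Sketch: default requests so pods without explicit values still
# count against the quota instead of being rejected by it.
apiVersion: v1
kind: LimitRange
metadata:
  name: spohf-defaults       # illustrative name
  namespace: spohf-system
spec:
  limits:
    - type: Container
      defaultRequest:        # assumed defaults, not measured values
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 512Mi
```

Note that once a ResourceQuota on requests is active, pods without resource requests cannot be scheduled in the namespace at all; the LimitRange's defaults are what keep that from breaking ad-hoc research deployments.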

Why No Quota on PRJ2?

PRJ2 student workloads need to scale freely (30+ pods across Node.js and JVM applications). Capping PRJ2 could throttle students during peak usage. Instead, we cap SPoHF so that research bursts cannot starve student workloads.

Rationale

Why single cluster over two clusters?

| Factor | Single Cluster | Two Clusters |
|---|---|---|
| Resource isolation | ResourceQuota (sufficient) | Complete (overkill) |
| Operational overhead | 1× | 2× (upgrades, monitoring, certs) |
| Cost | Lower | Higher (2 control planes) |
| Scaling complexity | Simple (cmk scale works) | Node labels needed per cluster |
| Cognitive load | Lower | Higher |

Key insight: The main concern was resource contention. ResourceQuotas provide sufficient isolation without the operational overhead of a second cluster.

Why ResourceQuota over node-level isolation?

Node selectors/taints create tight coupling with cluster scaling:

  • New nodes from cmk scaleKubernetesCluster come in unlabeled
  • Would require manual labeling after each scale operation
  • ResourceQuotas work regardless of node topology
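For comparison, this is roughly the manual step a node-level approach would require after every scale operation (the node name is a placeholder, and the label/taint keys are illustrative assumptions):

```shell
# After each cmk scaleKubernetesCluster, every new node would need
# labeling (and tainting) by hand before tenant pinning works again.
kubectl label node <new-node-name> tenant=spohf
kubectl taint node <new-node-name> tenant=spohf:NoSchedule
```

ResourceQuotas need none of this: they are evaluated at admission time, per namespace, independent of which nodes exist.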

Why quota only on SPoHF?

  • PRJ2 is the primary workload (students)
  • SPoHF is secondary (research)
  • Easier to monitor one quota than two
  • Clear priority: students > research

Alternatives Considered

Two separate clusters

  • ✅ Complete isolation
  • ❌ 2× operational overhead
  • ❌ Unnecessary for our use case
  • Rejected: Overkill for resource isolation needs

ResourceQuota on both tenants

  • ✅ Both bounded
  • ❌ Could cap student growth unexpectedly
  • Rejected: PRJ2 needs to scale freely

Node-level isolation (taints/affinity)

  • ✅ Hard isolation per node
  • ❌ Couples with cluster scaling
  • ❌ Manual labeling after scale operations
  • Rejected: Too much operational friction

PriorityClasses (eviction-based)

  • ✅ SPoHF pods evicted first under pressure
  • ❌ Reactive (evictions only become visible after students are impacted)
  • ❌ Harder to monitor than quotas
  • Deferred: Can add later if quotas aren't enough
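If this deferred option is ever picked up, it would look roughly like the following. The class name, value, and description are illustrative assumptions; PRJ2 pods would simply keep the (higher) default priority.

```yaml
# Sketch: a lower-priority class for SPoHF so its pods are preempted
# and evicted first under node pressure.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: spohf-low            # illustrative name
value: -100                  # assumed value; anything below the default 0 works
globalDefault: false
description: "SPoHF research workloads; evicted before student workloads."
```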

Consequences

Positive

  • Simple operations (one cluster to manage)
  • Clear monitoring (quota usage visible in Prometheus)
  • Students protected from research bursts
  • Scaling remains simple

Negative

  • Shared failure domain (control plane affects both)
  • Must monitor quota to avoid SPoHF scheduling failures

Future Options

  • Add PriorityClasses if eviction control needed
  • Add NetworkPolicies if network isolation needed
  • Can still split to two clusters later if requirements change
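Should network isolation become necessary, a per-namespace default-deny policy is the usual starting point. A sketch (the policy name is an assumption; this blocks cross-namespace ingress to SPoHF pods until more specific allow rules are added):

```yaml
# Sketch: deny all ingress to spohf-system pods by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress  # illustrative name
  namespace: spohf-system
spec:
  podSelector: {}             # selects every pod in the namespace
  policyTypes:
    - Ingress                 # no ingress rules listed => all ingress denied
```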

Monitoring

Check quota usage:

```shell
kubectl describe resourcequota -n spohf-system
```

Prometheus query:

```promql
kube_resourcequota{namespace="spohf-system", type="used"}
  / ignoring(type) kube_resourcequota{namespace="spohf-system", type="hard"} * 100
```

(`ignoring(type)` is required because the `type` label differs between the two sides, so PromQL's default one-to-one matching would otherwise find no pairs.)
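The quota ratio can also be wired into an alert so scheduling failures are caught before they happen. A sketch of a Prometheus alerting rule; the group name, threshold, duration, and labels are all assumptions:

```yaml
# Sketch: warn when SPoHF exceeds 90% of any hard quota.
groups:
  - name: spohf-quota        # illustrative group name
    rules:
      - alert: SpohfQuotaNearLimit
        expr: |
          kube_resourcequota{namespace="spohf-system", type="used"}
            / ignoring(type) kube_resourcequota{namespace="spohf-system", type="hard"} > 0.9
        for: 10m             # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "spohf-system is above 90% of its ResourceQuota"
```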