Multi-Tenant Single Cluster Architecture
Date: 2026-01-23
Status: Accepted
Context: Adding SPoHF research platform alongside PRJ2 student platform
Decision
Run SPoHF and PRJ2 as tenants in a single shared Kubernetes cluster, using ResourceQuotas for isolation rather than separate clusters.
Architecture
graph TB
classDef shared fill:#438DD5,stroke:#3C7FC0,color:#fff
classDef prj2 fill:#2D9CDB,stroke:#2691CC,color:#fff
classDef spohf fill:#F2994A,stroke:#D98A42,color:#fff
subgraph cluster ["Single Kubernetes Cluster (Educloud)"]
direction TB
subgraph shared_infra ["Shared Infrastructure"]
traefik["Traefik"]:::shared
harbor["Harbor"]:::shared
argocd["ArgoCD"]:::shared
monitoring["Prometheus + Grafana + Loki"]:::shared
end
subgraph prj2_tenant ["PRJ2 — No quota (unrestricted)<br/>*.prod.fontysvenlo.dev"]
prj2sys["prj2-system"]:::prj2
end
subgraph spohf_tenant ["SPoHF — ResourceQuota: 8 CPU / 16Gi<br/>*.spohf.fontysvenlo.dev"]
spohfsys["spohf-system"]:::spohf
end
end
Isolation Mechanisms
| Mechanism | Purpose |
|---|---|
| Namespaces | Logical separation of workloads |
| ResourceQuota (SPoHF only) | Cap SPoHF resource usage to protect students |
| LimitRange (SPoHF) | Default pod resources, prevent unbounded requests |
| Separate subdomains | Clear URL separation (prod. vs spohf.) |
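A minimal sketch of the two SPoHF objects from the table above. The object names and all LimitRange values are illustrative assumptions. This document also does not say whether the 8 CPU / 16Gi cap applies to requests, limits, or both; the sketch assumes requests.

```yaml
# Sketch only: names and LimitRange values are hypothetical.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spohf-quota            # hypothetical name
  namespace: spohf-system
spec:
  hard:
    requests.cpu: "8"          # assumes the cap is on requests
    requests.memory: 16Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: spohf-defaults         # hypothetical name
  namespace: spohf-system
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:                 # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
      max:                     # upper bound for any single container
        cpu: "2"
        memory: 4Gi
```

The two objects work together. Once a quota covers CPU or memory requests, pods that omit those requests are rejected at admission, so the LimitRange defaults keep such pods schedulable.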
Why No Quota on PRJ2?
PRJ2 student workloads need to scale freely (30+ pods running Node.js and JVM applications). Capping PRJ2 could hurt students during peak usage, so we cap SPoHF instead to ensure research bursts cannot starve student workloads.
Rationale
Why single cluster over two clusters?
| Factor | Single Cluster | Two Clusters |
|---|---|---|
| Resource isolation | ResourceQuota (sufficient) | Complete (overkill) |
| Operational overhead | 1× | 2× (upgrades, monitoring, certs) |
| Cost | Lower | Higher (2 control planes) |
| Scaling complexity | Simple (cmk scale works) | Node labels needed per cluster |
| Cognitive load | Lower | Higher |
Key insight: The main concern was resource contention. ResourceQuotas provide sufficient isolation without the operational overhead of a second cluster.
Why ResourceQuota over node-level isolation?
Node selectors/taints create tight coupling with cluster scaling:
- New nodes from `cmk scaleKubernetesCluster` come in unlabeled
- Would require manual labeling after each scale operation
- ResourceQuotas work regardless of node topology
Why quota only on SPoHF?
- PRJ2 is the primary workload (students)
- SPoHF is secondary (research)
- Easier to monitor one quota than two
- Clear priority: students > research
Alternatives Considered
Two separate clusters
- ✅ Complete isolation
- ❌ 2× operational overhead
- ❌ Unnecessary for our use case
- Rejected: Overkill for resource isolation needs
ResourceQuota on both tenants
- ✅ Both bounded
- ❌ Could cap student growth unexpectedly
- Rejected: PRJ2 needs to scale freely
Node-level isolation (taints/affinity)
- ✅ Hard isolation per node
- ❌ Couples with cluster scaling
- ❌ Manual labeling after scale operations
- Rejected: Too much operational friction
PriorityClasses (eviction-based)
- ✅ SPoHF pods evicted first under pressure
- ❌ Reactive (evictions only become visible after the impact)
- ❌ Harder to monitor than quotas
- Deferred: Can add later if quotas aren't enough
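If this option is picked up later, it could look roughly like the sketch below. The class names and priority values are hypothetical, and each workload would still need to reference its class through `priorityClassName` in the pod spec.

```yaml
# Sketch only: names and values are hypothetical, not part of this decision.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prj2-students          # hypothetical name
value: 10000                   # higher value = more important
globalDefault: false
description: "Student workloads (PRJ2); preferred under resource pressure."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: spohf-research         # hypothetical name
value: 1000
globalDefault: false
preemptionPolicy: Never        # research pods queue instead of preempting others
description: "Research workloads (SPoHF); lower priority than students."
```

Priority is only one of the inputs the kubelet uses when ranking pods for node-pressure eviction, which is part of why this approach is harder to reason about than a quota.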
Consequences
Positive
- Simple operations (one cluster to manage)
- Clear monitoring (quota usage visible in Prometheus)
- Students protected from research bursts
- Scaling remains simple
Negative
- Shared failure domain (a control plane outage affects both tenants)
- Must monitor quota to avoid SPoHF scheduling failures
Future Options
- Add PriorityClasses if eviction control needed
- Add NetworkPolicies if network isolation needed
- Can still split to two clusters later if requirements change
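As a rough illustration of the NetworkPolicy option, the sketch below limits ingress into the SPoHF namespace to traffic from the same namespace and from the ingress controller. The policy name and the `traefik` namespace are assumptions, and the policy only takes effect if the cluster's CNI plugin enforces NetworkPolicies.

```yaml
# Sketch only: policy name and the "traefik" namespace are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spohf-restrict-ingress # hypothetical name
  namespace: spohf-system
spec:
  podSelector: {}              # applies to every pod in spohf-system
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}      # traffic from within spohf-system
        - namespaceSelector:   # traffic from the ingress controller namespace
            matchLabels:
              kubernetes.io/metadata.name: traefik
```

A real policy would probably need a further `from` entry for the monitoring namespace so that Prometheus can keep scraping SPoHF pods.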
Monitoring
Check quota usage:
kubectl describe resourcequota -n spohf-system
Prometheus query:
100 * kube_resourcequota{namespace="spohf-system", type="used"}
/ ignoring(type) kube_resourcequota{namespace="spohf-system", type="hard"}
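To catch quota exhaustion before SPoHF pods fail to schedule, the usage ratio can drive an alert. The following is a sketch of a plain Prometheus rules file. The group name, alert name, 80% threshold, and 15m window are assumptions. `ignoring(type)` is needed because the used and hard series differ in their `type` label and would not match otherwise.

```yaml
# Sketch only: names, threshold, and duration are hypothetical.
groups:
  - name: spohf-quota
    rules:
      - alert: SpohfQuotaNearLimit
        # Percentage of each quota resource currently in use.
        expr: |
          100 * kube_resourcequota{namespace="spohf-system", type="used"}
            / ignoring(type)
          kube_resourcequota{namespace="spohf-system", type="hard"}
            > 80
        for: 15m               # ignore short spikes
        labels:
          severity: warning
        annotations:
          summary: "SPoHF quota for {{ $labels.resource }} is at {{ $value | humanize }}%"
```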