# Multi-Tenant Single Cluster Architecture
Date: 2026-01-23
Status: Accepted
Context: Adding the SPoHF research platform alongside the PRJ2 student platform
## Decision
Run SPoHF and PRJ2 as tenants in a single shared Kubernetes cluster, using ResourceQuotas for isolation rather than separate clusters.
## Architecture
```
┌─────────────────────────────────────────────┐
│ Single Kubernetes Cluster (Educloud)        │
├─────────────────────────────────────────────┤
│                                             │
│ Shared Infrastructure:                      │
│ - Traefik (ingress)                         │
│ - Harbor (container registry)               │
│ - ArgoCD (GitOps)                           │
│ - Monitoring (Prometheus, Grafana, Loki)    │
│                                             │
│ ┌──────────────────┐  ┌──────────────────┐  │
│ │ prj2-system      │  │ spohf-system     │  │
│ │                  │  │                  │  │
│ │ No quota         │  │ ResourceQuota:   │  │
│ │ (unrestricted)   │  │ - 8 CPU req      │  │
│ │                  │  │ - 16Gi memory    │  │
│ │ *.prod.          │  │ *.spohf.         │  │
│ │ fontysvenlo.dev  │  │ fontysvenlo.dev  │  │
│ └──────────────────┘  └──────────────────┘  │
│                                             │
└─────────────────────────────────────────────┘
```
## Isolation Mechanisms
| Mechanism | Purpose |
|---|---|
| Namespaces | Logical separation of workloads |
| ResourceQuota (SPoHF only) | Cap SPoHF resource usage to protect students |
| LimitRange (SPoHF) | Default pod resources, prevent unbounded requests |
| Separate subdomains | Clear URL separation (prod. vs spohf.) |
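
Concretely, the SPoHF quota might be expressed as follows — a sketch using the 8 CPU / 16Gi figures from the diagram; the object name is an assumption:

```yaml
# Sketch of the SPoHF quota. Values come from the architecture
# diagram; the metadata.name is illustrative, not the real object name.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spohf-quota
  namespace: spohf-system
spec:
  hard:
    requests.cpu: "8"       # total CPU requests across all SPoHF pods
    requests.memory: 16Gi   # total memory requests across all SPoHF pods
```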
## Why No Quota on PRJ2?

PRJ2 student workloads need to scale freely (30+ pods, Node.js and JVM applications). Capping PRJ2 could impact students during peak usage. Instead, we cap SPoHF so that research bursts cannot starve student workloads.
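
The LimitRange complements the quota: once a namespace has a ResourceQuota on requests, Kubernetes rejects any pod that omits resource requests, so per-container defaults are needed. A sketch with assumed default values (not measured figures):

```yaml
# Sketch of SPoHF pod defaults. The name and the default/defaultRequest
# values are assumptions for illustration.
apiVersion: v1
kind: LimitRange
metadata:
  name: spohf-defaults
  namespace: spohf-system
spec:
  limits:
    - type: Container
      defaultRequest:       # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:              # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
```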
## Rationale

### Why single cluster over two clusters?
| Factor | Single Cluster | Two Clusters |
|---|---|---|
| Resource isolation | ResourceQuota (sufficient) | Complete (overkill) |
| Operational overhead | 1× | 2× (upgrades, monitoring, certs) |
| Cost | Lower | Higher (2 control planes) |
| Scaling complexity | Simple (`cmk scale` works) | Node labels needed per cluster |
| Cognitive load | Lower | Higher |
**Key insight:** The main concern was resource contention. ResourceQuotas provide sufficient isolation without the operational overhead of a second cluster.
### Why ResourceQuota over node-level isolation?
Node selectors/taints create tight coupling with cluster scaling:
- New nodes from `cmk scaleKubernetesCluster` come in unlabeled
- Manual labeling would be required after each scale operation
- ResourceQuotas work regardless of node topology
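
To illustrate the friction: with node-level isolation, every scale operation would need a manual follow-up along these lines (the `tenant` label/taint key is hypothetical):

```shell
# Hypothetical post-scale steps that node-level isolation would require.
# The tenant=spohf label and taint key are illustrative, not real config.
kubectl label node <new-node> tenant=spohf
kubectl taint node <new-node> tenant=spohf:NoSchedule
```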
### Why quota only on SPoHF?
- PRJ2 is the primary workload (students)
- SPoHF is secondary (research)
- Easier to monitor one quota than two
- Clear priority: students > research
## Alternatives Considered

### Two separate clusters
- ✅ Complete isolation
- ❌ 2× operational overhead
- ❌ Unnecessary for our use case
- Rejected: Overkill for resource isolation needs
### ResourceQuota on both tenants
- ✅ Both bounded
- ❌ Could cap student growth unexpectedly
- Rejected: PRJ2 needs to scale freely
### Node-level isolation (taints/affinity)
- ✅ Hard isolation per node
- ❌ Couples with cluster scaling
- ❌ Manual labeling after scale operations
- Rejected: Too much operational friction
### PriorityClasses (eviction-based)
- ✅ SPoHF pods evicted first under pressure
- ❌ Reactive (evictions only become visible after the impact has occurred)
- ❌ Harder to monitor than quotas
- Deferred: Can add later if quotas aren't enough
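
Should quotas ever prove insufficient, the deferred PriorityClass approach could look roughly like this (class name, value, and description are assumptions):

```yaml
# Sketch of a low-priority class for SPoHF; name and value are illustrative.
# Pods would still need priorityClassName: spohf-low-priority in their spec.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: spohf-low-priority
value: -100                 # below the default of 0, so evicted/preempted first
globalDefault: false
description: "SPoHF research workloads; reclaimed before student workloads under pressure."
```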
## Consequences

### Positive
- Simple operations (one cluster to manage)
- Clear monitoring (quota usage visible in Prometheus)
- Students protected from research bursts
- Scaling remains simple
### Negative
- Shared failure domain (control plane affects both)
- Must monitor quota to avoid SPoHF scheduling failures
## Future Options
- Add PriorityClasses if eviction control needed
- Add NetworkPolicies if network isolation needed
- Can still split to two clusters later if requirements change
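
If network isolation is added later, a minimal starting point could restrict SPoHF ingress to its own namespace — a sketch, with an illustrative policy name:

```yaml
# Sketch: allow ingress to SPoHF pods only from pods in the same namespace.
# An empty podSelector in "from" (without a namespaceSelector) matches
# pods in the policy's own namespace. The name is an assumption.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: spohf-system
spec:
  podSelector: {}           # applies to all pods in spohf-system
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}   # same-namespace traffic only
```

Note that cluster ingress (Traefik) would then also need an explicit allow rule to keep reaching SPoHF services.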
## Monitoring

Check quota usage:

```shell
kubectl describe resourcequota -n spohf-system
```

Prometheus query for quota utilization in percent (the `ignoring(type)` matcher is required so the `used` and `hard` series match despite their differing `type` labels):

```promql
kube_resourcequota{namespace="spohf-system", type="used"}
  / ignoring(type)
kube_resourcequota{namespace="spohf-system", type="hard"} * 100
```
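
If the cluster's Prometheus stack runs the Prometheus Operator (an assumption — the ADR only mentions Prometheus, Grafana, and Loki), the quota check could be automated as an alert rule. Object, group, and alert names are illustrative:

```yaml
# Sketch of an alert that fires when SPoHF sits above 80% of any quota
# dimension for 15 minutes. All names and thresholds are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spohf-quota-alerts
  namespace: spohf-system
spec:
  groups:
    - name: spohf-quota
      rules:
        - alert: SpohfQuotaNearLimit
          expr: |
            kube_resourcequota{namespace="spohf-system", type="used"}
              / ignoring(type)
            kube_resourcequota{namespace="spohf-system", type="hard"} > 0.8
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "SPoHF namespace is above 80% of its ResourceQuota"
```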