Multi-Tenant Single Cluster Architecture

Date: 2026-01-23
Status: Accepted
Context: Adding SPoHF research platform alongside PRJ2 student platform

Decision

Run SPoHF and PRJ2 as tenants in a single shared Kubernetes cluster. Resource isolation comes from a ResourceQuota on the SPoHF namespace rather than from separate clusters.

Architecture

graph TB
    classDef shared fill:#438DD5,stroke:#3C7FC0,color:#fff
    classDef prj2 fill:#2D9CDB,stroke:#2691CC,color:#fff
    classDef spohf fill:#F2994A,stroke:#D98A42,color:#fff

    subgraph cluster ["Single Kubernetes Cluster (Educloud)"]
        direction TB
        subgraph shared_infra ["Shared Infrastructure"]
            traefik["Traefik"]:::shared
            harbor["Harbor"]:::shared
            argocd["ArgoCD"]:::shared
            monitoring["Prometheus + Grafana + Loki"]:::shared
        end
        subgraph prj2_tenant ["PRJ2 — No quota (unrestricted)<br/>*.prod.fontysvenlo.dev"]
            prj2sys["prj2-system"]:::prj2
        end
        subgraph spohf_tenant ["SPoHF — ResourceQuota: 8 CPU / 16Gi<br/>*.spohf.fontysvenlo.dev"]
            spohfsys["spohf-system"]:::spohf
        end
    end

Isolation Mechanisms

Mechanism	Purpose
Namespaces	Logical separation of workloads
ResourceQuota (SPoHF only)	Cap SPoHF resource usage to protect students
LimitRange (SPoHF)	Default container requests/limits, so pods without explicit resources are still admitted under the quota and cannot make unbounded requests
Separate subdomains	Clear URL separation (prod. vs spohf.)
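
As a minimal sketch of how the quota and defaults above could be declared: only the namespace and the 8 CPU / 16Gi figures come from this record. The object names, the choice to cap requests rather than limits, and every LimitRange value are assumptions for illustration.

```yaml
# Sketch only: object names and LimitRange values are assumptions.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spohf-quota            # hypothetical name
  namespace: spohf-system
spec:
  hard:
    requests.cpu: "8"          # 8 CPU / 16Gi as shown in the diagram;
    requests.memory: 16Gi      # capping requests (not limits) is an assumption
---
apiVersion: v1
kind: LimitRange
metadata:
  name: spohf-defaults         # hypothetical name
  namespace: spohf-system
spec:
  limits:
    - type: Container
      defaultRequest:          # used when a container sets no requests
        cpu: 100m              # illustrative value
        memory: 128Mi          # illustrative value
      default:                 # used when a container sets no limits
        cpu: 500m              # illustrative value
        memory: 512Mi          # illustrative value
```

Once a quota covers CPU or memory, pods that omit the corresponding requests are rejected at admission, which is why the LimitRange defaults matter alongside the quota.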

Why No Quota on PRJ2?

PRJ2 student workloads need to scale freely (30+ pods running Node.js and JVM applications), and a quota on PRJ2 could block student deployments during peak usage. Instead, we cap SPoHF so that research bursts don't starve student workloads.

Rationale

Why single cluster over two clusters?

Factor	Single Cluster	Two Clusters
Resource isolation	ResourceQuota (sufficient)	Complete (overkill)
Operational overhead	1×	2× (upgrades, monitoring, certs)
Cost	Lower	Higher (2 control planes)
Scaling complexity	Simple (cmk scaleKubernetesCluster works as-is)	Node labels needed per cluster
Cognitive load	Lower	Higher

Key insight: The main concern was resource contention. ResourceQuotas provide sufficient isolation without the operational overhead of a second cluster.

Why ResourceQuota over node-level isolation?

Node selectors and taints couple tenant isolation to cluster scaling:

New nodes from cmk scaleKubernetesCluster join the cluster unlabeled
Every scale operation would require manual labeling afterwards
ResourceQuotas work regardless of node topology
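
To make the coupling concrete, here is a hypothetical sketch of what the rejected node-level approach would have required. The `tenant` label and taint key, the pod name, and the image are all assumptions. The commented kubectl commands are the manual step that would have to be repeated for each new node.

```yaml
# Hypothetical sketch of the rejected approach; not deployed anywhere.
#
# Manual step after every scale operation, per new node:
#   kubectl label node <node-name> tenant=spohf
#   kubectl taint nodes <node-name> tenant=spohf:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: spohf-example          # hypothetical name
  namespace: spohf-system
spec:
  nodeSelector:
    tenant: spohf              # only schedules onto labeled nodes
  tolerations:
    - key: tenant              # allows scheduling onto the tainted nodes
      operator: Equal
      value: spohf
      effect: NoSchedule
  containers:
    - name: app
      image: registry.example.invalid/spohf/app:latest   # placeholder image
```

Until a new node is labeled, pods with this selector cannot use it, so added capacity is not available to them directly after a scale operation.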

Why quota only on SPoHF?

PRJ2 is the primary workload (students)
SPoHF is secondary (research)
Easier to monitor one quota than two
Clear priority: students > research

Alternatives Considered

Two separate clusters

✅ Complete isolation
❌ 2× operational overhead
❌ Unnecessary for our use case
Rejected: Overkill for resource isolation needs

ResourceQuota on both tenants

✅ Both bounded
❌ Could cap student growth unexpectedly
Rejected: PRJ2 needs to scale freely

Node-level isolation (taints/affinity)

✅ Hard isolation per node
❌ Couples with cluster scaling
❌ Manual labeling after scale operations
Rejected: Too much operational friction

PriorityClasses (eviction-based)

✅ SPoHF pods evicted first under pressure
❌ Reactive (evictions only show up after workloads are already under pressure)
❌ Harder to monitor than quotas
Deferred: Can add later if quotas aren't enough
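
If this option is picked up later, a minimal sketch could look like the following. The class name, the value, and the pod are assumptions. The value sits below the default pod priority of 0, so pods without a priority class would outrank SPoHF pods without needing any change themselves.

```yaml
# Sketch only: name and value are assumptions, nothing here is deployed.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: spohf-research         # hypothetical name
value: -1                      # below the default priority (0) of unclassified pods
globalDefault: false
description: "Research workloads; lower priority than student workloads."
---
apiVersion: v1
kind: Pod
metadata:
  name: spohf-example          # hypothetical name
  namespace: spohf-system
spec:
  priorityClassName: spohf-research
  containers:
    - name: app
      image: registry.example.invalid/spohf/app:latest   # placeholder image
```

Priority is only one of the inputs the kubelet uses when ranking pods for node-pressure eviction, so this should be read as a preference, not a guarantee of eviction order.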

Consequences

Positive

Simple operations (one cluster to manage)
Clear monitoring (quota usage visible in Prometheus via kube-state-metrics)
Students protected from research bursts
Scaling remains simple

Negative

Shared failure domain (a control-plane or shared-infrastructure outage affects both tenants)
Must monitor quota usage: once the quota is exhausted, new SPoHF pods are rejected at admission

Future Options

Add PriorityClasses if eviction control needed
Add NetworkPolicies if network isolation needed
Can still split to two clusters later if requirements change
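
For the NetworkPolicy option, a possible starting point is sketched below. The policy name and the `traefik` namespace are assumptions, and enforcement depends on the cluster's CNI plugin supporting NetworkPolicy.

```yaml
# Sketch only: policy name and the "traefik" namespace are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-ingress       # hypothetical name
  namespace: spohf-system
spec:
  podSelector: {}              # applies to every pod in spohf-system
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}      # pods in spohf-system itself
        - namespaceSelector:   # the ingress controller's namespace
            matchLabels:
              kubernetes.io/metadata.name: traefik
```

A real policy would likely also need to allow the monitoring stack to scrape metrics from the namespace.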

Monitoring

Check quota usage:

kubectl describe resourcequota -n spohf-system

Prometheus query (usage as a percentage of the hard limit, per resource):

kube_resourcequota{namespace="spohf-system", type="used"}
  / ignoring(type)
kube_resourcequota{namespace="spohf-system", type="hard"} * 100
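
Since quota usage has to be watched, the same ratio could also drive an alert. The sketch below uses the plain Prometheus rule-file format. The group and alert names, the 80% threshold, the 10-minute window, and the severity label are assumptions. The `ignoring(type)` modifier is there because the two series differ in their `type` label and would otherwise not match.

```yaml
# Sketch only: names, threshold, duration and labels are assumptions.
groups:
  - name: spohf-quota          # hypothetical group name
    rules:
      - alert: SpohfQuotaNearlyExhausted
        expr: |
          kube_resourcequota{namespace="spohf-system", type="used"}
            / ignoring(type)
          kube_resourcequota{namespace="spohf-system", type="hard"}
            > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "SPoHF quota usage above 80% for {{ $labels.resource }}"
```

If the cluster runs the Prometheus Operator, the same rule would go under `spec.groups` of a PrometheusRule object instead.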