Case study
Multi-tenant isolation in elizaOS Cloud — PostgreSQL Row-Level Security, encryption & full E2E safety model
- PostgreSQL
- Row-Level Security
- Drizzle ORM
- encryption-at-rest
- multi-tenant
- AI agents
- elizaOS
- E2E tests
- migration
TL;DR
At elizaOS Cloud (10k+ users on a shared PostgreSQL backend), I moved tenant isolation from application logic into PostgreSQL using Row-Level Security (RLS), added transparent encryption for sensitive agent data, and built a full end-to-end test suite on real databases — including bidirectional migration tests — to guarantee correctness of multi-tenant isolation under production conditions.
Context
elizaOS is an open-source AI agent framework powering a cloud platform where users create and run autonomous agents (“characters”).
The cloud architecture is multi-tenant and runs on a shared PostgreSQL database. It includes multiple entity types:
- users
- agents (“characters”)
- worlds (shared environments)
- delegated execution contexts
Originally, tenant isolation was enforced at application level using patterns like:
WHERE owner_id = current_user
This created a structural risk:
any missing filter = potential cross-tenant data exposure
In a system where multiple services, contributors, and agent execution paths evolve quickly, this model is inherently fragile.
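The failure mode can be sketched in a few lines of TypeScript (illustrative names and an in-memory table, not the actual elizaOS code): every call site must remember the ownership predicate, and a single forgotten filter silently returns every tenant's rows.

```typescript
// Hypothetical model of application-level scoping over a shared table.
type Row = { id: number; owner_id: string; data: string };

const table: Row[] = [
  { id: 1, owner_id: "tenant-a", data: "a-secret" },
  { id: 2, owner_id: "tenant-b", data: "b-secret" },
];

// Correctly scoped query: the developer remembered the filter.
function listAgentsScoped(tenant: string): Row[] {
  return table.filter((r) => r.owner_id === tenant);
}

// One forgotten predicate: no error, no warning, cross-tenant exposure.
function listAgentsUnscoped(): Row[] {
  return table.slice();
}
```

Nothing in the type system or the database distinguishes the two functions; with RLS, the unscoped query would still return only the session tenant's rows.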
Constraints
The redesign had to respect strict constraints:
- ~10,000 active users on shared infrastructure
- no downtime or flag-day migration allowed
- open-source compatibility (self-hosting must remain possible)
- PostgreSQL + Drizzle ORM stack
- high-frequency AI agent queries (latency-sensitive workload)
- backward compatibility with existing datasets and deployments
The system had to be safe by design, not by discipline.
Decision — push isolation into PostgreSQL
I replaced application-level filtering with PostgreSQL Row-Level Security (RLS).
Core idea
Every database connection carries a session-scoped tenant context:
- tenant identity is injected at connection level
- PostgreSQL enforces row visibility automatically
- application no longer needs to remember filtering rules
This eliminates an entire class of bugs.
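The injection step can be sketched as a small TypeScript wrapper (assumed API shape, not the actual elizaOS data layer): every unit of work runs in a transaction that first sets the tenant GUC that the RLS policies read via `current_setting('app.current_tenant')`. Using `set_config(..., true)` scopes the value to the transaction, so pooled connections cannot leak a tenant context.

```typescript
// Minimal client interface standing in for a pg/Drizzle connection.
interface SqlClient {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

// Run `fn` with a transaction-scoped tenant context set in PostgreSQL.
async function withTenant<T>(
  client: SqlClient,
  tenantId: string,
  fn: (client: SqlClient) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN");
  try {
    // is_local = true: the setting lives only for this transaction.
    await client.query(
      "SELECT set_config('app.current_tenant', $1, true)",
      [tenantId],
    );
    const result = await fn(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Every query issued inside `fn` is then filtered by PostgreSQL itself, whatever the application code looks like.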
Entity-level isolation model
The system is not purely user-centric. It includes multiple entity types:
- users
- agents
- worlds
Agents can act on behalf of users, which requires delegation-aware isolation.
So I implemented:
- entity-level RLS policies (not just user_id filtering)
- delegation rules inside SQL policies
- controlled access paths for agent execution contexts
This allows agents to operate safely without bypassing isolation boundaries.
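A delegation-aware visibility rule can be modelled in TypeScript (this mirrors what the SQL policies express; the names and shapes are illustrative, not the production schema): a row is visible if the session tenant owns it, or if the session is an agent execution context that the row's owner has explicitly delegated to.

```typescript
// Session context carried by each connection.
interface SessionCtx {
  tenantId: string;
  actingAgentId?: string; // set when an agent executes on a user's behalf
}

// A delegation grant: owner allows a specific agent to read their rows.
interface Delegation {
  ownerId: string;
  agentId: string;
}

function isRowVisible(
  rowOwnerId: string,
  ctx: SessionCtx,
  delegations: Delegation[],
): boolean {
  if (rowOwnerId === ctx.tenantId) return true; // direct ownership
  if (!ctx.actingAgentId) return false;         // no agent context, no delegation path
  // delegation path: the row's owner granted access to this acting agent
  return delegations.some(
    (d) => d.ownerId === rowOwnerId && d.agentId === ctx.actingAgentId,
  );
}
```

In production this predicate lives inside the RLS policy itself, typically as an EXISTS subquery against the delegation table, so it cannot be bypassed from application code.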
What I built
1. PostgreSQL Row-Level Security layer
Each tenant-scoped table has RLS enabled with policies based on session context.
Example (simplified):
ALTER TABLE agent_data ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON agent_data
USING (
owner_id = current_setting('app.current_tenant')::uuid
);
More advanced policies include delegation logic between users and agents.
2. Transparent encryption for sensitive agent data
Agents store sensitive fields such as:
- API keys
- system prompts
- private instructions
I implemented a transparent encryption layer:
- encryption on write
- decryption on read
- integrated into data access layer
- bound to the same session context as RLS
This ensures that even if a query is mis-scoped, raw secrets remain protected.
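The write/read envelope can be sketched with AES-256-GCM from Node's built-in crypto module (a minimal sketch: key management, rotation, and the binding to the session context are out of scope here, and in production the key would come from a KMS or a per-tenant key table rather than being passed in directly).

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypt on write: pack iv + auth tag + ciphertext into one stored string.
function encryptField(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // standard GCM nonce size
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // 16 bytes; detects tampering on read
  return Buffer.concat([iv, tag, ct]).toString("base64");
}

// Decrypt on read: unpack, verify the auth tag, recover the plaintext.
function decryptField(stored: string, key: Buffer): string {
  const buf = Buffer.from(stored, "base64");
  const iv = buf.subarray(0, 12);
  const tag = buf.subarray(12, 28);
  const ct = buf.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ct), decipher.final()]).toString("utf8");
}
```

Because GCM is authenticated, a tampered or truncated column value fails decryption instead of yielding garbage, which matters when the stored blob is the only copy of an API key.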
3. Full end-to-end test suite on real PostgreSQL
A critical part of the system is a complete E2E test suite using a real PostgreSQL instance (not mocks).
It includes:
- full multi-tenant isolation tests with RLS enabled
- baseline tests with RLS disabled (comparison mode)
- bidirectional migration testing:
- without RLS → with RLS
- with RLS → rollback scenario
- full dataset migration simulation
- validation of visibility rules per tenant
This ensures correctness not only in theory, but under real database behavior.
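The core invariant those isolation tests assert can be written as a pure function (illustrative shape, not the actual test harness): with RLS enabled, each tenant must see exactly the subset of the RLS-disabled baseline result set that it owns, no more and no less.

```typescript
interface OwnedRow { id: number; ownerId: string }

// baselineRows: what a query returns with RLS disabled (comparison mode).
// tenantRows:   what the same query returns to `tenant` with RLS enabled.
function checkIsolationInvariant(
  baselineRows: OwnedRow[],
  tenantRows: OwnedRow[],
  tenant: string,
): boolean {
  const expected = new Set(
    baselineRows.filter((r) => r.ownerId === tenant).map((r) => r.id),
  );
  const actual = new Set(tenantRows.map((r) => r.id));
  // equality of the two id sets: no leaked rows, no missing rows
  return (
    expected.size === actual.size &&
    Array.from(actual).every((id) => expected.has(id))
  );
}
```

Running this check per tenant over real query results is what catches both failure directions: a policy that leaks foreign rows and a policy that over-filters legitimate ones.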
4. Safe migration strategy for legacy data
Legacy datasets had incomplete ownership metadata.
Migration was done in three controlled phases:
1. Backfill phase: infer and populate ownership fields
2. Shadow enforcement phase: RLS enabled in audit mode, with full logging of visibility results
3. Strict enforcement phase: full isolation enforced at database level
Each step included consistency validation and rollback capability.
5. Performance considerations
RLS introduces overhead since policies are evaluated per query.
Mitigations:
- strong indexing on owner_id and delegation tables
- minimal subqueries in hot paths
- session context resolved once per connection
- simplified policy logic where possible
In practice, overhead remained acceptable for production workloads.
Outcome
Security model transformation
The system moved from:
- application-enforced isolation (fragile, developer-dependent)
to:
- database-enforced isolation (deterministic, always-on)
This eliminates the entire “forgotten filter” failure class.
Encryption impact
Sensitive agent data is now:
- encrypted at rest
- decrypted only through controlled access paths
- never exposed via raw SQL bypass
Migration success
The full migration was achieved:
- without downtime
- without flag-day cutover
- with full validation via bidirectional tests
Testing impact (key improvement)
The most important safety guarantee is not just RLS — it is the E2E test suite on real PostgreSQL, which validates:
- tenant isolation correctness
- migration safety
- rollback feasibility
- equivalence between pre- and post-RLS behavior
This is what makes the system robust in practice.
What I would do differently
The main improvement would be to:
- integrate RLS policy tests more directly into code review workflows
- make policy changes require explicit multi-tenant regression checks by default
The system is already well-protected thanks to E2E testing, but policy changes are a sensitive surface area and should be even more tightly governed.