Stan

Case study

Multi-tenant isolation in elizaOS Cloud — PostgreSQL Row-Level Security, encryption & full E2E safety model

At elizaOS Cloud (10k+ users on a shared PostgreSQL backend), I moved tenant isolation from application logic into PostgreSQL using Row-Level Security (RLS), added transparent encryption for sensitive agent data, and built a full end-to-end test suite on real databases — including bidirectional migration tests — to guarantee correctness of multi-tenant isolation under production conditions.

Eliza Labs · elizaOS Cloud 8 min
  • PostgreSQL
  • Row-Level Security
  • Drizzle ORM
  • encryption-at-rest
  • multi-tenant
  • AI agents
  • elizaOS
  • E2E tests
  • migration

Context

elizaOS is an open-source AI agent framework powering a cloud platform where users create and run autonomous agents (“characters”).

The cloud architecture is multi-tenant and runs on a shared PostgreSQL database. It includes multiple entity types:

  • users
  • agents (“characters”)
  • worlds (shared environments)
  • delegated execution contexts

Originally, tenant isolation was enforced at the application level, with every query expected to carry a filter such as:

WHERE owner_id = :current_user_id

This created a structural risk:

any missing filter = potential cross-tenant data exposure

In a system where multiple services, contributors, and agent execution paths evolve quickly, this model is inherently fragile.

Constraints

The redesign had to respect strict constraints:

  • ~10,000 active users on shared infrastructure
  • no downtime or flag-day migration allowed
  • open-source compatibility (self-hosting must remain possible)
  • PostgreSQL + Drizzle ORM stack
  • high-frequency AI agent queries (latency-sensitive workload)
  • backward compatibility with existing datasets and deployments

The system had to be safe by design, not by discipline.

Decision — push isolation into PostgreSQL

I replaced application-level filtering with PostgreSQL Row-Level Security (RLS).

Core idea

Every database connection carries a session-scoped tenant context:

  • tenant identity is injected at connection level
  • PostgreSQL enforces row visibility automatically
  • the application no longer needs to remember filtering rules

This eliminates the class of bugs where a query ships without its tenant filter.
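
The injection step can be sketched as follows. This is a minimal illustration, not the production code: the helper names `entityContextStatement` and `scopedTransaction` are hypothetical, and the helpers only build the statements a driver would run. `set_config` is a PostgreSQL built-in, and the setting name `app.current_entity_id` follows the diagram in this case study.

```typescript
// Hypothetical helpers: they only build the statements a driver would run.
interface Statement {
  text: string;
  values: string[];
}

const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Builds the statement that binds the entity to the current transaction.
function entityContextStatement(entityId: string): Statement {
  if (!UUID_RE.test(entityId)) {
    throw new Error("entity id must be a UUID");
  }
  // set_config(name, value, is_local): is_local = true limits the value to
  // the current transaction, so a pooled connection cannot carry it over.
  return {
    text: "SELECT set_config('app.current_entity_id', $1, true)",
    values: [entityId],
  };
}

// Wraps a unit of work so the context is set before any other query runs.
function scopedTransaction(entityId: string, work: Statement[]): Statement[] {
  return [
    { text: "BEGIN", values: [] },
    entityContextStatement(entityId),
    ...work,
    { text: "COMMIT", values: [] },
  ];
}
```

Scoping the value to a transaction is an assumption made here for pooled connections; the description above sets the context per connection, which `set_config(..., false)` would model instead.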

Entity-level isolation model

The system is not purely user-centric. It includes multiple entity types:

  • users
  • agents
  • worlds

Agents can act on behalf of users, which requires delegation-aware isolation.

So I implemented:

  • entity-level RLS policies (not just user_id filtering)
  • delegation rules inside SQL policies
  • controlled access paths for agent execution contexts

This allows agents to operate safely without bypassing isolation boundaries.
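
The visibility rule these policies express can be modelled in a few lines. The sketch below is an in-memory illustration only: the real rule lives in SQL policies, and the `Delegation` shape and function names are assumptions, not the actual schema.

```typescript
// In-memory model of a delegation-aware visibility rule (illustrative only).
type EntityId = string;

interface OwnedRow {
  id: string;
  ownerId: EntityId;
}

// Hypothetical delegation record: `agentId` may act on behalf of `ownerId`.
interface Delegation {
  agentId: EntityId;
  ownerId: EntityId;
}

function isVisible(
  row: OwnedRow,
  current: EntityId | null,
  delegations: Delegation[]
): boolean {
  // No session context means no rows, mirroring a policy that fails closed.
  if (current === null) return false;
  // Direct ownership.
  if (row.ownerId === current) return true;
  // Delegated access: the current entity is an agent acting for the owner.
  return delegations.some(
    (d) => d.agentId === current && d.ownerId === row.ownerId
  );
}

function visibleRows(
  rows: OwnedRow[],
  current: EntityId | null,
  delegations: Delegation[]
): OwnedRow[] {
  return rows.filter((row) => isVisible(row, current, delegations));
}
```

An agent with a delegation from one owner sees that owner's rows and nothing else; without a delegation record it sees only rows it owns itself.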

What I built

Figure: elizaOS Cloud multi-tenant isolation, in four layers.

  • 01 · Application layer: elizaOS Cloud, TypeScript runtime, Drizzle ORM. Developers write queries; no manual WHERE clause is needed for isolation.
  • 02 · Session context, injected on every DB connection: SET app.current_entity_id = '<current entity>'. The entity is a user, agent, world, or delegated context (entity-level, not just user-level).
  • 03 · PostgreSQL, policies enforced on every row read/write:
    • Entity-scoped Row-Level Security policies, e.g. CREATE POLICY tenant_isolation ON agent_data USING ( owner_id = current_setting('app.current_entity_id')::uuid ); delegation rules are built in: an agent reads its owner's data, scoped to that owner.
    • Transparent encryption layer for character secrets, API keys, and prompts: encrypt on write, decrypt on read, bound to the same session context. A misqueried row yields ciphertext, not plaintext.
  • 04 · Contract of correctness: a forgotten filter in app code now returns zero rows, not the wrong tenant's data. Migration ran in three controlled phases (backfill → shadow → strict). No downtime.

Caption: elizaOS Cloud multi-tenant isolation. Entity identity is injected into the DB session, Postgres enforces RLS policies, and a transparent encryption layer protects character secrets. A forgotten WHERE in app code = zero rows visible.

1. PostgreSQL Row-Level Security layer

Each tenant-scoped table has RLS enabled with policies based on session context.

Example (simplified):

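-- RLS must be enabled on the table before any policy takes effect
ALTER TABLE agent_data ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON agent_data
USING (
  owner_id = current_setting('app.current_entity_id')::uuid
);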

More advanced policies include delegation logic between users and agents.
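
Applying the same policy to every tenant-scoped table is easier to keep consistent when the DDL is generated. The sketch below assumes a hypothetical `rlsStatements` helper and an `owner_id` column on each table; the SQL it emits uses only standard PostgreSQL statements, with the setting name taken from the diagram in this case study.

```typescript
// Quote an identifier so a table or column name cannot break the statement.
function quoteIdent(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Returns the DDL that puts one table behind an owner-based policy.
function rlsStatements(table: string, ownerColumn = "owner_id"): string[] {
  const t = quoteIdent(table);
  const c = quoteIdent(ownerColumn);
  // missing_ok = true plus NULLIF: an unset or empty setting becomes NULL,
  // the comparison is never true, and the query returns zero rows.
  const ctx =
    "NULLIF(current_setting('app.current_entity_id', true), '')::uuid";
  return [
    `ALTER TABLE ${t} ENABLE ROW LEVEL SECURITY`,
    // FORCE applies the policy to the table owner as well.
    `ALTER TABLE ${t} FORCE ROW LEVEL SECURITY`,
    // USING filters reads; WITH CHECK rejects writes for another owner.
    `CREATE POLICY tenant_isolation ON ${t} ` +
      `USING (${c} = ${ctx}) WITH CHECK (${c} = ${ctx})`,
  ];
}
```

Superusers and roles with the BYPASSRLS attribute still skip policies, so the application's database role should have neither.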

2. Transparent encryption for sensitive agent data

Agents store sensitive fields such as:

  • API keys
  • system prompts
  • private instructions

I implemented a transparent encryption layer:

  • encryption on write
  • decryption on read
  • integrated into data access layer
  • bound to the same session context as RLS

As a result, a mis-scoped query that somehow reaches a row returns ciphertext rather than plaintext secrets.
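
One way to bind ciphertext to an entity is authenticated encryption with the entity id as additional data. The sketch below is an assumption about how such a layer could work, not a description of the actual implementation: the cipher choice (AES-256-GCM), the `SealedSecret` shape, and the function names are all hypothetical, and key management is out of scope.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Illustrative envelope: 12-byte IV, ciphertext, and the GCM auth tag.
interface SealedSecret {
  iv: Buffer;
  ciphertext: Buffer;
  tag: Buffer;
}

// Encrypts a secret and binds it to the owning entity via GCM additional data.
function sealSecret(key: Buffer, entityId: string, plaintext: string): SealedSecret {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  cipher.setAAD(Buffer.from(entityId, "utf8"));
  const ciphertext = Buffer.concat([
    cipher.update(plaintext, "utf8"),
    cipher.final(),
  ]);
  return { iv, ciphertext, tag: cipher.getAuthTag() };
}

// Decrypts only if the caller's entity id matches the one used at write time.
function openSecret(key: Buffer, entityId: string, sealed: SealedSecret): string {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAAD(Buffer.from(entityId, "utf8"));
  decipher.setAuthTag(sealed.tag);
  // final() throws if the key, the data, or the bound entity id do not match.
  return Buffer.concat([
    decipher.update(sealed.ciphertext),
    decipher.final(),
  ]).toString("utf8");
}
```

With this design, reading a row under the wrong entity context fails at decryption even if the row itself was returned.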

3. Full end-to-end test suite on real PostgreSQL

A critical part of the system is a complete E2E test suite using a real PostgreSQL instance (not mocks).

It includes:

  • full multi-tenant isolation tests with RLS enabled
  • baseline tests with RLS disabled (comparison mode)
  • bidirectional migration testing:
    • without RLS → with RLS
    • with RLS → rollback scenario
  • full dataset migration simulation
  • validation of visibility rules per tenant

This ensures correctness not only in theory, but under real database behavior.
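
The core assertion of an isolation test can be sketched as a pure check over query results. In this illustration the rows would come from queries run against a real PostgreSQL instance under each entity's session context; the `Observation` and `AllowedOwners` shapes and the function names are hypothetical.

```typescript
// One observation = one row returned by a query run as a given entity.
interface Observation {
  queriedAs: string;
  rowId: string;
  rowOwner: string;
}

// Hypothetical allow-list: which owners each entity may legitimately read.
type AllowedOwners = Record<string, string[]>;

// Returns every observed row that the querying entity should not have seen.
function findLeaks(
  observed: Observation[],
  allowed: AllowedOwners
): Observation[] {
  return observed.filter((o) => {
    const owners = allowed[o.queriedAs] || [];
    return !owners.includes(o.rowOwner);
  });
}

// Returns expected row ids that the entity did not get back, which catches
// policies that are too strict rather than too loose.
function findMissing(
  queriedAs: string,
  expectedRowIds: string[],
  observed: Observation[]
): string[] {
  const seen = new Set(
    observed.filter((o) => o.queriedAs === queriedAs).map((o) => o.rowId)
  );
  return expectedRowIds.filter((id) => !seen.has(id));
}
```

A test passes when both functions return empty arrays for every entity, with RLS enabled and again after a rollback.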

4. Safe migration strategy for legacy data

Legacy datasets had incomplete ownership metadata.

Migration was done in three controlled phases:

  1. Backfill phase
    • infer and populate ownership fields
  2. Shadow enforcement phase
    • RLS enabled in audit mode
    • full logging of visibility results
  3. Strict enforcement phase
    • full isolation enforced at database level

Each step included consistency validation and rollback capability.
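
The shadow phase can be pictured as a comparison between what the application filter returns and what the policy would return for the same query. How the audit mode was actually implemented is not described here, so the sketch below is an assumption: the `ShadowDiff` shape and both function names are hypothetical.

```typescript
interface ShadowDiff {
  // Rows the application filter returns but the policy would hide.
  wouldHide: string[];
  // Rows the policy would return but the application filter does not.
  wouldExpose: string[];
}

// Compares row ids from the legacy path and the policy path for one query.
function shadowDiff(appVisible: string[], policyVisible: string[]): ShadowDiff {
  const app = new Set(appVisible);
  const policy = new Set(policyVisible);
  return {
    wouldHide: appVisible.filter((id) => !policy.has(id)),
    wouldExpose: policyVisible.filter((id) => !app.has(id)),
  };
}

// Strict enforcement is reasonable only once no sampled query shows drift.
function readyForStrict(diffs: ShadowDiff[]): boolean {
  return diffs.every(
    (d) => d.wouldHide.length === 0 && d.wouldExpose.length === 0
  );
}
```

A non-empty `wouldHide` usually points at rows the backfill missed, while a non-empty `wouldExpose` points at a policy that is looser than the legacy filter.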

5. Performance considerations

RLS adds overhead because policy predicates are appended to every query on a protected table and evaluated against the rows that query touches.

Mitigations:

  • strong indexing on owner_id and delegation tables
  • minimal subqueries in hot paths
  • session context resolved once per connection
  • simplified policy logic where possible

In practice, overhead remained acceptable for production workloads.

Outcome

Security model transformation

The system moved from:

  • application-enforced isolation (fragile, developer-dependent)

to:

  • database-enforced isolation (deterministic, always-on)

This eliminates the entire “forgotten filter” failure class.

Encryption impact

Sensitive agent data is now:

  • encrypted at rest
  • decrypted only through controlled access paths
  • not readable in plaintext by raw SQL that bypasses the data access layer

Migration success

The full migration was achieved:

  • without downtime
  • without flag-day cutover
  • with full validation via bidirectional tests

Testing impact (key improvement)

The most important safety guarantee is not just RLS — it is the E2E test suite on real PostgreSQL, which validates:

  • tenant isolation correctness
  • migration safety
  • rollback feasibility
  • equivalence between pre- and post-RLS behavior

This is what makes the system robust in practice.

What I would do differently

The main improvement would be to:

  • integrate RLS policy tests more directly into code review workflows
  • make policy changes require explicit multi-tenant regression checks by default

The system is already well-protected thanks to E2E testing, but policy changes are a sensitive surface area and should be even more tightly governed.

