Case study
Progressive delivery at Bedrock — Argo Rollouts for 50+ product teams
Rolled out Argo Rollouts as the standard progressive-delivery framework for 50+ product teams. Replaced health-check gates with metric-based gates (Apdex, error rate, custom KPIs). Contributed upstream support for KEDA + Gloo Gateway.
- Argo Rollouts
- ArgoCD
- Kubernetes
- Helm
- KEDA
- Gloo Gateway
- New Relic
- Apdex
- Gateway API
- GitHub Actions
TL;DR
At Bedrock Streaming (20M+ weekly viewers, 1000+ Kubernetes nodes), I rolled out Argo Rollouts as the standard progressive-delivery framework for 50+ product teams. Metric-based gates (Apdex, error rate, custom KPIs) replaced basic health checks, and I contributed upstream support so Rollouts could work with our KEDA and Gloo Gateway network stack.
Context
Bedrock operates large European streaming platforms under the M6 / RTL umbrella. Deployments are not abstract engineering exercises: when something degrades, paying users feel it immediately. Traffic is bursty, highly time-dependent (prime time, live events), and very different from staging patterns.
Before this project, deployments were driven by GitHub Actions and a set of home-grown templates that had grown organically over years. They worked, but they were fragile under change. Every team had slightly different needs. Every modification to the templates risked breaking someone else’s pipeline. The platform team had become the implicit rollout babysitter for 50+ product teams.
The only real gate was Kubernetes health checks. If a pod was Ready, the platform considered the deployment successful. But a pod can be perfectly healthy while returning 5xx for a specific code path or adding 200 ms latency to a hot endpoint. Those regressions were invisible to the deployment system.
This created a recurring pattern: a deployment would go out, metrics would slowly drift, support tickets would start to rise, dashboards would light up, and the platform team would get pulled in to coordinate rollback and diagnosis. The system did not fail fast; it failed slowly and noisily.
That model was not sustainable at Bedrock’s scale.
Constraints
Several constraints made this problem interesting:
- Real users, real money, real-time impact. No safe deployment windows.
- Teams owned their releases. The platform could not become a gatekeeper or a bottleneck.
- The network stack was not “vanilla”: we were using Gloo as an ingress/gateway instead of NGINX for advanced routing features.
- Parts of the platform relied on KEDA for event-driven autoscaling and scale-to-zero.
- The cluster footprint exceeded 1000 nodes, operated by a small core infrastructure team.
- Not all workloads were HTTP stateless services: we had session-sensitive services, Kafka/SQS workers, and complex schema migrations.
Whatever I introduced had to work across this heterogeneity without requiring application rewrites.
Decision
The key insight was to decouple deployment from CI and to move deployment strategy into Kubernetes itself.
ArgoCD became the GitOps engine. Argo Rollouts became the deployment controller.
Why Argo Rollouts?
- Native support for `AnalysisTemplates` and metric-based gates.
- No need to modify applications.
- Strong GitOps integration, fitting Bedrock’s operating model.
- CRD-driven, which meant I could standardize behavior via Helm templates.
Alternatives like Flagger or heavier CD platforms were either too opinionated, less flexible for our custom metrics, or did not fit the GitOps approach we already had.
I had two explicit goals:
- Remove the platform team from daily rollout babysitting.
- Push real ownership of release strategy back to product teams.
What I built
Helm library with sane defaults
I created a library of Helm templates wrapping Rollout resources with sensible defaults:
- Canary as the default strategy for stateless HTTP services.
- Blue/green override for session-sensitive services where v1 and v2 could not coexist.
- Plain rolling update for queue workers (Kafka/SQS) where traffic splitting makes no sense, combined with queue lag monitoring.
Teams could override parameters, but they did not have to understand Rollouts deeply to benefit from it.
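To make the interface concrete, here is a hypothetical sketch of what such a library chart's values could look like. The key names and defaults are illustrative, not Bedrock's actual templates:

```yaml
# Hypothetical values.yaml interface for the Rollout library chart.
# All keys and defaults are illustrative placeholders.
rollout:
  strategy: canary          # canary | blueGreen | rollingUpdate
  canary:
    stepWeights: [5, 25, 50]  # traffic percentages per step
    holdDuration: 10m         # pause between steps
  analysis:
    apdexThreshold: 0.9       # abort if the canary slice drops below this
    errorRatePercent: 1       # abort if 5xx rate exceeds this
```

The point of the interface is that a team only declares intent (strategy and thresholds); the chart renders the full `Rollout` resource with its steps and analysis wiring.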
Metric-based gates
The core of the system was AnalysisTemplates wired to:
- New Relic Apdex score
- 5xx error rate with progressively tightening thresholds
- Success rate
- Response time
- Optional team-specific business metrics
This replaced the naive “pod is up” gate with three dimensions:
- Is the pod running?
- Is it returning correct responses?
- Is user experience degraded?
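As a sketch, an Apdex gate can be expressed as an `AnalysisTemplate` using the Argo Rollouts New Relic metric provider. The NRQL query, threshold, and names here are illustrative, not the production templates:

```yaml
# Illustrative AnalysisTemplate: gate on the canary's Apdex score.
# Query, threshold, and secret name are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: apdex-check
spec:
  args:
    - name: app-name          # passed in by the Rollout
  metrics:
    - name: apdex
      interval: 1m
      count: 10
      # Abort the rollout if the canary's Apdex drops below 0.9
      successCondition: result.apdex >= 0.9
      provider:
        newRelic:
          profile: newrelic-credentials   # Secret holding the API key
          query: |
            SELECT apdex(duration, t: 0.5) AS 'apdex'
            FROM Transaction
            WHERE appName = '{{args.app-name}}'
            SINCE 2 minutes ago
```

Because the query is scoped to the canary's app name (or a canary-specific tag), the gate measures the new version's slice of traffic rather than the whole fleet.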
Canary step pattern
A typical pattern looked like:
- 5% traffic — hold 10 minutes
- 25% — hold 10 minutes
- 50% — hold 10 minutes
- 100%
At each step, thresholds tightened. What is acceptable at 5% is not acceptable at 50%.
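In a `Rollout` spec, that pattern maps onto canary steps roughly as follows (template name and durations are illustrative; per-step thresholds can be passed to the analysis as `args`):

```yaml
# Sketch of the step pattern as a Rollout canary strategy.
strategy:
  canary:
    steps:
      - setWeight: 5
      - pause: {duration: 10m}
      - setWeight: 25
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {duration: 10m}
      # completing the last step promotes the canary to 100%
    analysis:
      # Background analysis runs continuously during the rollout;
      # the template name is a placeholder.
      templates:
        - templateName: apdex-check
```

If any `AnalysisRun` fails at any point, the controller aborts the rollout and shifts traffic back to the stable ReplicaSet automatically.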
Stateful workloads discipline
I documented and enforced:
- Expand-and-contract schema migration pattern
- Clear rule: you don’t canary databases
- Blue/green for in-memory session services
- Queue lag monitoring for workers
This gave teams a playbook instead of tribal knowledge.
The Gloo + KEDA problem (and upstream contribution)
This is where things became non-standard.
Argo Rollouts worked perfectly with NGINX ingress. We were using Gloo as a gateway and KEDA for autoscaling. At the time, Rollouts did not have proper support for dynamically managing traffic weights through this stack.
I ended up contributing upstream to the Rollouts Gateway API plugin so it could:
- Discover routes dynamically using label selectors
- Support `HTTPRoute`, `GRPCRoute`, and `TCPRoute`
- Adjust weights automatically during canary steps
This spared us from maintaining a long-term fork and allowed progressive delivery to work on parts of the platform that were not on the “standard” ingress path.
The key idea was: instead of statically listing routes, the plugin could discover them at runtime based on labels like rollout-enabled: "true".
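A hedged sketch of what that looks like from a team's perspective: an `HTTPRoute` opts in via a label, and the Rollout delegates traffic shifting to the Gateway API plugin. The selector mechanics and exact plugin configuration keys are illustrative; check the plugin's documentation for the current schema:

```yaml
# An HTTPRoute that opts in to discovery via a label.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-service
  labels:
    rollout-enabled: "true"   # the plugin matches routes on this label
spec:
  parentRefs:
    - name: public-gateway    # illustrative Gloo-managed gateway
  rules:
    - backendRefs:
        # The plugin rewrites these weights during canary steps.
        - name: my-service-stable
          port: 80
        - name: my-service-canary
          port: 80
---
# Corresponding Rollout fragment delegating traffic routing to the plugin
# (plugin key and fields are illustrative):
# strategy:
#   canary:
#     trafficRouting:
#       plugins:
#         argoproj-labs/gatewayAPI:
#           namespace: my-namespace
```

With label-based discovery, new routes pick up progressive delivery by adding one label instead of editing a central route list.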
Onboarding teams
Onboarding was progressive:
- Staging first, with paired sessions.
- Documentation and examples.
- Production migration once teams were comfortable.
This was not a big-bang migration. Over months, teams adopted the pattern as they touched their services.
Outcome (and the two stories)
True positive — where Rollouts paid for itself
A backend service in the streaming path shipped a regression on a hot endpoint. Under real traffic, a query plan hit the database harder than expected. Staging never showed it.
In production, the canary received ~5% of traffic. The Apdex score on the canary slice dropped below threshold within minutes. The AnalysisRun aborted automatically.
The stable version continued to serve the other 95%. User impact was limited to a small slice for a few minutes.
Without Rollouts, this would have rolled out fleet-wide on a normal rolling update.
False positive — the lesson
We also had cases where canaries rolled back because an upstream dependency was degraded. The service under test correctly propagated 5xx errors from upstream; the gates fired and the rollout aborted.
The code was fine. The environment was not.
That’s when I formulated the rule:
A canary gate should measure the quality of the change you just shipped, not the health of the whole call graph.
I refined the gate queries to distinguish error origins and to exclude error classes attributable to upstream dependencies.
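As a sketch, an error-rate metric can be scoped to failures the service itself owns. The NRQL, error classes, and threshold below are illustrative placeholders, not the production queries:

```yaml
# Illustrative metric: 5xx rate excluding errors attributed to upstream
# dependencies, so the gate measures the shipped change, not the call graph.
metrics:
  - name: own-error-rate
    interval: 1m
    count: 5
    # Abort if more than 1% of transactions fail for service-owned reasons
    successCondition: result.errRate <= 1
    provider:
      newRelic:
        query: |
          SELECT percentage(count(*), WHERE error IS true
            AND error.class NOT IN ('UpstreamTimeout', 'DependencyUnavailable'))
            AS 'errRate'
          FROM Transaction
          WHERE appName = '{{args.app-name}}'
          SINCE 2 minutes ago
```

The design trade-off: a narrower gate misses some genuine regressions that surface as upstream-shaped errors, but it stops the rollout system from punishing a service for its neighbors' outages.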
Hard outcomes
- 50+ teams using progressive delivery with real metrics.
- Platform team no longer in the critical path of deployments.
- Clear discipline for stateful workloads.
- Upstream contributions so Bedrock was not carrying internal forks.
What I’d do differently
I would add gate observability from day one.
Initially, when a rollout aborted, teams saw: “AnalysisRun failed.” That created frustration. Later, I added Slack notifications with the exact metric values and the threshold that tripped.
That should have existed from the start.