
Case study

Progressive delivery at Bedrock — Argo Rollouts for 50+ product teams

Rolled out Argo Rollouts as the standard progressive-delivery framework for 50+ product teams. Replaced health-check gates with metric-based gates (Apdex, error rate, custom KPIs). Contributed upstream support for KEDA + Gloo Gateway.

Bedrock Streaming · 8 min read
  • Argo Rollouts
  • ArgoCD
  • Kubernetes
  • Helm
  • KEDA
  • Gloo Gateway
  • New Relic
  • Apdex
  • Gateway API
  • GitHub Actions

TL;DR

At Bedrock Streaming (20M+ weekly viewers, 1000+ Kubernetes nodes), I rolled out Argo Rollouts as the standard progressive-delivery framework for 50+ product teams. Metric-based gates (Apdex, error rate, custom KPIs) replaced basic health checks, and I contributed upstream support so Rollouts could work with our KEDA and Gloo Gateway network stack.

Context

Bedrock operates large European streaming platforms under the M6 / RTL umbrella. Deployments are not abstract engineering exercises: when something degrades, paying users feel it immediately. Traffic is bursty, highly time-dependent (prime time, live events), and very different from staging patterns.

Before this project, deployments were driven by GitHub Actions and a set of home-grown templates that had grown organically over years. They worked, but they were fragile under change. Every team had slightly different needs. Every modification to the templates risked breaking someone else’s pipeline. The platform team had become the implicit rollout babysitter for 50+ product teams.

The only real gate was Kubernetes health checks. If a pod was Ready, the platform considered the deployment successful. But a pod can be perfectly healthy while returning 5xx for a specific code path or adding 200 ms latency to a hot endpoint. Those regressions were invisible to the deployment system.

This created a recurring pattern: a deployment would go out, metrics would slowly drift, support tickets would start to rise, dashboards would light up, and the platform team would get pulled in to coordinate rollback and diagnosis. The system did not fail fast; it failed slowly and noisily.

That model was not sustainable at Bedrock’s scale.

Constraints

Several constraints made this problem interesting:

  • Real users, real money, real-time impact. No safe deployment windows.
  • Teams owned their releases. The platform could not become a gatekeeper or a bottleneck.
  • The network stack was not “vanilla”: we were using Gloo as an ingress/gateway instead of NGINX for advanced routing features.
  • Parts of the platform relied on KEDA for event-driven autoscaling and scale-to-zero.
  • The cluster footprint exceeded 1000 nodes, operated by a small core infrastructure team.
  • Not all workloads were HTTP stateless services: we had session-sensitive services, Kafka/SQS workers, and complex schema migrations.

Whatever I introduced had to work across this heterogeneity without requiring application rewrites.

Decision

The key insight was to decouple deployment from CI and to move deployment strategy into Kubernetes itself.

ArgoCD became the GitOps engine. Argo Rollouts became the deployment controller.

Why Argo Rollouts?

  • Native support for AnalysisTemplates and metric-based gates.
  • No need to modify applications.
  • Strong GitOps integration, fitting Bedrock’s operating model.
  • CRD-driven, which meant I could standardize behavior via Helm templates.

Alternatives like Flagger or heavier CD platforms were either too opinionated, less flexible for our custom metrics, or did not fit the GitOps approach we already had.

I had two explicit goals:

  1. Remove the platform team from daily rollout babysitting.
  2. Push real ownership of release strategy back to product teams.

What I built

  • 01 · GitOps trigger: git push → Helm chart bump → ArgoCD sync to cluster → the Argo Rollouts controller drives the canary spec.
  • 02 · Progressive traffic shift, each step gated by an AnalysisRun: 5% canary traffic (pause 10 min, AnalysisRun with strict thresholds) → 25% (pause 10 min, AnalysisRun) → 50% (pause 10 min, AnalysisRun) → 100%, promote to stable; the canary becomes stable. New Relic is queried for Apdex, 5xx rate, success rate and business KPIs.
  • 03 · Breach on any gate → auto-rollback, canary pods killed: e.g. Apdex drops by 0.05 at 5%, the fleet reverts in seconds, and 95% of traffic stays safe.
  • 04 · Traffic plane, Gloo Gateway (HTTPRoute weights): weights are set by the Argo Rollouts plugin, stable pods at 100 − canary% and canary pods at canary%; upstream contribution for Gateway API + KEDA support.

Argo Rollouts canary pipeline: each traffic step opens an AnalysisRun against New Relic; any Apdex, 5xx or business-KPI breach triggers automatic rollback.
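
Concretely, the canary spec rendered by the Helm library looked roughly like the sketch below; the names, image and durations are illustrative, but the step structure matches the pipeline above.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-api                          # illustrative service name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: my-api
          image: registry.example.com/my-api:1.2.3
  strategy:
    canary:
      canaryService: my-api-canary      # Service that receives the canary weight
      stableService: my-api-stable      # Service that keeps the remaining traffic
      steps:
        - setWeight: 5
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: newrelic-gates   # Apdex / 5xx / success-rate gates
        - setWeight: 25
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: newrelic-gates
        - setWeight: 50
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: newrelic-gates
        # after the last step the canary is promoted to stable (100%)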

Helm library with sane defaults

I created a library of Helm templates wrapping Rollout resources with sensible defaults:

  • Canary as the default strategy for stateless HTTP services.
  • Blue/green override for session-sensitive services where v1 and v2 could not coexist.
  • Plain rolling update for queue workers (Kafka/SQS) where traffic splitting makes no sense, combined with queue lag monitoring.

Teams could override parameters, but they did not have to understand Rollouts deeply to benefit from it.
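
From a team's point of view, opting in mostly meant setting a handful of chart values. The schema below is a hypothetical illustration of that interface, not the actual chart:

# values.yaml for the shared Rollout library chart (field names are hypothetical)
rollout:
  enabled: true
  strategy: canary            # default; "blueGreen" or "rollingUpdate" as overrides
  canary:
    steps:                    # teams could override the default 5 / 25 / 50 progression
      - weight: 5
        pause: 10m
      - weight: 25
        pause: 10m
      - weight: 50
        pause: 10m
  analysis:
    apdexThreshold: 0.9       # evaluated at each step
    errorRateThreshold: 1     # % of 5xx tolerated on the canary slice
    businessMetrics: []       # optional team-specific KPIs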

Metric-based gates

The core of the system was AnalysisTemplates wired to:

  • New Relic Apdex score
  • 5xx error rate with progressively tightening thresholds
  • Success rate
  • Response time
  • Optional team-specific business metrics

This replaced the naive “pod is up” gate with three dimensions:

  1. Is the pod running?
  2. Is it returning correct responses?
  3. Is user experience degraded?
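
As a sketch, one of these gates expressed with the Argo Rollouts New Relic provider could look like the following; the NRQL query, thresholds and secret name are illustrative:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: newrelic-gates
spec:
  args:
    - name: app-name                    # passed in by the Rollout at each analysis step
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      failureLimit: 1                   # a single failed measurement aborts the rollout
      successCondition: result.successRate >= 0.99
      provider:
        newRelic:
          profile: newrelic-secret      # secret holding the New Relic account ID and API key
          query: |
            FROM Transaction SELECT percentage(count(*), WHERE error IS false) AS 'successRate'
            WHERE appName = '{{args.app-name}}' SINCE 10 minutes AGO
    # Apdex, response-time and business-KPI gates follow the same shape with their own queries.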

Canary step pattern

A typical pattern looked like:

  • 5% traffic — hold 10 minutes
  • 25% — hold 10 minutes
  • 50% — hold 10 minutes
  • 100%

At each step, thresholds tightened. What is acceptable at 5% is not acceptable at 50%.
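
One way to encode that tightening (a sketch, with illustrative argument names) is to pass a different threshold to the same AnalysisTemplate at each step, with the template declaring the argument and using it in its successCondition:

steps:
  - setWeight: 5
  - pause: { duration: 10m }
  - analysis:
      templates:
        - templateName: newrelic-gates
      args:
        - name: error-rate-threshold
          value: "2"                    # looser while only 5% of traffic is exposed
  - setWeight: 50
  - pause: { duration: 10m }
  - analysis:
      templates:
        - templateName: newrelic-gates
      args:
        - name: error-rate-threshold
          value: "0.5"                  # much stricter once half the fleet runs the canary
# in the template: successCondition: result.errorRate <= {{args.error-rate-threshold}}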

Stateful workloads discipline

I documented and enforced:

  • Expand-and-contract schema migration pattern
  • Clear rule: you don’t canary databases
  • Blue/green for in-memory session services
  • Queue lag monitoring for workers

This gave teams a playbook instead of tribal knowledge.
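
For the session-sensitive services, the blue/green override is just a different strategy block on the Rollout; a minimal sketch with illustrative Service names:

strategy:
  blueGreen:
    activeService: my-api-active        # user traffic stays on the active stack
    previewService: my-api-preview      # the new version gets a preview Service for smoke tests
    autoPromotionEnabled: false         # explicit cutover, so two session stores never serve users at once
    prePromotionAnalysis:
      templates:
        - templateName: newrelic-gates  # same metric gates, evaluated before the switch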

The Gloo + KEDA problem (and upstream contribution)

This is where things became non-standard.

Argo Rollouts worked perfectly with NGINX ingress. We were using Gloo as a gateway and KEDA for autoscaling. At the time, Rollouts did not have proper support for dynamically managing traffic weights through this stack.

I ended up contributing upstream to the Rollouts Gateway API plugin so it could:

  • Discover routes dynamically using label selectors
  • Support HTTPRoute, GRPCRoute, and TCPRoute
  • Adjust weights automatically during canary steps

This spared us from maintaining a long-term fork and allowed progressive delivery to work on parts of the platform that were not on the “standard” ingress path.

The key idea was: instead of statically listing routes, the plugin could discover them at runtime based on labels like rollout-enabled: "true".
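
Conceptually, a route opted in simply by carrying that label. The HTTPRoute below is a standard Gateway API resource, and the plugin reference in the Rollout is a simplified sketch (the label-selector discovery fields are omitted rather than guessed):

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-api
  labels:
    rollout-enabled: "true"             # picked up by the plugin's label selector
spec:
  parentRefs:
    - name: gloo-gateway                # illustrative Gateway name
  rules:
    - backendRefs:
        - name: my-api-stable
          port: 8080
          weight: 95                    # rewritten by the plugin at every canary step
        - name: my-api-canary
          port: 8080
          weight: 5
---
# In the Rollout, traffic routing delegates to the Gateway API plugin:
strategy:
  canary:
    trafficRouting:
      plugins:
        argoproj-labs/gatewayAPI:
          namespace: my-namespace
          httpRoute: my-api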

Onboarding teams

Onboarding was progressive:

  1. Staging first, with paired sessions.
  2. Documentation and examples.
  3. Production migration once teams were comfortable.

This was not a big-bang migration. Over months, teams adopted the pattern as they touched their services.

Outcome (and the two stories)

True positive — where Rollouts paid for itself

A backend service in the streaming path shipped a regression on a hot endpoint. Under real traffic, a query plan hit the database harder than expected. Staging never showed it.

In production, the canary received ~5% of traffic. The Apdex score on the canary slice dropped below threshold within minutes. The AnalysisRun aborted automatically.

The stable version continued to serve the other 95%. User impact was limited to a small slice for a few minutes.

Without Rollouts, this would have rolled out fleet-wide on a normal rolling update.

False positive — the lesson

We also had cases where canaries rolled back because an upstream dependency was degraded. The service under test was correctly returning 5xx errors propagated from upstream, so the gates fired and the rollout aborted.

The code was fine. The environment was not.

That’s when I formulated the rule:

A canary gate should measure the quality of the change you just shipped, not the health of the whole call graph.

I refined the gate queries to distinguish error origins and to exclude error classes attributable to upstream dependencies.
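
In gate terms, that meant adjusting the query so errors the service tags as upstream-caused no longer count against the canary. A sketch of one such metric; the error.origin attribute is hypothetical and assumes the application annotates its errors:

- name: success-rate-excluding-upstream
  successCondition: result.successRate >= 0.99
  provider:
    newRelic:
      # error.origin is a hypothetical custom attribute that tags upstream-caused failures
      query: |
        FROM Transaction
        SELECT percentage(count(*), WHERE error IS false OR error.origin = 'upstream') AS 'successRate'
        WHERE appName = '{{args.app-name}}' SINCE 10 minutes AGO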

Hard outcomes

  • 50+ teams using progressive delivery with real metrics.
  • Platform team no longer in the critical path of deployments.
  • Clear discipline for stateful workloads.
  • Upstream contributions so Bedrock was not carrying internal forks.

What I’d do differently

I would add gate observability from day one.

Initially, when a rollout aborted, teams saw: “AnalysisRun failed.” That created frustration. Later, I added Slack notifications with the exact metric values and the threshold that tripped.
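
With the Argo Rollouts notifications engine, the basic wiring is a ConfigMap like the sketch below; enriching the message with the exact metric values and thresholds still takes extra templating, which is the part I would invest in early:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-notification-configmap
data:
  service.slack: |
    token: $slack-token                 # resolved from the notifications secret
  trigger.on-analysis-run-failed: |
    - send: [analysis-run-failed]
  template.analysis-run-failed: |
    message: |
      :rotating_light: Rollout {{.rollout.metadata.name}} aborted: an AnalysisRun gate failed.
      Inspect it with: kubectl argo rollouts get rollout {{.rollout.metadata.name}}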

That should have existed from the start.

