What Teams Actually Need Before They’ll Let Right-Sizing Act in Production 

Most Kubernetes teams know they’re overprovisioned. The dashboards show it. The recommendations confirm it. And in most environments, the list of workloads that could be right-sized today runs into the hundreds. 

In our recent survey of 321 Kubernetes practitioners at organizations with 1,000 or more employees, 89% said automation is mission-critical or very important to their operations. Nearly 60% deploy to production automatically without manual approval. 

And yet 71% require human review before applying any CPU or memory optimization to running workloads. Only 17% have reached continuous automated right-sizing in production. 

What the data kept surfacing was a trust problem. These teams weren’t anti-automation. They were wary of letting it act where the consequences are immediate, the blast radius is real, and the accountability lands on them.  

How most teams arrive at this problem 

The path to this point is familiar enough that most platform engineers will recognize it. 

Early on, resource requests and limits don’t get the attention they deserve. Teams are moving fast. Workloads get deployed with defaults that nobody revisits after the initial push. For a while, this is fine. 

Then performance issues start surfacing. The usual response is t-shirt sizing (small, medium, large) applied across workload categories. Blunt, but effective enough to reduce the worst stability problems. Cost optimization isn’t the priority yet. 

It becomes the priority when the cloud bill lands. 

T-shirt sizes can’t optimize costs at the scale most enterprises are now operating at. So teams shift to manual tuning, targeting the biggest offenders and working down the list. This approach has a ceiling that compounds quickly. In our survey, 69% of respondents said manual optimization breaks down at roughly 250 changes per day, and 54% run 100 or more clusters. Most teams working through manual tuning eventually hit that ceiling. Automation is the obvious next step. Production is where that step stops.  

Why production is the sticking point 

Automation sits unused not because teams don’t understand its value, but because production is a different category of risk, and most engineers who’ve been on call know exactly why.

When code ships via CI/CD, there’s a testing pipeline behind it. If something breaks, the rollback path is well-understood, and the blast radius is usually contained. When automation adjusts CPU or memory allocations for a running workload, the failure is immediate, visible to users, and hits the platform team before anyone has had time to diagnose it. One practitioner in the survey described the calculus this way: “The risk of a potential outage caused by automated downsizing far outweighs the guaranteed financial benefit of reducing cloud costs.” 

That trade-off explains why 71% of teams require human review before any resource optimization gets applied. Production environments carry a different kind of accountability. What makes it safe to delegate a code deploy doesn’t automatically make it safe to delegate a resource change. We asked 321 practitioners directly what would. 

What the data says builds trust 

The responses clustered around four things, and better recommendations weren’t among them. 

Visibility into how the system reasons and what it changes (48%) 

Nearly half of the respondents said visibility and transparency would move the needle most. That covers two related things: understanding what the system is recommending and why before acting on it, and being able to see exactly what changed afterward. Teams want to follow the usage patterns behind a recommendation, preview how it would shift under different optimization goals, and have an audit trail they can point to when something needs explaining. Confidence in the action depends on confidence in the reasoning. 

Proven guardrails (25%) 

A quarter said their confidence depends on guardrails they control, a configuration that bounds what automation can do regardless of what the model suggests. What teams kept emphasizing was whether those guardrails were proven in practice, not just configured in theory. Teams want to know the system cannot do something they haven’t sanctioned, and they want that to be verifiable. 

Rollback confidence (23%) 

Nearly as many said rollback was the determining factor. Not the existence of a rollback mechanism, but confidence that it will fire automatically and fast enough to contain the damage. Engineers expect failures to happen; the question is whether the system catches and corrects them before the impact compounds. 

A credible path from recommendation to action 

This fourth theme didn’t show up in a single percentage, but it ran through the qualitative responses consistently enough to be unmistakable. Teams don’t want to make a single high-stakes commitment to full automation. They want to expand trust incrementally, building on evidence at each stage before going further, which matches where most teams currently sit.

The trust curve: where most teams are, and where they’re trying to go 

Those four requirements don’t exist in isolation. They reflect a progression. Visibility tends to come first, because you need to understand what the system is doing before you can trust it to act. Guardrails come next, bounding what it can do once you do let it act. Rollback confidence follows, because at some point something will go wrong, and how the system handles that determines whether trust holds. The progressive path is what connects all three: moving through these stages incrementally rather than committing to full automation before each condition is in place.

Level 1: Advisory. Recommendations are generated, but humans decide everything. No automation acts without explicit approval. 

Level 2: Guardrailed. Automation can act within defined limits, but changes still require human review in most cases. 

Level 3: Conditional Autonomy. Automation handles a significant portion of changes, with human oversight reserved for production environments, high-stakes workloads, or large deviations from baseline. 

Level 4: Closed Loop. Continuous automated optimization across environments, with guardrails active but manual intervention rarely required. 

Most respondents sit at Level 1 or Level 2, applying recommendations manually, working case by case, or still relying on static resource settings with safety buffers. The move from Level 2 to Level 3 is where the trust gap is most active, and where the four requirements above do the most to close it. 
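
One way to make the four levels concrete is to treat them as a policy that decides whether a given change needs a human in the loop. The sketch below is illustrative only: the names (TrustLevel, needs_human_review) and the 50% "large deviation" cutoff are assumptions, not survey findings or product behavior, and Level 2 is simplified to "always review" even though the description above says "in most cases."

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    """The four levels described above (names are illustrative)."""
    ADVISORY = 1
    GUARDRAILED = 2
    CONDITIONAL_AUTONOMY = 3
    CLOSED_LOOP = 4


def needs_human_review(
    level: TrustLevel,
    is_production: bool,
    high_stakes: bool,
    deviation_pct: float,
    large_deviation_pct: float = 0.5,  # assumed cutoff, tune per team
) -> bool:
    """Return True if a proposed change should wait for human approval."""
    # Levels 1 and 2: a human approves every change (simplified for Level 2).
    if level <= TrustLevel.GUARDRAILED:
        return True
    # Level 3: review only production, high-stakes, or large changes.
    if level == TrustLevel.CONDITIONAL_AUTONOMY:
        return is_production or high_stakes or deviation_pct >= large_deviation_pct
    # Level 4: guardrails stay active, but no routine review.
    return False
```

Under these assumptions, a 10% change to a staging workload at Level 3 goes through without review, while the same change in production still waits for approval.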

Before you enable auto-apply in production: a readiness checklist 

Knowing what builds trust and actually having those conditions in place are different things. The checklist below makes that concrete, with specific questions worth working through before you expand automation into any new environment or workload tier. If several items are unresolved, that’s typically where hesitation concentrates, and where it’s worth investing before moving forward. 

Guardrails 

  • Have you configured an optimization goal per workload type, rather than applying a single default across everything? 
  • Do you have minimum and maximum resource thresholds set so that automation cannot recommend values outside acceptable bounds, regardless of what the usage data suggests? 
  • Have you set a change threshold (a minimum delta) so small, noisy adjustments don’t trigger unnecessary deploys? 
  • For large recommended changes, is incremental rollout configured so significant adjustments are applied in steps rather than all at once? 
  • Have you decided on an opt-in vs. opt-out model per environment, and does it reflect how your teams actually prefer to work? 
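
The guardrail questions above can be read as three checks applied in order: clamp the recommendation to sanctioned bounds, ignore changes below a minimum delta, and cap how much is applied in one step. Here is a minimal sketch of that logic, assuming a single integer resource value (for example, CPU in millicores) and a positive current value. The Guardrails class, its field names, and next_value are hypothetical, not any tool's actual configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    """Hypothetical guardrail settings for one resource on one workload."""
    min_value: int         # lowest value automation may set
    max_value: int         # highest value automation may set
    min_change_pct: float  # skip changes smaller than this share of current
    max_step_pct: float    # largest share of current applied in one step


def next_value(current: int, recommended: int, g: Guardrails) -> int:
    """Return the value to apply now, or `current` if nothing should change."""
    # 1. Clamp the recommendation into the sanctioned range.
    target = max(g.min_value, min(g.max_value, recommended))

    # 2. Skip small, noisy adjustments below the change threshold.
    delta = target - current
    if abs(delta) < current * g.min_change_pct:
        return current

    # 3. Apply large changes incrementally rather than all at once.
    max_step = int(current * g.max_step_pct)
    if abs(delta) > max_step:
        delta = max_step if delta > 0 else -max_step
    return current + delta


g = Guardrails(min_value=100, max_value=2000, min_change_pct=0.10, max_step_pct=0.25)
print(next_value(1000, 200, g))  # large cut, applied as one 25% step: 750
print(next_value(1000, 950, g))  # below the 10% threshold, unchanged: 1000
```

Calling next_value again on each cycle walks the workload toward the clamped target in bounded steps, which leaves room to observe health between adjustments.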

Visibility 

  • Can you see current resource values alongside recommended values in a single view, with the usage patterns behind the recommendation? 
  • Does the system show you P95, P75, and average usage (not just a single line) so you can assess how a workload behaves across its full range? 
  • Can you preview what a recommendation would look like under a different optimization profile before committing to a configuration change? 
  • Is there an audit trail (annotations on the workload, applier logs) so your team can see what changed, when, and why? 
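
To see why a single usage line can mislead, it helps to compute the three figures side by side. The sketch below uses a simple nearest-rank percentile over raw samples. This is one common definition chosen for illustration, not necessarily how any particular tool computes its percentiles, and the sample values are made up.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def usage_summary(samples: list[float]) -> dict[str, float]:
    """Summarize usage the way the checklist suggests: average, P75, P95."""
    return {
        "avg": mean(samples),
        "p75": percentile(samples, 75),
        "p95": percentile(samples, 95),
    }


# Illustrative data: mostly idle at 100m CPU, with two short spikes.
cpu_millicores = [100] * 18 + [900, 1000]
print(usage_summary(cpu_millicores))
```

For this made-up workload the average is 185 and P75 is 100, while P95 is 900. A request sized from the average alone would leave no headroom for the spikes, which is the kind of gap the checklist question is meant to surface.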

Rollback confidence 

  • Does the system detect workload health degradation after applying a change and automatically roll back, without requiring manual intervention? 
  • Are fast-fail conditions configured for infrastructure-level rejections (ResourceQuota violations, LimitRange failures) so rollback triggers immediately rather than waiting for a health check timeout? 
  • If an OOM event occurs after a change, does the system respond immediately, or does it wait until the next recommendation window? 
  • Have you tested the rollback path in a non-production environment so you’ve seen it work before you need it in production? 
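
The rollback questions above describe a decision order: fail fast on rejections that waiting cannot fix, react to an OOM event right away, and otherwise give the workload a bounded window to become healthy. A minimal sketch of that order follows, assuming the status fields have already been gathered from the cluster. PostChangeStatus, decide, the reason strings, and the 300-second timeout are all hypothetical names and values, not a real controller's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    KEEP = "keep"
    WAIT = "wait"
    ROLLBACK = "rollback"


# Illustrative labels for rejections that will not resolve by waiting.
FAST_FAIL_REASONS = {"ResourceQuotaExceeded", "LimitRangeViolation"}


@dataclass(frozen=True)
class PostChangeStatus:
    """Hypothetical snapshot of a workload after a resource change."""
    rejection_reason: Optional[str]  # set if the change was rejected
    oom_killed: bool                 # any container OOM-killed since the change
    ready_replicas: int
    desired_replicas: int
    seconds_since_change: int


def decide(status: PostChangeStatus, health_timeout_s: int = 300) -> Action:
    """Choose what to do with a change that has just been applied."""
    # Infrastructure-level rejection: roll back without waiting for a timeout.
    if status.rejection_reason in FAST_FAIL_REASONS:
        return Action.ROLLBACK
    # An OOM event is acted on immediately, not at the next recommendation window.
    if status.oom_killed:
        return Action.ROLLBACK
    # All replicas ready: the change holds.
    if status.ready_replicas >= status.desired_replicas:
        return Action.KEEP
    # Still unhealthy after the allowed window: roll back automatically.
    if status.seconds_since_change >= health_timeout_s:
        return Action.ROLLBACK
    return Action.WAIT
```

Exercising a function like this against deliberately broken changes in a non-production environment is one way to answer the last question on the list before relying on it in production.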

Progressive path 

  • Have you started in read-only mode and spent enough time observing recommendations before enabling any automation? 
  • Are you starting with dev or staging before enabling production automation? 
  • Have you established a schedule cadence (weekly before daily) rather than enabling continuous optimization from the start? 
  • Can you enable automation per namespace or per workload, so you’re not making a cluster-wide commitment before you’re ready? 
  • For workloads you’re uncertain about, can you hold them in recommendation-only mode while automating everything else? 
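
Scoping automation per namespace or per workload amounts to a lookup where the most specific setting wins. A minimal sketch, assuming modes are held in plain dictionaries; the Mode enum, resolve_mode, and the namespace and workload names are hypothetical and do not reflect a specific product's configuration format.

```python
from enum import Enum


class Mode(Enum):
    RECOMMEND_ONLY = "recommend-only"
    AUTO_APPLY = "auto-apply"


def resolve_mode(
    namespace: str,
    workload: str,
    cluster_default: Mode,
    namespace_modes: dict[str, Mode],
    workload_modes: dict[tuple[str, str], Mode],
) -> Mode:
    """Most specific setting wins: workload, then namespace, then cluster default."""
    key = (namespace, workload)
    if key in workload_modes:
        return workload_modes[key]
    return namespace_modes.get(namespace, cluster_default)


# Start read-only everywhere, automate staging, hold one uncertain workload back.
cluster_default = Mode.RECOMMEND_ONLY
namespace_modes = {"staging": Mode.AUTO_APPLY}
workload_modes = {("staging", "payments-db"): Mode.RECOMMEND_ONLY}

for ns, wl in [("staging", "web"), ("staging", "payments-db"), ("prod", "web")]:
    print(ns, wl, resolve_mode(ns, wl, cluster_default, namespace_modes, workload_modes).value)
```

Keeping the cluster default at recommendation-only means a new namespace never gets automation by accident; it has to be opted in explicitly.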

If you can answer yes to most of these, you’re in a reasonable position to move forward. The gaps you find are worth addressing directly; they’re usually the same conditions that cause automation to stall or get rolled back after the fact. 

The full research behind this post is in the CloudBolt Industry Insights report, The Kubernetes Automation Trust Gap No One Talks About, which covers where enterprise teams currently sit on this journey and what’s keeping most of them from moving further. 

AUTHOR
Joanne Chu