KubeCon 2025: Three Things This Year’s Conversations Told Me About Kubernetes Optimization
One of the most interesting things about going to KubeCon every year is how familiar the conversations about resource management still are.
I’ve been to a few now, and the pattern doesn’t really change much. When people find us at the booth and we start talking about resource optimization, they immediately understand the problem. Then they walk through almost the same story:
- When you’re just getting started with Kubernetes, resource management doesn’t feel urgent. You’re not spending much, and the main goal is just to get things running.
- As you scale up, reliability and stability issues start to show up first.
- The quick fix is to throw more resources at everything—t-shirt sizing, generous requests, lots of headroom.
- Eventually the cloud bill (or an internal “why is this so expensive?” conversation) forces you to look more closely.
- Someone tries to manually tune requests and limits (a sketch of the settings involved follows this list). That works for a while, until the number of services and clusters makes it impossible to keep up.
- Only then do most teams seriously consider automation.
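To make the “manually tune requests and limits” step concrete, here’s a minimal Go sketch of the per-container settings people end up editing by hand, using the standard k8s.io/api types. The values are placeholders, not recommendations.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// The two knobs being tuned by hand for every container: requests
	// (what the scheduler reserves for the container) and limits
	// (the hard ceiling enforced at runtime). Placeholder values only.
	resources := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("250m"),
			corev1.ResourceMemory: resource.MustParse("512Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("512Mi"),
		},
	}

	fmt.Printf("cpu request: %s, memory limit: %s\n",
		resources.Requests.Cpu(), resources.Limits.Memory())
}
```

Multiply that stanza by every container, service, and cluster, and the “impossible to keep up” part of the story becomes clear.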
What struck me this year is how little that fundamental pattern has changed. To me, that suggests two things:
- Resource management is a foundational problem, not a passing concern. Every wave of Kubernetes adoption seems to re-discover it.
- The hardest part now isn’t that the technology doesn’t exist. A lot of the challenge is organizational and cultural: who owns the problem, who trusts automation, and who is allowed to make platform-level decisions.
If I had to sum up my own takeaway from KubeCon 2025, it would be something like this:
Kubernetes and its ecosystem are increasingly well-equipped for continuous optimization. Organizations, on the other hand, are still catching up.
Below are three signals from this year’s show that, to me, support that view.
Signal #1: The Hardest Part Is Politics, Not Kubernetes
From a technical standpoint, I don’t run into many people anymore who think you can solve resource management at scale with manual tuning alone. Once you’re operating real workloads in production, across a bunch of namespaces and clusters, it’s pretty obvious that humans won’t keep every container sized correctly by hand.
But in conversation after conversation, the friction wasn’t about whether optimization can be automated. It was about whether people are comfortable handing control over to a system that does it for them.
Roughly speaking, what I see looks like this:
- Platform / IT teams understand the resource problem well and are under real cost and reliability pressure.
- Developers own the applications and are rightly paranoid about anything that might jeopardize availability.
Developers usually don’t start from, “How can I optimize Kubernetes?” They start from, “My app must not go down.” Many of them don’t want to become Kubernetes experts; they just want enough resources that they never have to think about it.
So when we talk about StormForge by CloudBolt, or any other system that will automatically change CPU and memory for their workloads, the initial reaction is often skepticism:
- What if there’s a bug?
- What if it takes away resources at the wrong time?
- What if that causes an outage and my team is on the hook?
Platform teams, meanwhile, often don’t have the authority to simply say, “This is how we run workloads now. We’re turning on rightsizing.” They have to do a kind of internal sales job:
- Prove, in detail, that the system won’t starve workloads.
- Explain how it works to people who don’t really want to spend cognitive energy on resource internals.
- Meet a standard of proof that is sometimes higher than what would be required for manual changes.
At Acquia, one of our customers, the rollout looked very different. Will Reed, a principal engineer there, talked about this at our KubeCon happy hour. They have a central team with enough organizational clout to make platform-level decisions. When they decided to adopt StormForge, they could essentially say, “This is part of the platform now.” That let them bypass a lot of the developer-by-developer distrust that slows other organizations down.
Most of the enterprises I spoke to at KubeCon aren’t set up that way. The platform or central architecture team owns the platform and the cost, but they still need developer sign-off or at least acceptance to move forward.
My sense is that this is one of the biggest practical blockers to wider adoption of resource automation: The technical mechanisms for safe optimization exist. The harder work is building enough trust and alignment that teams will actually let those mechanisms run.
Signal #2: Once Teams Trust Automation, They Want It Everywhere
Funnily enough, I also observed an interesting contrast to that fear.
Once people understand what StormForge does for CPU and memory—especially when we can walk through real examples of how it behaves under load—the conversation shifts pretty quickly from “Can I trust this?” to “Can you do this for other things too?”
Compared to previous KubeCons, I felt like there were more questions this year about additional resource types. This is anecdotal; I didn’t count. But the pattern felt different enough that it stood out.
In particular, people asked about:
- GPUs and other specialized processors (which aligns with the growth in AI/ML workloads on Kubernetes)
- Persistent volumes and storage (“I allocate 100 GB, the app tops out at 10 GB. Is there an opportunity to right-size that?”)
- Other processor types that don’t fit neatly into the usual CPU/memory buckets
In other words, once teams see that continuous optimization for CPU and memory is tractable and safe, their appetite expands. The bottleneck is no longer whether there’s a problem to solve; it’s whether they can get through the initial rollout and build confidence in the system.
This is one of the reasons we’re paying attention to Dynamic Resource Allocation (DRA) in Kubernetes. DRA is a framework that makes it possible to describe and allocate arbitrary resource types. CPU and memory have been treated specially since the beginning—they’re baked in very deeply. DRA is about everything else:
- GPUs
- Network interface cards (NICs)
- And, in theory, resource types that haven’t been widely deployed yet
From my perspective, that’s important for a couple of reasons:
- It gives the platform a standard way to express “this node has X of resource Y, this pod needs Z of resource Y” for any resource type (a simplified sketch of that idea follows this list).
- It means we can build optimization on top of a common abstraction rather than writing one-off integrations for each new resource class.
- It keeps what we’re doing aligned with how Kubernetes itself is evolving, instead of fighting the grain of the platform.
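To make the “common abstraction” point concrete, here’s a deliberately simplified Go sketch. The types are hypothetical illustrations, not the actual DRA API (which works through objects like DeviceClass, ResourceClaim, and ResourceSlice); the point is that once capacity and demand share one vocabulary, the same sizing logic can apply to any resource type.

```go
package main

import "fmt"

// Hypothetical types for illustration only; not the real DRA API.
// The idea: "this node has X of resource Y, this pod needs Z of resource Y"
// can be expressed the same way whether Y is CPU, a GPU model, or a NIC.
type ResourceName string

type NodeCapacity map[ResourceName]int64 // what a node advertises
type PodDemand map[ResourceName]int64    // what a workload claims

// fits reports whether the node's advertised capacity covers the pod's
// demand for every resource type the pod asks for.
func fits(node NodeCapacity, pod PodDemand) bool {
	for name, want := range pod {
		if node[name] < want {
			return false
		}
	}
	return true
}

func main() {
	node := NodeCapacity{
		"cpu-millicores":        8000,
		"example.com/gpu":       2,
		"example.com/sriov-nic": 4,
	}
	pod := PodDemand{
		"cpu-millicores":  500,
		"example.com/gpu": 1,
	}
	fmt.Println("pod fits on node:", fits(node, pod)) // true
}
```

The real API is much richer than a map of integers, but the question an optimizer has to keep answering is essentially this comparison, applied continuously and per resource type.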
As Kubernetes matures, I expect optimization to become more “multi-resource” by default. The interesting question isn’t whether that’s possible; it’s how quickly organizations will adjust their platform strategies and governance to take advantage of it.
Signal #3: AI Is Everywhere, But It Means Very Different Things
It probably won’t surprise anyone that AI was all over the session list and the sponsor hall. What I found interesting was that there were really two very different AI conversations happening.
On the one hand, there’s what I’d call the “AI front-end” story:
- AI-powered dashboards
- “Chat with your cluster” interfaces
- Generic “copilot for Kubernetes” branding
On the other hand, there’s the “AI as workload” story:
- Talks on how to actually run ML and LLM workloads on Kubernetes
- GPU scheduling and sharing
- Kernel-level and networking considerations driven by those workloads
Those talks were much closer to what we care about in the context of resource management. When you’re running substantial AI workloads, you don’t really have the option to be careless about resources. The stakes, both in terms of cost and reliability, are higher.
What This Means If You’re Running Kubernetes at Scale
Taken together, these three signals paint a picture that’s probably not surprising, but I think it’s useful to make explicit: The Kubernetes ecosystem is steadily adding the primitives you need for continuous, multi-resource optimization. The gap is mostly on the human side: structure, trust, and prioritization.
If you see yourself in the conversations I’ve summarized, here are a few practical suggestions:
- Treat optimization as a platform concern, not just an ad-hoc project.
You’ll have more success if a central group is responsible for how automation is configured and rolled out, rather than trying to run a bunch of isolated experiments that never quite become “the way we do things.”
- Make developer trust a first-class design goal.
Don’t just turn something on and hope for the best. Show how it behaves under load. Capture before/after metrics, and make them visible in a way that’s meaningful to app owners: what changed, why it changed, and what happened to reliability.
- Assume you’ll eventually care about more than CPU and memory.
Even if that’s where you start, it’s worth keeping an eye on projects like DRA and thinking ahead about GPUs, storage, and other resource types. It’s easier to evolve a strategy that anticipated this than to bolt it on later.
- Be thoughtful about where you apply AI.
There’s a lot of interest in chat-based interfaces right now, and they can be helpful. But the real leverage, in my view, comes from applying machine learning and automation to actually change how resources are requested and used.
KubeCon didn’t reveal a brand-new problem this year. What it did, at least for me, was reinforce that resource management is still a fundamental issue for Kubernetes users and that the main obstacles to solving it are increasingly organizational, not technical.