Kubernetes Cost Optimization

Kubernetes clusters often incur costs 30-60% higher than necessary due to resource overprovisioning and poor scheduling decisions. Applications request more CPU and memory than they consume, leaving nodes underutilized while appearing fully allocated.

Organizations that understand Kubernetes cost optimization and implement techniques to reduce this waste can drastically reduce their cloud and IT expenses. 

This article explains how organizations can use Kubernetes cost optimization techniques to reduce waste with proper resource sizing, intelligent autoscaling, and cost-effective node selection. We’ll also explore how to implement systematic cost controls that keep Kubernetes resources running efficiently and affordably quarter after quarter. 

While the amount of control teams have over their Kubernetes costs is directly correlated with where they are on the Kubernetes cost control spectrum, every team using Kubernetes in production can benefit from the best practices in this article.

Summary of key Kubernetes cost optimization best practices

The table below summarizes seven fundamental Kubernetes cost optimization best practices that this article will explore in detail. 

Best practice | Description
Properly configure resource requests and limits | Configuring resource specifications to avoid overprovisioning while maintaining performance
Apply cluster autoscaling | Automatically adjusting cluster size based on workload demands to reduce idle resource costs
Ensure pods are right-sized | Analyzing and adjusting container resource allocations based on actual usage patterns
Select the right node type and size for the job | Choosing appropriate node types and sizes to optimize the cost-performance ratio
Implement multi-tenancy optimization and namespace-level cost control | Efficiently sharing cluster resources across multiple applications or teams by applying ResourceQuotas and PodDisruptionBudgets
Monitor costs and attribute Kubernetes expenses | Monitoring and attributing Kubernetes costs to specific workloads, teams, or projects
Leverage spot instances and discounts | Leveraging cost-saving pricing models for non-critical or fault-tolerant workloads

Understanding Kubernetes resource costs

Kubernetes costs accumulate differently than traditional infrastructure because expenses occur at the node level while applications operate at the pod level. A typical organization runs dozens of nodes, each costing $200-500 per month, but determining which specific applications drive those costs proves nearly impossible without proper tooling.

The core problem stems from Kubernetes’ three-layer resource model: requests (guaranteed allocation), limits (maximum allowed usage), and actual consumption. When pods request more resources than they use, nodes become underutilized even though they appear “full” from Kubernetes’ perspective.

Consider the following example: a 4-core node running four pods, each requesting 1 core but consuming only 0.2 cores, wastes 80% of its CPU capacity while still being considered fully allocated. Kubernetes schedules based on requests, not actual usage, so the scheduler treats this node as having no available capacity for additional workloads.

# Pod requests 1000m CPU but only uses ~200m
resources:
  requests:
    cpu: "1000m"    # Kubernetes reserves full core
    memory: "2Gi"
  limits:
    cpu: "2000m"
    memory: "4Gi"
# Actual usage: 200m CPU, 800Mi memory

This overprovisioning multiplies across hundreds of workloads. Engineering teams estimate resource requirements conservatively to avoid performance issues, but these safety margins compound into significant waste. A cluster with 50 overprovisioned pods can easily waste $2,000-5,000 monthly in unused capacity.


Why traditional cost management fails

Cloud cost management typically relies on tagging and associating resources with specific projects or teams. Kubernetes breaks this model because it serves multiple dynamic workloads simultaneously on shared infrastructure. You cannot simply tag a node with “Team A” when it runs pods from five teams throughout the day.

Resource sharing creates additional complications through Kubernetes’ Quality of Service classes. Guaranteed pods (requests equal limits) get scheduling priority and eviction protection. Burstable pods (requests less than limits) can consume extra resources when available, but they face eviction during periods of resource pressure. Best-Effort pods (no requests or limits) utilize leftover capacity but are terminated first during contention.
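
As a point of reference, a pod's QoS class follows directly from its resource specification. The minimal sketch below, with hypothetical names and images, shows the two most common cases:

apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
    - name: api
      image: example.com/api:1.0            # hypothetical image
      resources:
        requests: {cpu: 500m, memory: 1Gi}
        limits:   {cpu: 500m, memory: 1Gi}  # requests equal limits -> Guaranteed
---
apiVersion: v1
kind: Pod
metadata:
  name: burstable-example
spec:
  containers:
    - name: worker
      image: example.com/worker:1.0         # hypothetical image
      resources:
        requests: {cpu: 250m, memory: 512Mi}
        limits:   {cpu: 1000m, memory: 1Gi} # requests below limits -> Burstable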

These QoS interactions mean that cost attribution requires understanding which pods run where and how their resource specifications affect scheduling decisions and resource utilization patterns across the entire cluster.

Resource waste patterns

The most common waste patterns include memory overprovisioning for Java applications that never reach their heap limits, CPU overallocation for web servers that spend most time waiting for I/O operations, and development workloads consuming production-grade resources during off-hours.

Monitoring tools like Prometheus reveal these patterns through metric comparison. Applications consistently using 200m CPU while requesting 1000m represent clear optimization opportunities. However, identifying and fixing these inefficiencies requires systematic analysis across all workloads rather than ad-hoc investigation.
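
As a sketch of that systematic analysis, the Prometheus recording rule below computes the ratio of actual CPU usage to CPU requests per container. It assumes the Prometheus Operator's PrometheusRule CRD is installed and that cAdvisor and kube-state-metrics metrics are being scraped; metric names can vary between kube-state-metrics versions:

# Recording rule sketch: CPU usage relative to CPU requests per container
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-request-utilization
spec:
  groups:
    - name: cost-optimization
      rules:
        - record: namespace_container:cpu_request_utilization:ratio
          expr: |
            sum by (namespace, container) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
            /
            sum by (namespace, container) (kube_pod_container_resource_requests{resource="cpu"})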

Teams can address these waste patterns through three distinct optimization layers. First, configure proper resource requests and limits at the pod level. Second, implement intelligent autoscaling to match infrastructure capacity with demand. Third, select cost-effective node types that align with workload characteristics. Each layer builds upon the previous one to eliminate different sources of waste. Let’s review each of them in detail.

Resource requests and limits optimization

Setting appropriate CPU and memory requests requires analyzing usage patterns over representative periods. Start by monitoring workloads for 2-4 weeks, including peak traffic periods and special events. Set requests at the 80th percentile of observed usage plus a 10-20% buffer for traffic spikes.

For CPU specifications, use millicores for precision. Instead of requesting full cores, specify 250m (quarter core) or 500m (half core) based on actual needs. Consider the difference between burstable and guaranteed Quality of Service classes. 

Guaranteed QoS applies when requests equal limits for every container in the pod. These pods have the lowest eviction priority during resource pressure and are terminated last when nodes experience resource contention. However, this protection may waste resources during periods of low usage.

Memory requires more careful consideration because it’s non-compressible. When containers exceed memory limits, Kubernetes terminates them with OOMKilled errors, rather than throttling them like CPU resources. Common best practices for memory resource management include monitoring heap usage for JVM applications, accounting for garbage collection overhead, and including buffers for unexpected memory spikes.

A typical production configuration might look like:

resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "1Gi"

This configuration provides a 2:1 ratio between limits and requests, allowing for burst capacity while preventing resource hoarding. The goal here is to balance conservative requests that support efficient scheduling with limits that avoid resource contention.

Autoscaling strategies for Kubernetes cost optimization

Kubernetes autoscaling reduces costs by matching infrastructure capacity to actual demand. The Cluster Autoscaler automatically provisions nodes when pods remain pending due to resource constraints and removes underutilized nodes when demand decreases.

Kubernetes offers three complementary autoscaling approaches that work together to optimize costs while maintaining performance. 

  • Cluster autoscaling adjusts the underlying infrastructure capacity.
  • Horizontal autoscaling manages the number of pod replicas.
  • Vertical autoscaling optimizes individual pod resource allocations.

The sections below explore each of these techniques in detail. 


Cluster autoscaling

Cluster autoscaling matches infrastructure capacity to workload demand by automatically adding or removing nodes based on resource requirements. When pods cannot be scheduled due to insufficient resources, cluster autoscaling provisions additional nodes. Conversely, the system removes nodes to reduce costs when they become underutilized.

Cluster Autoscaler operates by monitoring unschedulable pods and node utilization levels. Configure scale-down policies carefully using parameters like scale-down-utilization-threshold (typically 50-60%) to determine when nodes are considered underutilized, and scale-down-unneeded-time (usually 10-15 minutes) to set grace periods before node removal. The Cluster Autoscaler works with pre-defined node groups, which can limit flexibility in instance type selection.
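
For illustration, these policies map to command-line flags on the Cluster Autoscaler deployment. A container spec excerpt might look like the following; the flags are real Cluster Autoscaler options, while the values shown are illustrative rather than prescriptive:

# Excerpt from a cluster-autoscaler Deployment manifest
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0   # pick a version matching your cluster
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --scale-down-utilization-threshold=0.5   # nodes below 50% utilization become scale-down candidates
      - --scale-down-unneeded-time=10m           # grace period before an underutilized node is removed
      - --scale-down-delay-after-add=10m         # avoid thrashing right after a scale-up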

Karpenter provides more sophisticated cluster autoscaling through just-in-time provisioning and intelligent instance selection. Instead of pre-defined node groups, Karpenter analyzes pending pod requirements and selects optimal instance types from the entire available catalogue. This approach often reduces costs by 10-20% compared to fixed node group configurations. Karpenter’s consolidation feature continuously optimizes node utilization by moving workloads to more cost-effective instances and terminating underutilized nodes. 

Horizontal autoscaling

Horizontal autoscaling adjusts the number of pod replicas based on observed metrics, allowing applications to handle varying load levels efficiently. This scaling approach maintains consistent per-pod resource allocation while adding or removing instances to match demand.

Horizontal Pod Autoscaler (HPA) scales workloads based on CPU utilization, memory usage, or custom metrics. The HPA works in coordination with cluster autoscaling—when the HPA scales pods up due to increased demand, the Cluster Autoscaler provisions additional nodes if existing capacity is insufficient. Underutilized nodes become candidates for removal when the HPA scales down during low-demand periods. This coordination between pod-level and node-level scaling creates comprehensive cost optimization. 
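
A minimal HPA manifest targeting average CPU utilization might look like this sketch, where the Deployment name is hypothetical:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api              # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70% of requests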

Kubernetes Event-Driven Autoscaler (KEDA) extends the HPA’s capabilities by enabling scaling based on external metrics and events rather than just CPU and memory utilization. KEDA can scale workloads based on queue depth, database connections, custom application metrics, or scheduled events. This event-driven approach enables more precise scaling decisions, aligning resource consumption more closely with actual business demand. For example, scaling based on message queue length ensures processing capacity matches workload requirements more accurately than CPU-based scaling alone. 
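
As a sketch, a KEDA ScaledObject scaling a worker Deployment on SQS queue depth could look like the following; the queue URL and Deployment name are hypothetical, and the TriggerAuthentication needed for AWS credentials is omitted for brevity:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler
spec:
  scaleTargetRef:
    name: queue-worker               # hypothetical Deployment to scale
  minReplicaCount: 0                 # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs   # hypothetical queue
        queueLength: "100"           # target messages per replica
        awsRegion: "us-east-1"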

Vertical autoscaling

Vertical autoscaling optimizes individual pod resource allocations by analyzing usage patterns and adjusting CPU and memory requests and limits. Unlike horizontal scaling, which changes the number of instances, vertical scaling modifies the resources allocated to each instance to better match actual consumption patterns.

The Vertical Pod Autoscaler (VPA) provides automated recommendations for resource optimization through three operational modes: Off (recommendations only), Initial (sets resources at pod creation), and Auto (updates running pods with restart requirements). 
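
A typical starting point is running the VPA in recommendation-only mode, as in this sketch (the target Deployment is hypothetical):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api                    # hypothetical workload
  updatePolicy:
    updateMode: "Off"                # recommendations only; no pod restarts
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu", "memory"]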

VPA supports multi-container pods and offers valuable insights into resource usage patterns. However, it has restart requirements for applying changes in Auto mode. Additionally, recommendation accuracy can be limited in complex production scenarios. These factors make VPA insufficient for comprehensive cost optimization in high-load production environments.

While technically VPA is an autoscaler, what it does in practice is pod right-sizing. Let’s dive into that next.

Pod right-sizing techniques for Kubernetes cost optimization

Resource right-sizing addresses one of the most common inefficiencies in Kubernetes environments: the disconnect between what applications request and what they actually consume. When developers deploy workloads, they typically estimate resource requirements conservatively, often requesting significantly more CPU and memory than necessary to avoid performance issues. However, Kubernetes schedules pods based on these resource requests rather than actual usage. This means overprovisioned applications consume node capacity even when running at low utilization.

The right-sizing process

Effective optimization requires tracking actual resource usage over weeks or months to set accurate resource specifications. Industry best practices recommend percentile-based analysis rather than simple averages, which can be misleading for applications with irregular workloads. Use percentile analysis to guide your resource settings:

Setting resource requests

Set requests based on the 80th percentile of observed CPU and memory usage to handle everyday operations effectively.

requests:
  cpu: "200m"       # Based on 80th percentile: 160m + 25% buffer
  memory: "256Mi"   # Based on 80th percentile: 200Mi + 25% buffer

Let’s consider an example where monitoring data shows the application typically uses 160m CPU and 200Mi memory. Here we’ve added a 25% buffer to account for variability.

Configuring resource limits

Set limits at the 95th percentile to accommodate traffic spikes without wasting resources during normal operations. 

limits:
  cpu: "400m"       # 95th percentile observed usage
  memory: "512Mi"   # 95th percentile observed usage

Here, let’s consider monitoring shows that 95% of the time the application’s resource usage stays within 400m of CPU and 512Mi of RAM. Setting limits at these values means we will accommodate most traffic patterns while preventing excessive resource consumption during the rare extreme spikes.

This approach aligns with the VPA’s general methodology, although the VPA uses more sophisticated algorithms that incorporate safety margins, confidence intervals, and historical variance rather than simple percentile cutoffs. The VPA also keeps usage samples for eight days in a “decaying histogram,” where older samples contribute less than fresh ones, and then applies percentile analysis with safety margins.

Manual approaches with open-source tools

The most common approach involves deploying open-source tools (like Prometheus) to collect resource metrics over representative time periods, typically 2-4 weeks, to capture various operational scenarios, including peak traffic and batch processing windows. Grafana serves as your analysis platform, enabling you to visualize usage patterns and identify optimization opportunities across your workload portfolio.

The analysis workflow involves querying historical metrics to calculate percentile distributions for each application, then updating resource specifications based on observed consumption patterns. For instance, you might discover that a service requesting 2 CPU cores consistently operates at 400 millicores at the 80th percentile and peaks at 800 millicores at the 95th percentile. This suggests an opportunity to reduce requests to 500 millicores with limits at 1 CPU core.

The VPA enhances this process by automatically analyzing usage patterns and generating optimization recommendations. Running the VPA in recommendation mode provides valuable insights without automatically applying changes, allowing you to validate suggestions against your manual analysis. This approach works well for relatively static workloads or workloads without an HPA.

Automated pod right-sizing optimization 

While manual analysis captures point-in-time snapshots of resource usage, ML-powered systems like StormForge understand that applications evolve continuously, with resource needs influenced by code changes, traffic patterns, and business cycles.

The platform’s pattern recognition capabilities operate across multiple dimensions simultaneously, identifying correlations between different resource types, temporal usage patterns, and performance characteristics that manual analysis typically cannot detect or efficiently address. 

StormForge’s continuous adaptation eliminates the common problem of optimization drift, where manually tuned applications gradually return to wasteful configurations as their behaviour changes over time. The platform automatically adjusts resource specifications as applications evolve, maintaining optimal efficiency without requiring ongoing manual intervention.

Multi-objective optimization significantly improves on manual methods that usually focus only on reducing costs. StormForge optimizes for cost, performance, and reliability simultaneously, maintaining SLA compliance while reducing expenses.

Cost-effective node selection

Node selection strategy directly impacts your cost efficiency because even perfectly right-sized pods can waste money if they’re running on inappropriate infrastructure. The principle here is straightforward: match your workload characteristics to instance capabilities that provide the best price-performance ratio for your specific use cases.

Instance type considerations and workload matching

Cloud providers offer specialized instance families designed for different resource patterns, and understanding these distinctions helps you make cost-effective placement decisions. Compute-optimized instances like c5 or c6i families provide high CPU-to-memory ratios, making them ideal for web servers, API gateways, or batch processing primarily consuming CPU cycles. Memory-optimized instances like r5 or r6i families offer the inverse relationship, providing large amounts of RAM relative to CPU cores for applications like Redis clusters, in-memory databases, or data processing workloads.

Consider a practical scenario: your Redis cache requires 64 GB of memory but only utilizes 2 CPU cores. Running this on general-purpose instances may require multiple smaller nodes to achieve sufficient memory capacity, resulting in underutilized CPU resources. A single memory-optimized instance provides the required memory more cost-effectively while eliminating the overhead of managing multiple nodes. Instance size selection within families also matters because larger instances typically offer better price-performance ratios. However, you must balance this against fault tolerance considerations, as larger instances represent larger failure domains.

Node pool diversification and affinity strategies

Effective diversification involves creating targeted node pools that serve different workload categories without over-engineering your cluster topology. A typical strategy includes general-purpose instances for most workloads, compute-optimized pools for CPU-intensive applications, and memory-optimized pools for data-heavy services. Spot instance integration adds another cost optimization dimension, providing 50-90% savings for fault-tolerant workloads through mixed node pools that combine spot and on-demand capacity.

Node affinity and anti-affinity rules provide the scheduling intelligence to direct workloads to appropriate infrastructure automatically. You can configure batch processing jobs to prefer spot instances using node affinity rules.

Anti-affinity rules prevent scheduling conflicts by controlling the placement of pods across nodes. These rules stop multiple replicas of memory-intensive applications from scheduling on the same node, which prevents you from provisioning larger instances than necessary to accommodate poor pod distribution.
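
For example, the scheduling rules below express a preference for spot capacity and spread replicas of a memory-hungry cache across nodes. The node label key shown is the one Karpenter applies to its nodes; EKS managed node groups use eks.amazonaws.com/capacityType instead, and the app label is hypothetical:

# Pod spec excerpt: prefer spot nodes and keep cache replicas on separate hosts
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot"]
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis-cache           # hypothetical label
        topologyKey: kubernetes.io/hostname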

Karpenter’s automated optimization approach

Karpenter analyzes pending pod requirements in real time. It selects optimal instance types from the entire available catalogue. This approach is more effective than manual configuration.

Instead of predicting workload patterns, Karpenter reacts to actual demand. It doesn’t require pre-configured node pools. The system considers multiple factors at once: specific resource requirements, current pricing across instance families, availability zone capacity, and your specified constraints.

The consolidation feature optimizes node utilization by identifying opportunities to move workloads to more cost-effective instances. When usage patterns change or more efficient instance types become available, Karpenter automatically initiates workload migration and terminates underutilized nodes. 

The StormForge Optimize Live platform helps assess these data points by providing detailed analyses of node utilization and allocation, as well as node instance type distribution.

Namespace-level Kubernetes cost optimization 

Effective cost management in EKS requires establishing clear boundaries and accountability at the namespace level. Namespaces provide natural isolation for teams, applications, and environments, making them ideal units for cost allocation and control. Without proper namespace-level controls, a single team can inadvertently consume cluster resources worth thousands of dollars, leaving other teams with insufficient capacity.

Setting up ResourceQuotas with cost-aware limits

ResourceQuotas control aggregate resource consumption across all pods in a namespace by setting hard limits on:

  • CPU and memory allocations based on actual AWS pricing
  • Storage volumes to prevent data cost overruns
  • Object counts (pods, services, load balancers) to control infrastructure costs
  • Load balancer limits since each ALB costs ~$16/month

The key is designing quotas based on actual costs rather than arbitrary numbers. If your team has a $500/month budget and you’re running c5.2xlarge nodes at $0.34/hour, calculate how much CPU and memory that budget provides: for example, requests.cpu: “10” corresponds to approximately $200/month of allocated capacity.
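
A ResourceQuota along those lines might look like this sketch, where the namespace name and figures are illustrative:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-frontend-quota
  namespace: frontend              # hypothetical team namespace
spec:
  hard:
    requests.cpu: "10"             # roughly $200/month of capacity per the estimate above
    requests.memory: 40Gi
    limits.cpu: "20"
    limits.memory: 80Gi
    persistentvolumeclaims: "10"   # cap storage object sprawl
    services.loadbalancers: "2"    # each load balancer adds ~$16+/month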

Using LimitRanges to enforce container-level defaults

LimitRanges solves the “forgot to set limits” problem by automatically applying resource constraints to containers that don’t specify their own limits. This prevents individual containers from consuming excessive resources.

The most effective LimitRange setting for cost control is maxLimitRequestRatio, which helps maintain consistency between resource requests and limits. Without this control, pods can be configured with minimal requests (10m CPU) and much higher limits (2 CPU cores). This configuration leads to suboptimal scheduling and inefficient resource utilization. Kubernetes schedules based on requests, but pods can consume up to their limits, so maintaining a reasonable ratio improves scheduling accuracy and overall resource efficiency.

LimitRanges also establish sensible defaults for teams that don’t want to consider resource specifications, automatically applying proven configurations that balance performance with cost efficiency.
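
A LimitRange combining sensible defaults with a capped limit-to-request ratio could look like the following sketch (values are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: frontend              # hypothetical team namespace
spec:
  limits:
    - type: Container
      defaultRequest:              # applied when a container omits requests
        cpu: 250m
        memory: 256Mi
      default:                     # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
      maxLimitRequestRatio:
        cpu: "4"                   # limits may be at most 4x requests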

Configuring PodDisruptionBudgets to balance availability with cost efficiency

PodDisruptionBudgets (PDBs) enable aggressive cost optimization without breaking applications by defining how many pods can be unavailable during voluntary disruptions. They’re essential for:

  • Maintenance windows: Safe node upgrades and patches
  • Cluster autoscaling: Controlled node removal during scale-down
  • Spot instance interruptions: Graceful handling of instance terminations

Service-specific PDB strategies:

  • User-facing services: minAvailable: 2 maintains high availability during cost optimization events
  • Batch workloads: maxUnavailable: 80% enables aggressive spot instance usage and opportunistic scaling
  • Stateful applications: conservative settings protect data consistency during disruptions

PDBs become critical when using cluster autoscaling: they limit how many pods can be drained simultaneously so that applications keep functioning. Without PDBs, a cluster scale-down event could terminate all of a service’s replicas at once, causing a complete outage.
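
Two hedged examples of these strategies, with hypothetical app labels:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api-pdb
spec:
  minAvailable: 2                # user-facing service: keep at least 2 replicas running
  selector:
    matchLabels:
      app: web-api               # hypothetical label
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-worker-pdb
spec:
  maxUnavailable: 80%            # batch workload: tolerate aggressive node consolidation
  selector:
    matchLabels:
      app: batch-worker          # hypothetical label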

Kubernetes cost visibility and allocation

Effective cost management requires attributing expenses to specific teams, applications, and projects through systematic tagging and monitoring.

Kubernetes labels and annotations provide the foundation for cost attribution by enabling granular resource tagging. Labels allow you to query and group resources for cost analysis, while annotations can store additional metadata like budget codes or project identifiers that don’t affect Kubernetes operations.

A simple example of a labelling strategy can look like this:

  • Team ownership: team: frontend, team: backend
  • Environment designation: environment: production, environment: staging
  • Cost center allocation: cost-center: engineering, cost-center: marketing
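
Applied to a workload, this strategy might look like the sketch below; the Deployment name, image, and annotation key are hypothetical. Putting the labels on the pod template as well matters, because most cost tooling attributes spend at the pod level:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service               # hypothetical workload
  labels:
    team: frontend
    environment: production
    cost-center: engineering
  annotations:
    budget-code: "ENG-1234"            # hypothetical annotation carrying extra metadata
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
        team: frontend                 # pod-level labels are what cost tools aggregate on
        environment: production
        cost-center: engineering
    spec:
      containers:
        - name: checkout
          image: example.com/checkout:1.0   # hypothetical image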

Manual labelling processes inevitably break down as teams deploy more frequently. Developers forget to add required cost allocation labels during deployments. This “forgot to add labels” problem can destroy cost attribution at scale.

Automated enforcement: Admission controllers like Gatekeeper (which implements Open Policy Agent) can be used to enforce labelling policies automatically. This eliminates human error from the labelling process.
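
As a sketch, assuming the K8sRequiredLabels ConstraintTemplate from the Gatekeeper policy library is already installed (the exact parameter shape can vary by library version), a constraint requiring cost labels on Deployments might look like:

# Requires the K8sRequiredLabels ConstraintTemplate from the Gatekeeper library
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-cost-labels
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    labels:
      - key: team
      - key: cost-center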

Leveraging cost-saving infrastructure options

Infrastructure costs can contribute significantly to overall Kubernetes spend. Spot instances can reduce these costs substantially when a workload can accept the tradeoffs they come with. Tools like Karpenter take this a step further, letting teams leverage spot pricing when capacity is available and fall back to more reliable (and more expensive) on-demand instances when it is not. Let’s take a closer look at these two Kubernetes cost optimization options.

Cloud cost optimization for Kubernetes workloads

All major cloud providers offer spot instances (AWS), preemptible VMs (Google Cloud), and spot VMs (Azure) that provide significant cost savings (often 50-90%) compared to on-demand pricing. These interruption-tolerant instances represent the largest immediate cost optimization opportunity for Kubernetes clusters.

When to use spot instances:

  • Fault-tolerant workloads: Batch processing, CI/CD pipelines, development environments, and stateless applications that handle interruptions gracefully
  • Scalable applications: Services with multiple replicas that can tolerate individual pod terminations without service disruption

When to avoid spot instances:

  • Critical stateful workloads: Databases, message queues, or applications requiring persistent connections that can’t handle sudden terminations
  • Single-replica services: Applications without redundancy where instance loss causes immediate service outage

Karpenter’s spot instance optimization 

Karpenter maximizes spot instance effectiveness through intelligent diversification and hybrid capacity strategies. When configuring NodePools with both spot and on-demand capacity types, Karpenter prioritizes spot instances by default and automatically falls back to on-demand instances when spot capacity becomes unavailable. 

The system uses the Price Capacity Optimized (PCO) allocation strategy for spot instances, which weighs both price and interruption probability, while using lowest-price allocation for on-demand instances. 

For effective spot-to-spot consolidation, consider providing Karpenter with at least a couple of instance families and a variety of instance types to prevent “race to the bottom” scenarios where nodes are continuously replaced with cheaper but less stable instances. 

The best defence against running out of spot capacity is allowing Karpenter to provision from as many distinct instance types as possible. In this case, even higher-spec instances can be cheaper in the spot market than on-demand alternatives.
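
A NodePool along these lines might look like the sketch below, which mixes spot and on-demand capacity and diversifies across instance families; it assumes Karpenter v1 on AWS with an EC2NodeClass named "default":

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                        # assumed pre-existing EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]      # spot preferred, on-demand as fallback
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]            # diversify across several families
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]                      # avoid older, less efficient generations
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized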


Conclusion

Kubernetes cost optimization requires a systematic approach addressing workload-level efficiency and infrastructure-level intelligence. The techniques covered in this article—from autoscaling strategies and pod right-sizing to intelligent node selection—work together to create comprehensive cost management that scales with your environment. Success stems from understanding how these different optimization layers interact and complement one another.

The evolution from manual optimization to automated, ML-driven platforms such as StormForge and Karpenter represents the natural progression as Kubernetes environments mature and scale. Manual approaches provide valuable learning experiences and work well for smaller deployments, but automated systems become essential for managing cost optimization across hundreds of workloads. 
