Deployment with Horizontal Pod Autoscaler

Configure a Kubernetes Deployment with resource requests and limits optimized for Horizontal Pod Autoscaler (HPA) based on CPU utilization.

Detailed Explanation

Autoscaling with HPA

The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas based on observed metrics like CPU utilization. For HPA to work correctly, your pods must have resource requests defined.

Deployment Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scalable-app
  labels:
    app: "scalable-app"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: "scalable-app"
  template:
    metadata:
      labels:
        app: "scalable-app"
    spec:
      containers:
        - name: app
          image: my-app:latest
          ports:
            - name: http
              containerPort: 8080
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

HPA Configuration (separate resource)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scalable-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scalable-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Why Resource Requests Matter

HPA calculates CPU utilization as a percentage of the requested CPU. If your pod requests 200m CPU and is using 140m, utilization is 70%. Without requests, HPA cannot calculate utilization and will not scale.
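
This calculation feeds directly into the HPA scaling formula, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). A minimal sketch in Python (the function name and parameters are illustrative, not part of any Kubernetes API):

```python
import math

def desired_replicas(current_replicas: int,
                     current_usage_millicores: float,
                     requested_millicores: float,
                     target_utilization_pct: float) -> int:
    """Sketch of the HPA scaling rule:
    desired = ceil(current * currentUtilization / targetUtilization)."""
    # Utilization is usage expressed as a percentage of the CPU request.
    current_utilization = 100.0 * current_usage_millicores / requested_millicores
    return math.ceil(current_replicas * current_utilization / target_utilization_pct)

# 2 replicas, each requesting 200m and averaging 140m (70% utilization),
# target 70% -> no change
print(desired_replicas(2, 140, 200, 70))   # 2
# average usage rises to 210m per pod (105%) -> scale up to 3
print(desired_replicas(2, 210, 200, 70))   # 3
```

The real controller also applies tolerances and averages metrics across pods, but the proportional relationship between usage, requests, and replica count is the same.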

Best Practices for HPA

  • Set requests realistically: Requests that are too generous keep measured utilization low, so HPA scales too late; requests that are too low inflate utilization, so it scales too aggressively
  • Set CPU limits above requests: Pods can absorb brief spikes without throttling; sustained usage above the target utilization will still trigger a scale-up
  • Readiness probes: New pods should only receive traffic when ready; otherwise, HPA may scale up more than needed based on pods that are not yet serving requests
  • Cooldown periods: The default scale-down stabilization window is 300 seconds (5 minutes), which prevents replica counts from flapping
  • Min replicas >= 2: Always keep at least 2 replicas for high availability
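
The cooldown behavior above is tunable via the optional behavior field in the autoscaling/v2 HPA spec. A sketch (the window and policy values shown are illustrative defaults, not recommendations for every workload):

```yaml
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # default; raise to scale down more cautiously
      policies:
        - type: Pods
          value: 1          # remove at most 1 pod...
          periodSeconds: 60 # ...per 60-second window
```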

Use Case

Building auto-scaling applications on Kubernetes that handle variable traffic loads, from web APIs to microservices that need to scale with demand while maintaining cost efficiency.
