
# Kubernetes: Pod vs Node

The Pod-vs-Node distinction is one of the most confusing concepts when starting with Kubernetes. The office-building analogy is what finally made it click for me.

## The Simple Analogy

| Kubernetes | Real world |
|---|---|
| Node | The building (physical or virtual) |
| Pod | One office (tenant) inside the building |
| Kubernetes | The building manager |
  • Node = a real Azure/AWS/GCP virtual machine with actual CPU, RAM, and optionally GPU. You pay for it by the hour whether anyone is inside or not.
  • Pod = a running program that lives inside a node. Uses a slice of the node's CPU/RAM/GPU. One node can hold many pods.
  • Kubernetes = decides which pod goes into which node. Adds new nodes when full. Removes empty nodes to save cost.
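
In manifest terms, a pod "rents" its slice of a node by declaring resource requests, which the scheduler uses to pick a node with enough free capacity. A minimal sketch — the pod name and image are hypothetical placeholders:

```yaml
# Hypothetical pod: asks the scheduler for a 2-CPU / 4 Gi slice of some node.
apiVersion: v1
kind: Pod
metadata:
  name: doc-normaliser            # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/doc-normaliser:latest   # placeholder image
      resources:
        requests:                 # what the scheduler reserves on the node
          cpu: "2"
          memory: 4Gi
        limits:                   # hard cap enforced inside the node
          cpu: "2"
          memory: 4Gi
```

The scheduler will only place this pod on a node that still has at least 2 unreserved CPUs and 4 Gi of unreserved memory.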

## Key Differences

| Property | Node | Pod |
|---|---|---|
| What is it | Azure/AWS VM | Running software process |
| Startup time | 5–8 min (VM provisioning) | 10–30 sec (container start) |
| Cost | Paid per hour while it exists | No direct cost — shares the node's cost |
| Who manages it | Cluster Autoscaler | KEDA / HPA |
| Count on one node | 1 | Many |

## One Node, Many Pods

```
NODE (Azure VM — 24 CPU · 220 GB RAM · 1× A100 GPU)
├── vLLM pod           → uses: GPU (full), 8 CPU, 64 GB RAM
├── Doc normaliser pod → uses: 2 CPU, 4 GB RAM (no GPU)
├── Router pod         → uses: 1 CPU, 2 GB RAM (no GPU)
└── SAP connector pod  → uses: 2 CPU, 4 GB RAM (no GPU)
```

Pod count does not affect Azure billing. Whether 1 pod or 10 pods run on the same node, the bill stays the same. Cost only increases when you add a second node.

## When does a new Node get added?

Only when the node runs out of a resource that a new pod needs. For AI inference:

  1. New POs arrive → KEDA adds more vLLM pods (fast, seconds)
  2. Node runs out of GPU capacity → new pods can't fit → they go Pending
  3. Cluster Autoscaler detects Pending pods → provisions a new Azure VM (5–8 min)
  4. The new pods schedule on the new node
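
Step 1 above is driven by a KEDA ScaledObject. A sketch of what one might look like — the Deployment name, Prometheus address, and metric query are all hypothetical assumptions, not something from this setup:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler               # hypothetical name
spec:
  scaleTargetRef:
    name: vllm                    # hypothetical Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus            # assumes load is exposed as a Prometheus metric
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # placeholder address
        query: sum(vllm_pending_requests)                  # hypothetical metric
        threshold: "10"           # add a replica per 10 pending requests
```

KEDA only changes the replica count; when the extra replicas no longer fit on existing nodes, they sit in Pending and it is the Cluster Autoscaler's job to add a node.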

## The Scaling Rule

KEDA scales pods first (seconds). When the node can't fit more pods, Cluster Autoscaler scales nodes (minutes).

This is why pre-warming matters for AI inference — if you need a node ready in < 60 seconds, you must provision it before traffic arrives, not after.
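
One common pre-warming pattern (not the only one) is an overprovisioning placeholder: a deployment of low-priority pause pods that reserves a spare node ahead of time and gets evicted the instant a real workload needs the space, so the Cluster Autoscaler's 5–8 minutes are paid before traffic arrives. A sketch, with hypothetical names and assuming the standard pause image:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                        # lower than any real workload, so it is evicted first
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-warmer                # hypothetical name
spec:
  replicas: 1                     # one spare GPU node kept warm
  selector:
    matchLabels: {app: gpu-warmer}
  template:
    metadata:
      labels: {app: gpu-warmer}
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            limits:
              nvidia.com/gpu: 1   # forces the autoscaler to keep a GPU node up
```

When a real GPU pod goes Pending, the scheduler preempts the pause pod immediately, and the pause pod in turn goes Pending — triggering the autoscaler to provision the next spare node in the background.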

## GPU Pods — Special Rules

GPU workloads come with one important constraint: by default, requesting `nvidia.com/gpu: 1` in Kubernetes means exclusive access to the full GPU. Only one pod can claim each GPU.

On a node with 1 A100:

  • Valid: 1 vLLM pod (owns the GPU) + N CPU-only pods
  • Invalid: 2 vLLM pods both requesting `nvidia.com/gpu: 1` — the second one can't schedule
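
The exclusive claim is expressed as a resource limit. A sketch of the GPU pod's spec — pod name and image tag are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm                      # hypothetical name
spec:
  containers:
    - name: server
      image: vllm/vllm-openai:latest   # placeholder tag
      resources:
        limits:
          nvidia.com/gpu: 1       # exclusive claim on one whole physical GPU
          cpu: "8"
          memory: 64Gi
```

Note that `nvidia.com/gpu` is an extended resource: it must appear under `limits` (requests, if given, must equal limits), and it cannot be a fraction — `0.5` is rejected, which is exactly why a second such pod on a 1-GPU node stays Pending.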

Solutions when you need 2 GPU pods on 1 node: see GPU Inference Serving.