# Kubernetes: Pod vs Node
The difference between a pod and a node is one of the most confusing concepts when starting with Kubernetes. The office-building analogy is what made it click for me.
## The Simple Analogy
| Kubernetes | Real world |
|---|---|
| Node | The building (physical or virtual) |
| Pod | One office (tenant) inside the building |
| Kubernetes | The building manager |
- Node = a real Azure/AWS/GCP virtual machine with actual CPU, RAM, and optionally GPU. You pay for it by the hour whether anyone is inside or not.
- Pod = a running program that lives inside a node. Uses a slice of the node's CPU/RAM/GPU. One node can hold many pods.
- Kubernetes = decides which pod goes into which node. Adds new nodes when full. Removes empty nodes to save cost.
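The pod's "slice" of the node is declared in its resource requests. A minimal sketch of what that looks like in a pod spec — the names and image below are made up for illustration:

```yaml
# Hypothetical pod asking for a slice of whatever node it lands on.
apiVersion: v1
kind: Pod
metadata:
  name: router                 # made-up name
spec:
  containers:
    - name: router
      image: registry.example.com/router:latest   # placeholder image
      resources:
        requests:              # what the scheduler reserves on the node
          cpu: "1"
          memory: 2Gi
        limits:                # hard cap while running on the node
          cpu: "2"
          memory: 2Gi
```

The scheduler only places this pod on a node that still has 1 CPU and 2 GiB of unreserved capacity — the building manager checking that the office is actually free.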
## Key Differences
| Property | Node | Pod |
|---|---|---|
| What is it | Azure/AWS VM | Running software process |
| Startup time | 5–8 min (VM provisioning) | 10–30 sec (container start) |
| Cost | Paid per hour while it exists | No direct cost; shares the node's cost |
| Who manages it | Cluster Autoscaler | KEDA / HPA |
| Count on one node | 1 | Many |
## One Node, Many Pods
```
NODE (Azure VM – 24 CPU · 220 GB RAM · 1× A100 GPU)
├── vLLM pod            → uses: GPU (full), 8 CPU, 64 GB RAM
├── Doc normaliser pod  → uses: 2 CPU, 4 GB RAM (no GPU)
├── Router pod          → uses: 1 CPU, 2 GB RAM (no GPU)
└── SAP connector pod   → uses: 2 CPU, 4 GB RAM (no GPU)
```
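A toy fit-check, using the numbers from the tree above, shows why a second GPU pod would not fit even though plenty of CPU and RAM remain. This is a simplification of the real scheduler's per-node filter, with illustrative pod names:

```python
# Toy version of the scheduler's fit check: a pod fits on a node only if
# EVERY resource it requests is still unreserved. Numbers match the
# example node above (24 CPU, 220 GB RAM, 1 GPU).

node_capacity = {"cpu": 24, "memory_gb": 220, "gpu": 1}

pods = [
    {"name": "vllm",           "cpu": 8, "memory_gb": 64, "gpu": 1},
    {"name": "doc-normaliser", "cpu": 2, "memory_gb": 4,  "gpu": 0},
    {"name": "router",         "cpu": 1, "memory_gb": 2,  "gpu": 0},
    {"name": "sap-connector",  "cpu": 2, "memory_gb": 4,  "gpu": 0},
]

def remaining(capacity, scheduled):
    """Capacity left after subtracting every scheduled pod's requests."""
    left = dict(capacity)
    for pod in scheduled:
        for resource in ("cpu", "memory_gb", "gpu"):
            left[resource] -= pod[resource]
    return left

def fits(pod, capacity, scheduled):
    """True if the node can still satisfy all of this pod's requests."""
    left = remaining(capacity, scheduled)
    return all(left[r] >= pod[r] for r in ("cpu", "memory_gb", "gpu"))

print(remaining(node_capacity, pods))
# → {'cpu': 11, 'memory_gb': 146, 'gpu': 0}

# 11 CPUs and 146 GB RAM free, but the GPU is taken, so a second
# vLLM pod cannot fit and would sit in Pending.
second_vllm = {"name": "vllm-2", "cpu": 8, "memory_gb": 64, "gpu": 1}
print(fits(second_vllm, node_capacity, pods))
# → False
```

One resource at zero is enough to block scheduling — which is exactly what triggers the Cluster Autoscaler in the next section.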
Pod count does not affect Azure billing. Whether 1 pod or 10 pods run on the same node, the bill stays the same. Cost only increases when you add a second node.
## When does a new Node get added?
Only when the node runs out of a resource that a new pod needs. For AI inference:
- New POs arrive → KEDA adds more vLLM pods (fast, seconds)
- Node runs out of GPU capacity → new pods can't fit → they go `Pending`
- Cluster Autoscaler detects `Pending` pods → provisions a new Azure VM (5–8 min)
- New pods schedule on the new node
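The pod-scaling half of this flow can be wired up with a KEDA `ScaledObject`. This is only a sketch — the target name, Prometheus address, metric query, and threshold are all invented for illustration:

```yaml
# Hypothetical KEDA ScaledObject scaling vLLM replicas on demand.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler            # made-up name
spec:
  scaleTargetRef:
    name: vllm                 # the Deployment running the vLLM pods
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus         # queue-depth triggers are another option
      metadata:
        serverAddress: http://prometheus.monitoring:9090  # placeholder
        query: sum(rate(po_ingest_total[1m]))             # made-up metric
        threshold: "10"
```

KEDA reacts in seconds; once the extra replicas no longer fit on existing nodes, they go `Pending` and the Cluster Autoscaler takes over.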
## The Scaling Rule
KEDA scales pods first (seconds). When the node can't fit more pods, Cluster Autoscaler scales nodes (minutes).
This is why pre-warming matters for AI inference: if you need a node ready in < 60 seconds, you must provision it before traffic arrives, not after.
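One common pre-warming pattern (described in the Cluster Autoscaler FAQ as "overprovisioning") is to run low-priority placeholder pods that hold a spare node warm; a real pod evicts them instantly instead of waiting minutes for a VM. A sketch, with illustrative names:

```yaml
# Negative-priority class so any real workload can preempt the placeholder.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods that real pods may preempt."
---
# "Balloon" pod reserving a whole GPU node in advance (made-up name).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-balloon
spec:
  priorityClassName: overprovisioning
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9   # does nothing, just occupies space
      resources:
        limits:
          nvidia.com/gpu: 1              # keeps a spare GPU node warm
```

When a real vLLM pod arrives, the balloon is preempted, the real pod schedules in seconds, and the autoscaler replaces the evicted balloon on a fresh node in the background.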
## GPU Pods: Special Rules
With GPU workloads, one important constraint: by default, `nvidia.com/gpu: 1` in Kubernetes means exclusive access to the full GPU. Only one pod can claim it per GPU.
On a node with 1 A100:
- Valid: 1 vLLM pod (owns the GPU) + N CPU-only pods
- Invalid: 2 vLLM pods both requesting `nvidia.com/gpu: 1` → the second one can't schedule
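In the pod spec, that exclusive claim is a resource limit. A sketch of the relevant fragment:

```yaml
# GPU counts go under resources.limits and must be whole integers;
# the NVIDIA device plugin rejects fractional values like "0.5".
resources:
  limits:
    nvidia.com/gpu: 1   # exclusive: this pod owns the entire GPU
```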
Solutions when you need 2 GPU pods on 1 node: see GPU Inference Serving.