# GPU Inference Serving with vLLM

Notes on running LLM/VLM inference in production on GPUs – specifically using vLLM on Kubernetes (AKS).

## What is vLLM?
vLLM is an open-source inference engine for LLMs. Key features:
- Continuous batching – processes multiple requests in one GPU forward pass
- PagedAttention – efficient KV cache memory management
- OpenAI-compatible API – drop-in replacement endpoint
- Supports quantisation (INT8, INT4) to reduce VRAM usage
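Because the API is OpenAI-compatible, any OpenAI SDK or plain HTTP client can talk to a vLLM pod. A minimal sketch of the request body a client POSTs to `/v1/chat/completions` (model name and prompt are illustrative, not from this deployment):

```python
import json

def chat_request(model: str, prompt: str, max_tokens: int = 256) -> str:
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions call."""
    body = {
        "model": model,  # must match the model the vLLM pod was started with
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

# POST this to http://<vllm-service>/v1/chat/completions with any HTTP client
payload = chat_request("Qwen/Qwen3-32B", "Summarise this purchase order.")
print(payload)
```

Existing OpenAI client libraries work unchanged by pointing their base URL at the vLLM service.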
## One Model Per vLLM Process
Critical constraint: vLLM serves one model per process (pod). You cannot load two different models into the same vLLM instance.
If you need VLM + LLM (e.g., Qwen3-VL for document reading + Qwen3-32B for text reasoning), you need two separate pods.
## Running Two Models: Two-Pod Options

### Option A – 2 Separate Nodes (recommended for production)
```
Node 1 (A100 80GB)         Node 2 (A100 80GB)
└── vLLM pod (Qwen3-VL)    └── vLLM pod (Qwen3-32B)
    (full GPU)                 (full GPU)
```
- Full GPU throughput per model
- Independent scaling – add VLM nodes if document volume grows
- Simple config – each pod requests `nvidia.com/gpu: 1` cleanly
- Easy model upgrades – upgrade one without touching the other
- Downside: 2× baseline cost
### Option B – A100 MIG (Multi-Instance GPU) on 1 Node

A100 supports MIG (Multi-Instance GPU) – partition the physical GPU into isolated slices, each with dedicated VRAM and compute.
```
Node 1 (A100 80GB, MIG-enabled)
├── MIG slice 1 (40GB) → vLLM pod (Qwen3-VL-8B, needs ~16GB FP16)
└── MIG slice 2 (40GB) → vLLM pod (Qwen3-32B INT8, needs ~32GB)
```
- Half the cost of Option A
- True isolation – each pod has dedicated VRAM, no interference
- Downside: MIG partition sizes are fixed – less flexible than 2 independent nodes
- Downside: more complex Kubernetes config (device plugin setup)
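With the NVIDIA device plugin running in MIG mode, each slice is advertised as its own schedulable resource. A hedged sketch of a pod requesting one `3g.40gb` slice on an A100 80GB (pod name, image tag, and the exact MIG resource name depend on your partition layout and device-plugin strategy):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-qwen3-32b        # illustrative name
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      args: ["--model", "Qwen/Qwen3-32B"]
      resources:
        limits:
          nvidia.com/mig-3g.40gb: 1   # one 40GB MIG slice ("mixed" strategy)
```

Under the "single" MIG strategy the slices instead appear as plain `nvidia.com/gpu` resources.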
### Option C – GPU Time-Slicing (avoid for production inference)

NVIDIA time-slicing lets multiple pods share one GPU by taking turns. Both request `nvidia.com/gpu: 1` but physically share the device.
- Cheapest option
- Downside: latency is unpredictable – pods wait for their turn
- Downside: not suitable for inference workloads with a < 60s SLA
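Time-slicing is enabled through the NVIDIA device plugin's sharing config; a sketch (replica count is illustrative):

```yaml
# nvidia-device-plugin config enabling time-slicing
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2   # advertise each physical GPU as 2 schedulable GPUs
```

The replicas are purely a scheduling construct – there is no VRAM or compute isolation between the sharing pods.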
Decision guide:
| | Option A (2 nodes) | Option B (MIG) | Option C (time-slice) |
|---|---|---|---|
| Cost | 2× | 1× | 1× |
| Throughput | Maximum | Good | Degraded |
| Complexity | Low | Medium | Low |
| Production SLA | Yes | Yes | No |
| Independent scaling | Yes | No | No |
Recommendation: Start with Option B (MIG) for Wave 1 / small volume. Graduate to Option A (2 nodes) when document volume grows or models need to scale independently.
## vLLM Dynamic Batching (Continuous Batching)
This is often confused with "batch processing jobs". They are different things:
| | Dynamic batching (vLLM) | Batch processing jobs |
|---|---|---|
| What | Groups multiple concurrent inference requests into one GPU forward pass | Long-running jobs: model training, nightly re-indexing |
| Duration | Milliseconds to seconds | Minutes to hours |
| SLA | Within real-time SLA | Best-effort / hourly |
| Where | `gpu-realtime` pool | `gpu-batch` pool |
When 30 POs arrive simultaneously, vLLM groups them into a batch of up to 32 and processes them in one shot – still completing within the < 60s SLA. Configure with:
```yaml
args:
  - "--max-num-seqs=32"              # max concurrent sequences per forward pass
  - "--gpu-memory-utilization=0.92"  # use 92% of A100 VRAM
```
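To see why continuous batching beats static batching, a toy simulation (the scheduling model is deliberately simplified: one token per sequence per step, and continuous batching refills a slot the moment a sequence finishes):

```python
from collections import deque

def static_batch_steps(lengths, max_seqs):
    """Static batching: each batch occupies the GPU until its longest member finishes."""
    steps, queue = 0, deque(lengths)
    while queue:
        n = min(max_seqs, len(queue))
        batch = [queue.popleft() for _ in range(n)]
        steps += max(batch)  # short requests wait for the longest in their batch
    return steps

def continuous_batch_steps(lengths, max_seqs):
    """Continuous batching: finished sequences free their slot immediately."""
    queue, running, steps = deque(lengths), [], 0
    while queue or running:
        while queue and len(running) < max_seqs:
            running.append(queue.popleft())  # admit new sequences mid-flight
        steps += 1
        running = [r - 1 for r in running if r > 1]  # drop finished sequences
    return steps

lengths = [100, 10, 10, 10]  # one long request, three short ones
print(static_batch_steps(lengths, max_seqs=2))      # → 110
print(continuous_batch_steps(lengths, max_seqs=2))  # → 100
```

The short requests no longer queue behind the long one, which is exactly the latency win that keeps bursty traffic inside the SLA.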
## VRAM Sizing Guide (A100 80GB)
| Model | FP16 VRAM | INT8 VRAM | Fits on A100 80GB? |
|---|---|---|---|
| Qwen3-VL-8B | ~16 GB | ~8 GB | Yes, comfortably |
| Qwen3-32B | ~64 GB | ~32 GB | Yes, INT8 only |
| Qwen3-VL-8B + Qwen3-32B (same node MIG) | ~80 GB | ~40 GB | Yes, INT8, tight |
| BGE-M3 (embedding) | ~2 GB | ~1 GB | Yes, very small |
Rule of thumb: Use INT8 quantisation for large models in production. Quality loss is negligible for document extraction and text mapping tasks.
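The VRAM figures above follow a simple rule: weight memory ≈ parameter count × bytes per parameter (2 for FP16, 1 for INT8), with KV cache and activations on top. A rough estimator of the weights-only footprint (the real total depends on context length and batch size):

```python
def weight_vram_gb(params_billions: float, bytes_per_param: int) -> float:
    """Approximate VRAM needed just for model weights, in GB."""
    return params_billions * bytes_per_param  # 1B params at 1 byte ~= 1 GB

print(weight_vram_gb(32, 2))  # Qwen3-32B FP16  → 64.0 GB (matches the table)
print(weight_vram_gb(32, 1))  # Qwen3-32B INT8  → 32.0 GB
print(weight_vram_gb(8, 2))   # Qwen3-VL-8B FP16 → 16.0 GB
```

This is why `--gpu-memory-utilization` matters: whatever VRAM the weights leave free becomes KV cache, which caps the number of concurrent sequences.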
## GPU Pool Design Pattern
For a production AI hub serving both VLM and LLM:
```
gpu-realtime pool (on-demand VMs)
├── vLLM pod → VLM model (document reading)
└── vLLM pod → LLM model (text reasoning / mapping)

gpu-batch pool (spot VMs)
└── vLLM / embedding service → BGE-M3, batch jobs
```
Keep real-time inference on on-demand nodes (guaranteed capacity, never preempted). Put batch workloads on spot nodes (60% cheaper, tolerate interruption).
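On AKS, spot node pools carry a well-known taint that batch pods must tolerate; a sketch of the pod-spec fragment (the key and value are the standard AKS ones, the placement belongs in your batch Deployment):

```yaml
# Pin batch workloads to the AKS spot pool
tolerations:
  - key: kubernetes.azure.com/scalesetpriority
    operator: Equal
    value: spot
    effect: NoSchedule
nodeSelector:
  kubernetes.azure.com/scalesetpriority: spot
```

Real-time vLLM pods simply omit this toleration, so the scheduler can never place them on preemptible capacity.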