
GPU Inference Serving with vLLM

Notes on running LLM/VLM inference in production on GPUs β€” specifically using vLLM on Kubernetes (AKS).

What is vLLM?

vLLM is an open-source inference engine for LLMs. Key features:

  • Continuous batching β€” processes multiple requests in one GPU forward pass
  • PagedAttention β€” efficient KV cache memory management
  • OpenAI-compatible API β€” drop-in replacement endpoint
  • Supports quantisation (INT8, INT4) to reduce VRAM usage
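
As a sketch of how these pieces come together, a minimal container spec for serving one model behind vLLM's OpenAI-compatible API might look like this (image tag, model name, and port are illustrative assumptions, not taken from this setup):

```yaml
# Sketch: single vLLM pod serving one model via the OpenAI-compatible API.
containers:
- name: vllm
  image: vllm/vllm-openai:latest   # official vLLM serving image
  args:
  - --model=Qwen/Qwen3-32B         # one model per process
  - --port=8000
  resources:
    limits:
      nvidia.com/gpu: 1            # one full GPU for this pod
```

Clients then call it exactly like the OpenAI API, e.g. `POST /v1/chat/completions` on port 8000.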

One Model Per vLLM Process

Critical constraint: vLLM serves one model per process (pod). You cannot load two different models into the same vLLM instance.

If you need VLM + LLM (e.g., Qwen3-VL for document reading + Qwen3-32B for text reasoning), you need two separate pods.

Running 2 Models: Options

Option A β€” 2 Separate Nodes, 1 Full GPU Each

Node 1 (A100 80GB)            Node 2 (A100 80GB)
└── vLLM pod (Qwen3-VL)       └── vLLM pod (Qwen3-32B)
    Full GPU                      Full GPU
  • Full GPU throughput per model
  • Independent scaling β€” add VLM nodes if document volume grows
  • Simple config β€” each pod gets nvidia.com/gpu: 1 cleanly
  • Easy model upgrades β€” upgrade one without touching the other
  • Downside: 2Γ— baseline cost
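
As a sketch, Option A is simply two independent Deployments, each pinned to its own GPU node (the node-pool label and image tag here are illustrative assumptions):

```yaml
# Sketch: one Deployment per model, one full A100 each (Option A).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen3-vl
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-qwen3-vl }
  template:
    metadata:
      labels: { app: vllm-qwen3-vl }
    spec:
      nodeSelector:
        gpu-pool: a100             # assumed node-pool label
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args: [--model=Qwen/Qwen3-VL-8B]
        resources:
          limits:
            nvidia.com/gpu: 1      # whole GPU, no sharing
---
# A second, near-identical Deployment serves Qwen3-32B on the other node.
```

Independent scaling is then just a matter of raising `replicas` on one Deployment (given a spare A100 node) without touching the other.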

Option B β€” A100 MIG (Multi-Instance GPU) on 1 Node

The A100 supports MIG (Multi-Instance GPU): the physical GPU is partitioned into isolated slices, each with dedicated VRAM and compute.

Node 1 (A100 80GB, MIG-enabled)
β”œβ”€β”€ MIG slice 1 (40GB) β†’ vLLM pod (Qwen3-VL-8B, needs ~16GB FP16)
└── MIG slice 2 (40GB) β†’ vLLM pod (Qwen3-32B INT8, needs ~32GB)
  • Half the cost of Option A
  • True isolation β€” each pod has dedicated VRAM, no interference
  • Downside: MIG partition sizes are fixed β€” less flexible than 2 independent nodes
  • Downside: More complex Kubernetes config (device plugin setup)
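
With MIG enabled and the NVIDIA device plugin in "mixed" strategy, each pod requests a specific slice profile instead of a whole GPU. A sketch for the two 40GB slices above (profile names follow NVIDIA's A100 naming; exact availability depends on GPU model and driver):

```yaml
# Sketch: each vLLM pod requests one 40GB MIG slice (Option B).
resources:
  limits:
    nvidia.com/mig-3g.40gb: 1   # 3 compute slices + 40GB VRAM of the A100
```

On AKS this also requires creating the node pool with a MIG profile (e.g. `--gpu-instance-profile MIG3g` on `az aks nodepool add`), which is what makes the slices appear as schedulable resources.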

Option C β€” GPU Time-Slicing (avoid for production inference)

NVIDIA time-slicing lets multiple pods share one GPU by taking turns. Both pods request nvidia.com/gpu: 1 but physically share the device.

  • Cheapest option
  • Downside: Latency unpredictable β€” pods wait for their turn
  • Downside: Not suitable for < 60s SLA inference workloads
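
For completeness, time-slicing is enabled through the NVIDIA device plugin's sharing config; the plugin then advertises N "virtual" GPUs per physical device. A sketch:

```yaml
# Sketch: NVIDIA device plugin config enabling 2-way time-slicing (Option C).
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 2        # one physical GPU advertised as 2 schedulable GPUs
```

Note that unlike MIG, this provides no memory isolation: both pods see the full 80 GB and can OOM each other.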

Decision guide:

|                     | Option A (2 nodes) | Option B (MIG) | Option C (time-slice) |
| ------------------- | ------------------ | -------------- | --------------------- |
| Cost                | 2Γ—                 | 1Γ—             | 1Γ—                    |
| Throughput          | Maximum            | Good           | Degraded              |
| Complexity          | Low                | Medium         | Low                   |
| Production SLA      | Yes                | Yes            | No                    |
| Independent scaling | Yes                | No             | No                    |

Recommendation: Start with Option B (MIG) for Wave 1 / small volume. Graduate to Option A (2 nodes) when document volume grows or models need to scale independently.

vLLM Dynamic Batching (Continuous Batching)

This is often confused with "batch processing jobs". They are different things:

|          | Dynamic batching (vLLM)                                                 | Batch processing jobs                                  |
| -------- | ----------------------------------------------------------------------- | ------------------------------------------------------ |
| What     | Groups multiple concurrent inference requests into one GPU forward pass | Long-running jobs: model training, nightly re-indexing |
| Duration | Milliseconds to seconds                                                 | Minutes to hours                                       |
| SLA      | Within real-time SLA                                                    | Best-effort / hourly                                   |
| Where    | gpu-realtime pool                                                       | gpu-batch pool                                         |

When 30 POs arrive simultaneously, vLLM groups them into a batch of up to 32 and processes them in one forward pass β€” still completing within the < 60s SLA. Configure via the server's startup arguments (vLLM reads these as CLI flags, not environment variables):

```yaml
args:
- --max-num-seqs=32              # max concurrent sequences per forward pass
- --gpu-memory-utilization=0.92  # use 92% of A100 VRAM for weights + KV cache
```

VRAM Sizing Guide (A100 80GB)

| Model                                    | FP16 VRAM | INT8 VRAM | Fits on A100 80GB? |
| ---------------------------------------- | --------- | --------- | ------------------ |
| Qwen3-VL-8B                              | ~16 GB    | ~8 GB     | Yes, comfortably   |
| Qwen3-32B                                | ~64 GB    | ~32 GB    | Yes, INT8 only     |
| Qwen3-VL-8B + Qwen3-32B (same node, MIG) | ~80 GB    | ~40 GB    | Yes, INT8, tight   |
| BGE-M3 (embedding)                       | ~2 GB     | ~1 GB     | Yes, very small    |

Rule of thumb: Use INT8 quantisation for large models in production. Quality loss is negligible for document extraction and text mapping tasks.
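
In vLLM, quantisation is selected when the server starts. A hedged sketch, assuming a pre-quantised GPTQ INT8 checkpoint of the model exists (the checkpoint name here is illustrative, not a confirmed release):

```yaml
args:
- --model=Qwen/Qwen3-32B-GPTQ-Int8   # assumed pre-quantised checkpoint
- --quantization=gptq                # tell vLLM the weight format
- --gpu-memory-utilization=0.92
```

vLLM can usually also infer the quantisation scheme from the checkpoint's config, so the explicit flag is mainly a safeguard.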

GPU Pool Design Pattern

For a production AI hub serving both VLM and LLM:

gpu-realtime pool (on-demand VMs)
β”œβ”€β”€ vLLM pod β€” VLM model (document reading)
└── vLLM pod β€” LLM model (text reasoning / mapping)

gpu-batch pool (spot VMs)
└── vLLM / embedding service β€” BGE-M3, batch jobs

Keep real-time inference on on-demand nodes (guaranteed capacity, never preempted). Put batch workloads on spot nodes (60% cheaper, tolerate interruption).
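
On AKS, spot node pools are tainted with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`, so batch workloads must opt in explicitly. A sketch of the selector + toleration that pins a job to the gpu-batch pool:

```yaml
# Sketch: pin a batch embedding job to the spot GPU pool.
spec:
  nodeSelector:
    kubernetes.azure.com/scalesetpriority: spot
  tolerations:
  - key: kubernetes.azure.com/scalesetpriority
    operator: Equal
    value: spot
    effect: NoSchedule
```

Real-time vLLM pods simply omit the toleration, so the scheduler can never place them on preemptible nodes.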