LLM inference looks simple from the outside: prompt in, tokens out. Underneath, it's a complex multi-phase pipeline where traditional monitoring is blind to the bottlenecks that actually matter.
The Problem with Standard GPU Monitoring
Prometheus + Grafana gives you GPU utilization, memory usage, and throughput. For LLM inference, these metrics hide more than they reveal:
Prefill/Decode Imbalance
Prefill is compute-bound; decode is memory-bound. Average GPU utilization looks fine even when one phase is starved.
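A toy arithmetic sketch (numbers invented for illustration) of how a time-weighted average masks a starved decode phase:

```python
# Illustrative numbers only: a phase-blind monitor collapses per-phase
# utilization into a single average that looks healthy.
prefill_util = 0.90        # compute-bound prefill, near saturation
decode_util = 0.40         # memory-bound decode, starved
prefill_time_share = 0.50  # fraction of wall time spent in prefill

avg_util = (prefill_util * prefill_time_share
            + decode_util * (1 - prefill_time_share))
print(f"average: {avg_util:.0%}, decode phase: {decode_util:.0%}")
```

The blended number sits in the comfortable mid-range while the decode phase is running at less than half capacity.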
KV Cache Fragmentation
HBM fills with cached key-value pairs. A 95% hit rate can hide 30% fragmentation that causes eviction storms.
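One common way to quantify this (an assumed definition, not necessarily VGAC's) is to compare the largest contiguous free region against total free memory:

```python
def fragmentation_score(free_block_sizes):
    """0.0 = one contiguous free region; approaching 1.0 = badly fragmented.
    Assumed definition: 1 - largest_free_block / total_free."""
    total = sum(free_block_sizes)
    if total == 0:
        return 0.0
    return 1.0 - max(free_block_sizes) / total

# 3 GB free overall, but scattered across small holes: a long-context
# request needing a 1 GB contiguous KV region cannot be placed, so the
# allocator starts evicting despite plenty of "free" HBM.
holes_gb = [0.5, 0.25, 0.75, 0.5, 0.5, 0.5]
score = fragmentation_score(holes_gb)
```

A hit-rate gauge alone never surfaces this: the cache is serving reads fine right up until the eviction storm starts.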
NIXL Transfer Latency
NVIDIA's NIXL (NVIDIA Inference Xfer Library) moves KV caches between nodes. A 10ms network hiccup can cascade into 200ms tail latencies.
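The cascade mechanism can be sketched with a toy model (all numbers invented): one stalled transfer gates an entire continuous batch, so a single slow KV move shows up in dozens of requests' latencies at once.

```python
# Toy model: 1000 requests, decode steps normally ~2 ms. One KV-cache
# transfer stalls 10 ms, and the 64 requests batched behind it all
# inherit the stall -- one slow transfer, 64 inflated latencies.
batch_size = 64
n_requests = 1000
normal_step_ms = 2.0

request_latency = [normal_step_ms] * n_requests
for r in range(batch_size):           # requests gated by the stalled transfer
    request_latency[r] += 10.0

p99 = sorted(request_latency)[int(0.99 * n_requests) - 1]
```

Because the affected cohort (64 of 1000 requests, 6.4%) is larger than the top percentile, the p99 jumps from 2ms to 12ms even though mean transfer latency barely moved.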
Batch Scheduling Interference
Continuous batching means new requests join mid-decode. Without phase-aware metrics, you can't see the interference.
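A minimal sketch of what "phase-aware" means here, with an assumed per-step schema: tag each decode step with whether a prefill was admitted into the same batch, then compare the two populations.

```python
from collections import defaultdict

# Invented trace data. Without the prefill_in_batch tag, these two
# populations blur into one distribution and the interference is invisible.
steps = [
    {"latency_ms": 2.1, "prefill_in_batch": False},
    {"latency_ms": 2.0, "prefill_in_batch": False},
    {"latency_ms": 9.5, "prefill_in_batch": True},  # long prompt joined mid-decode
    {"latency_ms": 8.8, "prefill_in_batch": True},
]

by_phase = defaultdict(list)
for s in steps:
    by_phase[s["prefill_in_batch"]].append(s["latency_ms"])

clean_ms = sum(by_phase[False]) / len(by_phase[False])
interfered_ms = sum(by_phase[True]) / len(by_phase[True])
```

The split immediately shows decode steps slowing several-fold whenever a prefill shares the batch, which an aggregate step-latency histogram would average away.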
What LLM Inference Actually Needs
VGAC tracks what matters for LLM serving:
Phase-Aware Metrics
Separate prefill and decode latencies, GPU utilization per phase, and a phase imbalance ratio that tells you exactly which side is the bottleneck.
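One plausible shape for such a ratio (an illustrative definition, not VGAC's published formula):

```python
def phase_imbalance_ratio(prefill_util, decode_util, eps=1e-6):
    """Assumed definition: prefill GPU utilization over decode GPU
    utilization. ~1.0 is balanced; far from 1.0 means one phase is
    underutilized relative to the other."""
    return prefill_util / max(decode_util, eps)

ratio = phase_imbalance_ratio(prefill_util=0.92, decode_util=0.41)
starved = ("decode" if ratio > 1.2
           else "prefill" if ratio < 0.8
           else "balanced")
```

Here the ratio is about 2.2, pointing squarely at an underutilized decode phase; the thresholds (1.2, 0.8) are arbitrary placeholders.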
KV Cache Analytics
Hit rate, eviction rate, HBM utilization, fragmentation score, and optimization recommendations (e.g., "enable PagedAttention" or "reduce max sequence length").
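A sketch of how threshold-based recommendations of this kind can be derived; the rule names echo the examples above, but the thresholds are invented for illustration.

```python
def kv_cache_recommendations(hit_rate, eviction_rate, hbm_util, frag_score):
    """Map raw KV-cache metrics to actionable suggestions.
    All thresholds are illustrative assumptions."""
    recs = []
    if frag_score > 0.25:
        recs.append("enable PagedAttention")
    if hbm_util > 0.90 and eviction_rate > 0.10:
        recs.append("reduce max sequence length")
    if hit_rate < 0.80:
        recs.append("increase KV cache budget")
    return recs

# The scenario from earlier: a 95% hit rate alongside 30% fragmentation.
recs = kv_cache_recommendations(hit_rate=0.95, eviction_rate=0.15,
                                hbm_util=0.93, frag_score=0.30)
```

Note that the healthy hit rate suppresses one recommendation while the fragmentation and eviction metrics still fire, which is exactly why the metrics need to be tracked separately.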
NIXL Transfer Monitoring
Per-transfer latency, bandwidth utilization, backend selection (RDMA vs TCP), and scaling recommendations based on observed transfer patterns.
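A minimal per-transfer record (assumed schema, not NIXL's actual API) showing the kind of raw data this monitoring needs: enough to derive latency, achieved bandwidth, and the RDMA-vs-TCP backend mix.

```python
from dataclasses import dataclass

@dataclass
class KVTransfer:
    """Hypothetical per-transfer record for KV-cache movement."""
    bytes_moved: int
    duration_s: float
    backend: str  # "RDMA" or "TCP"

    @property
    def gbps(self):
        """Achieved bandwidth in gigabits per second."""
        return self.bytes_moved * 8 / self.duration_s / 1e9

# Invented samples: the same 512 MiB KV block over each backend.
transfers = [
    KVTransfer(512 * 2**20, 0.010, "RDMA"),
    KVTransfer(512 * 2**20, 0.180, "TCP"),
]
rdma_share = sum(t.backend == "RDMA" for t in transfers) / len(transfers)
```

With records like these, a falling `rdma_share` or a widening gap between per-backend bandwidth distributions becomes a concrete scaling signal rather than a guess.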
Disaggregation Scoring
A composite score that quantifies how effectively prefill and decode stages are separated across the serving fleet, with recommendations for topology changes.
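One way such a composite could be folded together; this formula and its weights are an assumption for illustration, not VGAC's published scoring.

```python
def disaggregation_score(prefill_on_prefill_nodes,
                         decode_on_decode_nodes,
                         cross_node_kv_hit_rate):
    """Blend (weights are illustrative assumptions) of two signals:
    how cleanly each phase runs on its dedicated pool, and whether
    the KV handoff between pools is actually succeeding."""
    placement = 0.5 * prefill_on_prefill_nodes + 0.5 * decode_on_decode_nodes
    return 0.7 * placement + 0.3 * cross_node_kv_hit_rate

# 90% of prefill and 80% of decode work on the right pools,
# with a 95% cross-node KV handoff success rate.
score = disaggregation_score(0.90, 0.80, 0.95)
```

A score trending down then maps back to its components: either requests are landing on the wrong pool (a topology or routing change) or KV handoffs are failing (a transfer-layer problem).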
You can't optimize what you can't observe. For LLM inference, the things worth observing are invisible to traditional monitoring.
See it in action
VGAC's inference analytics dashboard shows all of this in a single view. Try it out.
View on GitHub