LLM inference looks simple from the outside: prompt in, tokens out. Underneath, it's a complex multi-phase pipeline where traditional monitoring is blind to the bottlenecks that actually matter.
The Problem with Standard GPU Monitoring
Prometheus + Grafana gives you GPU utilization, memory usage, and throughput. For LLM inference, these metrics hide more than they reveal:
Prefill/Decode Imbalance
Prefill is compute-bound; decode is memory-bound. Average GPU utilization looks fine even when one phase is starved.
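A toy arithmetic sketch (numbers invented for illustration) of how a time-weighted average masks a starved decode phase:

```python
# Illustrative numbers only: a phase-blind monitor collapses per-phase
# utilization into a single average that looks healthy.
prefill_util = 0.90        # compute-bound prefill, near saturation
decode_util = 0.40         # memory-bound decode, starved
prefill_time_share = 0.50  # fraction of wall time spent in prefill

avg_util = (prefill_util * prefill_time_share
            + decode_util * (1 - prefill_time_share))
print(f"average: {avg_util:.0%}, decode phase: {decode_util:.0%}")
```

The blended number sits in the comfortable mid-range while the decode phase is running at less than half capacity.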
KV Cache Fragmentation
HBM fills with cached key-value pairs. A 95% hit rate can hide 30% fragmentation that causes eviction storms.
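One common way to quantify this (an assumed definition, not necessarily VGAC's) is to compare the largest contiguous free region against total free memory:

```python
def fragmentation_score(free_block_sizes):
    """0.0 = one contiguous free region; approaching 1.0 = badly fragmented.
    Assumed definition: 1 - largest_free_block / total_free."""
    total = sum(free_block_sizes)
    if total == 0:
        return 0.0
    return 1.0 - max(free_block_sizes) / total

# 3 GB free overall, but scattered across small holes: a long-context
# request needing a 1 GB contiguous KV region cannot be placed, so the
# allocator starts evicting despite plenty of "free" HBM.
holes_gb = [0.5, 0.25, 0.75, 0.5, 0.5, 0.5]
score = fragmentation_score(holes_gb)
```

A hit-rate gauge alone never surfaces this: the cache is serving reads fine right up until the eviction storm starts.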
NIXL Transfer Latency
NVIDIA's NIXL (NVIDIA Inference Xfer Library) moves KV caches between nodes. A 10ms network hiccup can cascade into 200ms tail latencies.
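The cascade mechanism can be sketched with a toy model (all numbers invented): one stalled transfer gates an entire continuous batch, so a single slow KV move shows up in dozens of requests' latencies at once.

```python
# Toy model: 1000 requests, decode steps normally ~2 ms. One KV-cache
# transfer stalls 10 ms, and the 64 requests batched behind it all
# inherit the stall -- one slow transfer, 64 inflated latencies.
batch_size = 64
n_requests = 1000
normal_step_ms = 2.0

request_latency = [normal_step_ms] * n_requests
for r in range(batch_size):           # requests gated by the stalled transfer
    request_latency[r] += 10.0

p99 = sorted(request_latency)[int(0.99 * n_requests) - 1]
```

Because the affected cohort (64 of 1000 requests, 6.4%) is larger than the top percentile, the p99 jumps from 2ms to 12ms even though mean transfer latency barely moved.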
Batch Scheduling Interference
Continuous batching means new requests join mid-decode. Without phase-aware metrics, you can't see the interference.
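A minimal sketch of what "phase-aware" means here, with an assumed per-step schema: tag each decode step with whether a prefill was admitted into the same batch, then compare the two populations.

```python
from collections import defaultdict

# Invented trace data. Without the prefill_in_batch tag, these two
# populations blur into one distribution and the interference is invisible.
steps = [
    {"latency_ms": 2.1, "prefill_in_batch": False},
    {"latency_ms": 2.0, "prefill_in_batch": False},
    {"latency_ms": 9.5, "prefill_in_batch": True},  # long prompt joined mid-decode
    {"latency_ms": 8.8, "prefill_in_batch": True},
]

by_phase = defaultdict(list)
for s in steps:
    by_phase[s["prefill_in_batch"]].append(s["latency_ms"])

clean_ms = sum(by_phase[False]) / len(by_phase[False])
interfered_ms = sum(by_phase[True]) / len(by_phase[True])
```

The split immediately shows decode steps slowing several-fold whenever a prefill shares the batch, which an aggregate step-latency histogram would average away.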
What LLM Inference Actually Needs
VGAC tracks what matters for LLM serving:
Phase-Aware Metrics
Separate prefill and decode latencies, GPU utilization per phase, and a phase imbalance ratio that tells you exactly which side is the bottleneck.
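One plausible shape for such a ratio (an illustrative definition, not VGAC's published formula):

```python
def phase_imbalance_ratio(prefill_util, decode_util, eps=1e-6):
    """Assumed definition: prefill GPU utilization over decode GPU
    utilization. ~1.0 is balanced; far from 1.0 means one phase is
    underutilized relative to the other."""
    return prefill_util / max(decode_util, eps)

ratio = phase_imbalance_ratio(prefill_util=0.92, decode_util=0.41)
starved = ("decode" if ratio > 1.2
           else "prefill" if ratio < 0.8
           else "balanced")
```

Here the ratio is about 2.2, pointing squarely at an underutilized decode phase; the thresholds (1.2, 0.8) are arbitrary placeholders.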
KV Cache Analytics
Hit rate, eviction rate, HBM utilization, fragmentation score, and optimization recommendations (e.g., "enable PagedAttention" or "reduce max sequence length").
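A sketch of how threshold-based recommendations of this kind can be derived; the rule names echo the examples above, but the thresholds are invented for illustration.

```python
def kv_cache_recommendations(hit_rate, eviction_rate, hbm_util, frag_score):
    """Map raw KV-cache metrics to actionable suggestions.
    All thresholds are illustrative assumptions."""
    recs = []
    if frag_score > 0.25:
        recs.append("enable PagedAttention")
    if hbm_util > 0.90 and eviction_rate > 0.10:
        recs.append("reduce max sequence length")
    if hit_rate < 0.80:
        recs.append("increase KV cache budget")
    return recs

# The scenario from earlier: a 95% hit rate alongside 30% fragmentation.
recs = kv_cache_recommendations(hit_rate=0.95, eviction_rate=0.15,
                                hbm_util=0.93, frag_score=0.30)
```

Note that the healthy hit rate suppresses one recommendation while the fragmentation and eviction metrics still fire, which is exactly why the metrics need to be tracked separately.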
NIXL Transfer Monitoring
Per-transfer latency, bandwidth utilization, backend selection (RDMA vs TCP), and scaling recommendations based on observed transfer patterns.
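A minimal per-transfer record (assumed schema, not NIXL's actual API) showing the kind of raw data this monitoring needs: enough to derive latency, achieved bandwidth, and the RDMA-vs-TCP backend mix.

```python
from dataclasses import dataclass

@dataclass
class KVTransfer:
    """Hypothetical per-transfer record for KV-cache movement."""
    bytes_moved: int
    duration_s: float
    backend: str  # "RDMA" or "TCP"

    @property
    def gbps(self):
        """Achieved bandwidth in gigabits per second."""
        return self.bytes_moved * 8 / self.duration_s / 1e9

# Invented samples: the same 512 MiB KV block over each backend.
transfers = [
    KVTransfer(512 * 2**20, 0.010, "RDMA"),
    KVTransfer(512 * 2**20, 0.180, "TCP"),
]
rdma_share = sum(t.backend == "RDMA" for t in transfers) / len(transfers)
```

With records like these, a falling `rdma_share` or a widening gap between per-backend bandwidth distributions becomes a concrete scaling signal rather than a guess.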
Disaggregation Scoring
A composite score that quantifies how effectively prefill and decode stages are separated across the serving fleet, with recommendations for topology changes.
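One way such a composite could be folded together; this formula and its weights are an assumption for illustration, not VGAC's published scoring.

```python
def disaggregation_score(prefill_on_prefill_nodes,
                         decode_on_decode_nodes,
                         cross_node_kv_hit_rate):
    """Blend (weights are illustrative assumptions) of two signals:
    how cleanly each phase runs on its dedicated pool, and whether
    the KV handoff between pools is actually succeeding."""
    placement = 0.5 * prefill_on_prefill_nodes + 0.5 * decode_on_decode_nodes
    return 0.7 * placement + 0.3 * cross_node_kv_hit_rate

# 90% of prefill and 80% of decode work on the right pools,
# with a 95% cross-node KV handoff success rate.
score = disaggregation_score(0.90, 0.80, 0.95)
```

A score trending down then maps back to its components: either requests are landing on the wrong pool (a topology or routing change) or KV handoffs are failing (a transfer-layer problem).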
You can't optimize what you can't observe. For LLM inference, the things worth observing are invisible to traditional monitoring.
See it in action
VGAC's inference analytics dashboard shows all of this in a single view. Try it out.
View on GitHub