VGAC started with a simple frustration: "Why is my job taking so long?" After years of working on GPU clusters — managing them, debugging queue slowdowns, watching teams waste hours refreshing status pages — I decided to build the tool I wished existed.
The Core Insight
GPU clusters behave nothing like CPU systems, yet everyone monitors them the same way. Standard dashboards show utilization graphs, but they can't tell you: "This job will wait 2 hours because these 3 jobs are holding memory they aren't using."
VGAC is purpose-built for how GPUs actually work. Not graphs, but answers. Not utilization percentages, but predicted wait times before you submit, updated in real time as the queue changes.
What We Built
The platform grew from a prediction API into a comprehensive observability system. Here's what it covers today:
Queue Intelligence
Wait-time distribution, bifurcation analysis, GPU block rate, percentile breakdowns (P50-P99), SLO tracking.
GPU Telemetry
Per-GPU utilization heatmaps, temperature, power draw, memory usage. Health scoring across the cluster.
Pattern Detection
AI-detected recurring patterns: peak-hour contention, cascading delays, memory pressure precursors, burst submissions.
LLM Inference Analytics
Prefill/decode phase analysis, KV cache health (hit rate, fragmentation, HBM utilization), disaggregation scoring.
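To make the cache metrics concrete, here is a hypothetical sketch of how hit rate, fragmentation, and block utilization might be derived for a paged KV cache. The function name, inputs, and formulas are illustrative assumptions, not VGAC's actual API:

```python
def kv_cache_health(hits: int, misses: int,
                    free_runs: list[int], total_blocks: int) -> dict:
    """Toy health metrics for a paged KV cache (illustrative only).

    hits/misses   -- lookup counters for the cache
    free_runs     -- lengths of contiguous free-block runs
    total_blocks  -- total blocks in the cache
    """
    total_free = sum(free_runs)
    return {
        # Fraction of lookups served from cache.
        "hit_rate": hits / (hits + misses) if hits + misses else 0.0,
        # External fragmentation: 1 - (largest contiguous free run / total free).
        "fragmentation": 1 - max(free_runs) / total_free if total_free else 0.0,
        # Fraction of blocks currently occupied.
        "utilization": 1 - total_free / total_blocks,
    }
```

A cache can report a high hit rate and still be unhealthy: if the free space is shredded into small runs, long prefills can't be admitted even though plenty of memory is nominally free.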
Calibrated Predictions
AUROC 0.969, ECE 0.005. Trained on 11,982 GPU jobs. Sub-10ms inference.
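The ECE figure is worth unpacking: 0.005 means predicted probabilities and observed frequencies almost coincide. A minimal sketch of the standard binned ECE computation (not VGAC's actual implementation):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then take the bin-weighted
    average of |empirical accuracy - mean confidence| over the bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # First bin is closed on the left so p == 0.0 is not dropped.
        mask = (probs >= lo if i == 0 else probs > lo) & (probs <= hi)
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return float(ece)
```

A model that says "90%" and is right 90% of the time scores near zero; a model that says "100%" and is always wrong scores 1.0. That property is what makes ECE usable as a gate, not just a leaderboard metric.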
Autonomous Agents
Five agents (Observer, Predictor, Calibrator, Actor, Copilot) with calibration-gated autonomy.
The Architecture
VGAC runs as two independent tiers. The Platform Tier handles observability: a FastAPI backend with 150+ endpoints, ClickHouse for analytics, Redis for caching, and a Next.js frontend with 12 dashboard pages. The Agentic Tier is fully serverless: Lambda functions for each agent, DynamoDB for state, API Gateway for routing, and Bedrock for the Copilot's reasoning.
The key architectural decision was calibration-gated autonomy. Every agent checks the model's current calibration score before taking action. Above 0.85, it acts autonomously; between 0.60 and 0.85, it recommends to a human; below 0.60, it escalates. The system is aware of its own reliability.
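The gate itself is tiny. A sketch in Python; the names are mine, and the exact boundary handling (>= vs. >) is an illustrative assumption since the post only specifies the ranges:

```python
from enum import Enum

class AutonomyLevel(Enum):
    ACT = "act autonomously"
    RECOMMEND = "recommend to a human"
    ESCALATE = "escalate"

def calibration_gate(score: float,
                     act_above: float = 0.85,
                     recommend_above: float = 0.60) -> AutonomyLevel:
    """Map the model's current calibration score to an autonomy level."""
    if score >= act_above:
        return AutonomyLevel.ACT
    if score >= recommend_above:
        return AutonomyLevel.RECOMMEND
    return AutonomyLevel.ESCALATE
```

The interesting part isn't the three-way branch; it's that the input is a live calibration score rather than a static feature flag, so autonomy degrades automatically when the model drifts.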
Lessons Learned
The biggest insight: this isn't a prediction problem. It's an observability problem. Researchers don't just need wait-time estimates — they need to understand why the queue is slow, what patterns are causing contention, and how to configure their jobs differently.
LLM inference workloads taught me that the next generation of GPU scheduling looks fundamentally different. Prefill/decode phase imbalance, KV cache fragmentation, and NIXL transfer latency are the real bottlenecks — invisible to traditional schedulers.
Calibration isn't just a model metric. It's the bridge between "good predictions" and "trustworthy automation."
VGAC is open source
Explore the full platform, run it locally, or deploy to your own cluster.
View on GitHub