
Building VGAC: From Idea to Platform

March 12, 2026 · 10 min read

Andrew Espira

Founder & Lead Engineer

VGAC started with a simple frustration: "Why is my job taking so long?" After years of working on GPU clusters — managing them, debugging queue slowdowns, watching teams waste hours refreshing status pages — I decided to build the tool I wished existed.

The Core Insight

GPU clusters behave nothing like CPU systems, but everyone monitors them the same way. Standard dashboards show utilization graphs, but they can't tell you: "This job will wait 2 hours because these 3 jobs are holding memory they're not using."

VGAC is purpose-built for how GPUs actually work. Not graphs — answers. Not utilization percentages — predicted wait times before you submit, updated in real-time as the queue changes.

What We Built

The platform grew from a prediction API into a comprehensive observability system. Here's what it covers today:

Queue Intelligence

Wait-time distribution, bifurcation analysis, GPU block rate, percentile breakdowns (P50-P99), SLO tracking.
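
The percentile breakdowns above can be sketched with nothing but the standard library. This is an illustrative helper, not VGAC's implementation; the function name and the sample wait times are made up:

```python
import statistics

def wait_time_percentiles(wait_seconds):
    """Summarize queue wait times (seconds) at the percentiles a
    dashboard would report. quantiles(n=100) yields 99 cut points,
    so cut point p-1 corresponds to the p-th percentile."""
    cuts = statistics.quantiles(wait_seconds, n=100, method="inclusive")
    return {f"p{p}": cuts[p - 1] for p in (50, 90, 95, 99)}

# Hypothetical sample: most jobs start quickly, a long tail waits hours.
waits = [30, 45, 60, 90, 120, 300, 600, 1200, 2400, 7200]
print(wait_time_percentiles(waits))
```

The P50/P99 gap is exactly the "bifurcation" signal: a low median with an extreme tail means the queue is fine for most jobs and terrible for a few.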

GPU Telemetry

Per-GPU utilization heatmaps, temperature, power draw, memory usage. Health scoring across the cluster.
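
As a rough sketch of how telemetry rolls up into a single health number, here is a toy scorer. The thresholds, equal weights, and function name are illustrative assumptions, not VGAC's actual formula:

```python
def gpu_health_score(temp_c, power_w, mem_used_frac,
                     temp_limit=85.0, power_limit=400.0):
    """Toy per-GPU health score in [0, 1] from three headroom signals."""
    thermal = max(0.0, 1.0 - temp_c / temp_limit)    # headroom below throttle temp
    power   = max(0.0, 1.0 - power_w / power_limit)  # headroom below board limit
    memory  = 1.0 - mem_used_frac                    # free HBM fraction
    # Equal weighting is a placeholder; a real scorer would be tuned per cluster.
    return round((thermal + power + memory) / 3, 3)

print(gpu_health_score(temp_c=62, power_w=280, mem_used_frac=0.55))
```

Averaging scores across devices gives the cluster-wide health figure; sorting by score surfaces the GPUs worth investigating first.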

Pattern Detection

AI-detected recurring patterns: peak-hour contention, cascading delays, memory pressure precursors, burst submissions.

LLM Inference Analytics

Prefill/decode phase analysis, KV cache health (hit rate, fragmentation, HBM utilization), disaggregation scoring.
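
Two of those KV-cache signals can be shown in toy form. The inputs and the fragmentation definition below are illustrative assumptions (one common definition: free blocks exist, but the largest contiguous run is a small fraction of total free space):

```python
def kv_cache_stats(requests_total, cache_hits, free_block_runs):
    """Toy KV-cache health signals: hit rate and fragmentation.
    free_block_runs lists the sizes of contiguous runs of free blocks."""
    hit_rate = cache_hits / requests_total
    free = sum(free_block_runs)
    # Fragmentation: how much free space is NOT in the largest contiguous run.
    fragmentation = 0.0 if free == 0 else 1.0 - max(free_block_runs) / free
    return hit_rate, fragmentation

hr, frag = kv_cache_stats(1000, 840, free_block_runs=[64, 8, 8, 16, 4])
print(f"hit rate {hr:.2f}, fragmentation {frag:.2f}")
```

High fragmentation means a long prefill may fail to find contiguous cache space even though total free memory looks adequate — exactly the kind of condition a utilization graph hides.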

Calibrated Predictions

AUROC 0.969, ECE 0.005. Trained on 11,982 GPU jobs. Sub-10ms inference.
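
ECE (expected calibration error) measures how far predicted confidence drifts from observed accuracy; 0.005 means the two track each other almost exactly. A minimal sketch of the standard binned computation (this is the textbook metric, not VGAC's internal code):

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: weighted average gap between mean predicted confidence and
    observed positive rate, over equal-width probability bins."""
    n = len(probs)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, p in enumerate(probs)
               if lo < p <= hi or (b == 0 and p == 0)]
        if not idx:
            continue
        avg_conf = sum(probs[i] for i in idx) / len(idx)
        frac_pos = sum(labels[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - frac_pos)
    return ece

# Hypothetical data: 90% positives predicted at 90% confidence is well calibrated.
probs  = [0.9] * 10
labels = [1] * 9 + [0]
print(f"ECE = {expected_calibration_error(probs, labels):.3f}")
```

AUROC says the model ranks jobs correctly; ECE says its probabilities mean what they claim — the second property is what makes the autonomy gating below possible.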

Autonomous Agents

Five agents (Observer, Predictor, Calibrator, Actor, Copilot) with calibration-gated autonomy.

The Architecture

VGAC runs as two independent tiers. The Platform Tier handles observability: a FastAPI backend with 150+ endpoints, ClickHouse for analytics, Redis for caching, and a Next.js frontend with 12 dashboard pages. The Agentic Tier is fully serverless: Lambda functions for each agent, DynamoDB for state, API Gateway for routing, and Bedrock for the Copilot's reasoning.

The key architectural decision was calibration-gated autonomy. Every agent checks the model's calibration score before taking action: above 0.85, it acts autonomously; between 0.60 and 0.85, it recommends the action to a human; below 0.60, it escalates. This means the system is explicit about its own reliability.
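
The gating policy is small enough to sketch directly. The thresholds come from the post; the function and label names are illustrative, and I've assumed the boundaries are inclusive at the top of each band:

```python
def autonomy_gate(calibration_score: float) -> str:
    """Map the model's current calibration score to an action policy."""
    if calibration_score >= 0.85:
        return "act"        # agent acts autonomously
    if calibration_score >= 0.60:
        return "recommend"  # surface a recommendation; a human decides
    return "escalate"       # calibration too poor to trust; hand off

for score in (0.92, 0.70, 0.40):
    print(score, "->", autonomy_gate(score))
```

The point of routing every agent through one gate is that a single degraded calibration score throttles the whole system at once, rather than each agent failing independently.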

150+ API endpoints · 12 dashboard pages · 5 autonomous agents · 11,982 jobs trained on

Lessons Learned

The biggest insight: this isn't a prediction problem. It's an observability problem. Researchers don't just need wait time estimates — they need to understand why the queue is slow, what patterns are causing contention, and how to configure their jobs differently.

LLM inference workloads taught me that the next generation of GPU scheduling looks fundamentally different. Prefill/decode phase imbalance, KV cache fragmentation, and NIXL transfer latency are the real bottlenecks — invisible to traditional schedulers.

Calibration isn't just a model metric. It's the bridge between "good predictions" and "trustworthy automation."

VGAC is open source

Explore the full platform, run it locally, or deploy to your own cluster.

View on GitHub