Most "AI agent" systems follow a simple pattern: observe something, run it through a model, take an action. The hard part is deciding when the model is trustworthy enough to act on its own. VGAC addresses this with calibration-gated autonomy: agents act on their own only while the model's predictions are measurably well calibrated.
Five Agents, One Feedback Loop
VGAC's agentic layer consists of five specialized agents, each with a distinct role:
Observer Agent
Ingests GPU telemetry, cluster state, and queue events. Builds a real-time model of what's happening across the cluster.
Predictor Agent
Runs the calibrated ML model against current state. Outputs wait-time predictions with confidence intervals.
Calibrator Agent
Monitors prediction accuracy in real time. Triggers recalibration when the Prediction Impact Index (PII) drifts.
Actor Agent
Executes autonomous actions: node scaling, job preemption, priority adjustments. Acts only when the calibration score exceeds a threshold.
Copilot Agent
Powered by Amazon Bedrock. Provides natural language explanations, answers 'why is my job stuck?' queries, and generates Slurm scripts.
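To make the Calibrator's job concrete, here is a minimal sketch of one way to track calibration for interval predictions. It is not VGAC's implementation: the PII formula isn't given in this post, so the sketch substitutes a simpler, widely used measure, rolling empirical coverage (the fraction of observed wait times that fell inside the predicted interval). The names `IntervalPrediction` and `CoverageMonitor`, and the window, nominal coverage, and tolerance values, are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class IntervalPrediction:
    """A wait-time forecast with a confidence interval, in minutes (hypothetical shape)."""
    lower: float
    upper: float


class CoverageMonitor:
    """Rolling empirical coverage of prediction intervals (illustrative, not VGAC's PII)."""

    def __init__(self, window: int = 200, nominal: float = 0.90, tolerance: float = 0.05):
        # Each entry is True when the observed wait fell inside the predicted interval.
        self.hits = deque(maxlen=window)
        self.nominal = nominal      # coverage the intervals claim to provide (assumed)
        self.tolerance = tolerance  # allowed gap before recalibration is requested (assumed)

    def record(self, pred: IntervalPrediction, observed_wait: float) -> None:
        self.hits.append(pred.lower <= observed_wait <= pred.upper)

    def coverage(self) -> float:
        # With no observations yet, report 0.0 so downstream gating stays conservative.
        return sum(self.hits) / len(self.hits) if self.hits else 0.0

    def needs_recalibration(self) -> bool:
        return bool(self.hits) and abs(self.coverage() - self.nominal) > self.tolerance


monitor = CoverageMonitor(window=4, nominal=0.90, tolerance=0.05)
for observed in (12.0, 18.0, 25.0, 40.0):
    monitor.record(IntervalPrediction(lower=10.0, upper=30.0), observed)

print(monitor.coverage())             # 0.75: three of the four waits fell inside the interval
print(monitor.needs_recalibration())  # True: 0.75 is more than 0.05 away from 0.90
```

A bounded window keeps the score responsive to recent drift; a longer window would be steadier but slower to notice that the model has gone stale.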
The Gating Mechanism
The key innovation is that the Actor agent doesn't just check whether the prediction crosses a threshold; it also checks whether the model's calibration is above a threshold. The flow:
Observer detects queue anomaly (e.g. GPU jobs waiting 3x longer than normal)
Predictor forecasts that wait times will exceed SLO in the next 30 minutes
Calibrator confirms: calibration score is 0.91, PII is within bounds
Actor autonomously scales up 2 GPU nodes and adjusts job priorities
Copilot generates a natural-language explanation for the cluster admin
If the Calibrator had reported a score below 0.60, the Actor would have escalated to a human instead of acting. This keeps the system from acting autonomously when the model's calibration does not support its prediction.
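The gate itself can be sketched in a few lines. This is an illustration under stated assumptions rather than the project's actual code: `Decision`, `CalibrationReport`, and `gate` are hypothetical names, the 0.60 cutoff is taken from the example above and treated as configurable, and requiring the PII to be in bounds before acting is our reading of the flow.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    NO_ACTION = "no_action"  # nothing predicted that needs a response
    ACT = "act"              # Actor proceeds autonomously
    ESCALATE = "escalate"    # hand the decision to a human


@dataclass(frozen=True)
class CalibrationReport:
    score: float         # calibration score, e.g. 0.91 in the walkthrough above
    pii_in_bounds: bool  # whether the Prediction Impact Index is within bounds


# Cutoff from the example above; an assumption, not a documented VGAC default.
CALIBRATION_THRESHOLD = 0.60


def gate(slo_breach_predicted: bool, report: CalibrationReport,
         threshold: float = CALIBRATION_THRESHOLD) -> Decision:
    """Decide whether the Actor may act on a prediction or must escalate."""
    if not slo_breach_predicted:
        return Decision.NO_ACTION
    # Scores below the threshold, or a PII outside its bounds, go to a human.
    if report.score < threshold or not report.pii_in_bounds:
        return Decision.ESCALATE
    return Decision.ACT


print(gate(True, CalibrationReport(score=0.91, pii_in_bounds=True)))  # Decision.ACT
print(gate(True, CalibrationReport(score=0.55, pii_in_bounds=True)))  # Decision.ESCALATE
```

Note that the prediction and the calibration check are separate inputs: a confident-looking forecast from a poorly calibrated model still escalates.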
Selective Evaluation
Observable Decision Logging
Every autonomous decision is logged with full context: what was observed, what was predicted, what the calibration score was, and what action was taken (or deferred). This creates a complete audit trail and enables post-hoc analysis of agent behavior.
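As a rough sketch of what one such log entry could look like, the snippet below serializes a decision as a single JSON line. The field names and the `DecisionRecord` and `to_log_line` helpers are hypothetical, chosen to mirror the four items listed above; the actual log schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class DecisionRecord:
    """One audit-trail entry (hypothetical schema)."""
    observed: str             # what the Observer saw
    predicted: str            # what the Predictor forecast
    calibration_score: float  # score reported by the Calibrator
    action: str               # action taken, or the action that was deferred
    deferred: bool            # True when the decision was escalated to a human
    timestamp: str = field(default_factory=_utc_now)


def to_log_line(record: DecisionRecord) -> str:
    # One JSON object per line keeps the log easy to append to and to grep.
    return json.dumps(asdict(record), sort_keys=True)


record = DecisionRecord(
    observed="GPU jobs waiting 3x longer than normal",
    predicted="wait times exceed SLO within 30 minutes",
    calibration_score=0.91,
    action="scale up 2 GPU nodes and adjust job priorities",
    deferred=False,
)
line = to_log_line(record)
print(line)
```

Logging deferred decisions alongside executed ones is what makes post-hoc analysis possible: you can later ask how often escalations would have been correct had the Actor proceeded.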
Dive deeper
The full agent implementation is open source on GitHub, where you can see how calibration gating works in practice.