When we tell people VGAC's prediction model has an AUROC of 0.969, they're impressed. When we tell them the ECE is 0.005, they ask: "What's ECE?" That second number is the one that actually matters for building trustworthy AI systems.
Accuracy vs. Calibration: The Difference
Accuracy (measured by AUROC) tells you whether the model can distinguish between jobs that will wait a long time and jobs that won't. A high AUROC means the model ranks risks correctly.
Calibration (measured by ECE — Expected Calibration Error) tells you something deeper: when the model says there's a 70% chance of a long wait, does that actually happen 70% of the time? A model can be highly accurate but badly calibrated — it ranks correctly but the probabilities are wrong.
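To make ECE concrete, here is a minimal sketch of the standard binned estimator: group predictions by confidence, and average the gap between predicted probability and observed frequency across bins. The ten-bin scheme is a common convention, not necessarily the one VGAC uses.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin predictions by confidence, then take the weighted average of
    |mean predicted probability - observed frequency| across bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # First bin is closed on the left so probability 0.0 is counted.
        mask = (probs >= lo) & (probs <= hi) if i == 0 else (probs > lo) & (probs <= hi)
        if not mask.any():
            continue
        gap = abs(probs[mask].mean() - labels[mask].mean())
        ece += (mask.sum() / len(probs)) * gap
    return ece

# Toy example of perfect calibration: predictions of 0.7 that come true 70% of the time.
probs = np.array([0.7] * 10)
labels = np.array([1] * 7 + [0] * 3)
ece = expected_calibration_error(probs, labels)
print(ece)  # ~0 for this toy case
```

A model that always said 90% while being right only half the time would score an ECE of 0.4 on the same estimator, which is the "high accuracy, bad calibration" failure mode described next.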
High Accuracy, Bad Calibration
The model says there's a 90% chance of a long wait for most jobs. Its ranking is correct, but 90% doesn't mean 90%: you can't trust the number itself.
High Accuracy, Good Calibration
Model says 70% and it's right 70% of the time. Says 30% and it's right 30% of the time. The probabilities are meaningful.
Why Calibration Unlocks Autonomy
This distinction is critical for autonomous systems. In VGAC, we have agents that can take actions — scale up nodes, preempt lower-priority jobs, trigger recalibration. The question is: when should they act on their own vs. ask a human?
If the model's probabilities are well-calibrated, we can use them directly as confidence scores. A 95% prediction from a calibrated model genuinely means "we're very confident." That lets us gate autonomous actions:
Calibration-Gated Autonomy
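Here is a sketch of what such a gate could look like. The thresholds, tier names, and `Decision` structure below are illustrative assumptions, not VGAC's actual policy:

```python
from dataclasses import dataclass

# Illustrative thresholds -- real values would be tuned per action and per risk level.
ACT_THRESHOLD = 0.95      # act autonomously above this confidence
SUGGEST_THRESHOLD = 0.70  # below ACT, propose the action to a human; below this, just log

@dataclass
class Decision:
    action: str   # "autonomous", "ask_human", or "log_only"
    reason: str

def gate(probability: float) -> Decision:
    """Map a calibrated probability to an autonomy tier.

    This is only safe if the probabilities are well calibrated: a
    miscalibrated 0.95 would route risky actions past the human."""
    if probability >= ACT_THRESHOLD:
        return Decision("autonomous", f"confidence {probability:.2f} >= {ACT_THRESHOLD}")
    if probability >= SUGGEST_THRESHOLD:
        return Decision("ask_human", f"confidence {probability:.2f} in review band")
    return Decision("log_only", f"confidence {probability:.2f} too low to act on")

print(gate(0.97).action)  # autonomous
print(gate(0.80).action)  # ask_human
print(gate(0.40).action)  # log_only
```

The thresholds carry meaning only because the underlying probabilities do; that is the whole point of the next paragraph.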
Without calibration, you can't do this safely. A miscalibrated model might say "95% confident" when it's actually 60% confident. Autonomous actions based on that will go wrong, erode trust, and eventually get the whole system turned off.
VGAC's Numbers
An ECE of 0.005 means our predicted probabilities are off by half a percentage point on average. When VGAC says there's a 70% chance your job will wait more than 5 minutes, the observed rate across such predictions stays within about half a point of 70%. (ECE is an average over confidence bins, so an individual prediction can deviate more, but in aggregate the probabilities track reality closely.) That's the kind of reliability that makes autonomous operations safe.
The Prediction Impact Index
We go one step further with the Prediction Impact Index (PII) — a metric that quantifies the real-world cost of miscalibration:
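The exact PII formula isn't given here, so as a sketch: one plausible formulation weights each bin's calibration gap by the cost of the actions taken on those predictions, so that miscalibration on high-stakes decisions counts for more. Everything below (the weighting scheme, the threshold value, and the function names) is a hypothetical illustration, not VGAC's definition:

```python
import numpy as np

def prediction_impact_index(probs, labels, action_costs, n_bins=10):
    """Hypothetical PII: cost-weighted calibration error.

    Each bin's calibration gap is weighted by the average cost of the
    actions triggered by those predictions. This formulation is an
    assumption; the actual VGAC definition may differ."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    costs = np.asarray(action_costs, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    pii = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        mask = (probs >= lo) & (probs <= hi) if i == 0 else (probs > lo) & (probs <= hi)
        if not mask.any():
            continue
        gap = abs(probs[mask].mean() - labels[mask].mean())
        pii += (mask.sum() / len(probs)) * gap * costs[mask].mean()
    return pii

PII_THRESHOLD = 0.05  # illustrative; an operational value would be tuned

def maybe_recalibrate(probs, labels, costs, recalibrate):
    """Trigger recalibration when cost-weighted miscalibration grows too large."""
    if prediction_impact_index(probs, labels, costs) > PII_THRESHOLD:
        recalibrate()
        return True
    return False
```

Under this formulation, a model that says 90% but is right only half the time on unit-cost actions accumulates a PII of 0.4 and trips the recalibration trigger.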
When PII exceeds a threshold, the Calibrator agent automatically triggers model recalibration. This creates a self-improving loop: the model monitors its own reliability and fixes itself before the predictions degrade enough to cause problems.
Calibration isn't a nice-to-have metric. It's the foundation that determines whether your AI system can be trusted to act on its own.
Explore the codebase
VGAC is open source. See how calibration-gated autonomy works in practice.
View on GitHub