Devoured - April 30, 2026
DeepMind ProEval for GenAI Evaluation (GitHub Repo)

DeepMind's ProEval framework cuts generative AI evaluation costs by up to 100x, using surrogate models to estimate performance from roughly 1% of the usual benchmark samples.

What: ProEval is an open-source evaluation framework that uses Gaussian Process surrogate models and transfer learning to estimate LLM performance metrics and discover failure patterns while requiring only a fraction of the usual evaluation samples.
Why it matters: Evaluating large language models on comprehensive benchmarks is computationally expensive and time-consuming, especially when testing multiple model variants or conducting safety assessments, making cost-effective evaluation critical for iterative development.
Takeaway: Install via pip and test it on your models using pre-configured benchmarks like GSM8K and MMLU with the BQPriorSampler class.
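
To see which benchmarks ship pre-configured, the DATASET_CONFIGS mapping used in the repo's Quick Start can presumably be inspected directly. A two-line sketch, assuming it is a plain dict keyed by benchmark name (the README does not document its type):

from proeval import DATASET_CONFIGS
print(sorted(DATASET_CONFIGS))   # assumption: a plain dict keyed by benchmark name
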
Deep dive
  • Framework achieves ±1% accuracy in error rate estimation using only ~1% of benchmark samples compared to full evaluation
  • Uses Bayesian Quadrature with Gaussian Process surrogates (BQ-SF, BQ-RPF variants) to model patterns in model performance (see the sketch after this list)
  • Surrogate models transfer knowledge across benchmarks, generalizing to new models without retraining from scratch
  • Proactively discovers diverse failure modes and edge cases under strict evaluation budgets rather than just estimating aggregate metrics
  • Validated on multiple benchmark types including reasoning tasks (GSM8K, MMLU, StrategyQA), safety (Jigsaw), and classification
  • Designed to integrate into existing GenAI evaluation pipelines across modalities through a simple API
  • Includes pre-trained models and dataset configurations for common benchmarks to enable immediate use
  • Released under Apache 2.0 license with accompanying arXiv paper (2604.23099) from April 2026
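
The estimation idea behind these numbers can be sketched in a few lines. The following is a toy illustration with synthetic data and scikit-learn, not ProEval's implementation: fit a GP surrogate on a ~1% subset of item-level outcomes, then average the posterior mean over the whole benchmark, which is the Bayesian Quadrature estimate of the aggregate error rate under uniform weighting of items.

# Toy sketch (not ProEval code): BQ-style estimation of a mean error rate
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
N = 2000                                                   # benchmark size
x = rng.normal(size=(N, 8))                                # stand-in item features (e.g. embeddings)
true_err = (x[:, 0] + 0.3 * x[:, 1] > 0.5).astype(float)   # synthetic per-item 0/1 errors

budget = N // 100                                          # evaluate only ~1% of items
idx = rng.choice(N, size=budget, replace=False)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0), alpha=0.1)
gp.fit(x[idx], true_err[idx])                              # surrogate over item features

# Averaging the posterior mean over all items gives the BQ estimate of the
# aggregate metric under a uniform measure over the benchmark.
estimate = gp.predict(x).mean()
print(f"estimate={estimate:.3f}  true={true_err.mean():.3f}")
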
Decoder
  • Surrogate models: Statistical models that approximate expensive-to-evaluate functions, allowing predictions without running full evaluations
  • Gaussian Process (GP): A probabilistic model that provides uncertainty estimates along with predictions, useful for deciding which samples to evaluate next (illustrated in the sketch after this list)
  • Bayesian Quadrature (BQ): A method that uses Bayesian inference to estimate integrals like average performance efficiently with minimal samples
  • BQ-SF, BQ-RPF: Specific variants of Bayesian Quadrature with different prior formulations used in ProEval
  • Transfer learning: Applying knowledge learned from evaluating previous models to estimate new model performance faster
  • MAE: Mean Absolute Error, measuring the average difference between estimated and true values
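
Two of these ideas are easy to see in miniature. The toy below (again not ProEval's code: synthetic data, scikit-learn's GaussianProcessRegressor) uses the GP's posterior uncertainty to pick which item to evaluate next under a fixed budget, then reports the absolute error of the final estimate, the per-run quantity that MAE averages over repeated runs.

# Toy sketch (not ProEval code): GP-guided sample selection under a budget
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(500, 2))                      # stand-in item features
y = (np.sin(x[:, 0]) + x[:, 1] > 1.0).astype(float)        # synthetic per-item 0/1 errors

evaluated = list(rng.choice(500, size=5, replace=False))   # small random seed set
for _ in range(25):                                        # spend 25 more evaluations
    gp = GaussianProcessRegressor(alpha=0.1).fit(x[evaluated], y[evaluated])
    mean, std = gp.predict(x, return_std=True)
    std[evaluated] = -np.inf                               # never re-evaluate known items
    evaluated.append(int(np.argmax(std)))                  # most uncertain item next

abs_err = abs(mean.mean() - y.mean())                      # one run's |estimate - truth|
print(f"estimated={mean.mean():.3f}  true={y.mean():.3f}  abs error={abs_err:.3f}")
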
Original article

ProEval

Slash GenAI evaluation costs by up to 100x while actively discovering model failure patterns to guide better AI development.

  1. 💰 Cut GenAI eval costs up to 100× — achieve ±1% accuracy with a fraction of the samples
  2. 🔍 Discover failure cases — proactively surface diverse bugs under strict evaluation budgets
  3. 🧠 Transfer learning over benchmarks — pre-trained GP surrogates generalize to new models instantly
  4. 🧩 Easy integration — plugs into existing GenAI evaluation systems across different modalities
  5. Validated on reasoning, safety & classification — GSM8K, MMLU, StrategyQA, Jigsaw, and more

Installation

pip install -r requirements.txt

Quick Start

from proeval import BQPriorSampler, LLMPredictor, DATASET_CONFIGS
from proeval.sampler import load_predictions, extract_model_predictions
import numpy as np

# Estimate a model's error rate with ~1% of the data
sampler = BQPriorSampler(noise_variance=0.3)
result = sampler.sample(predictions="svamp", target_model="gemini25_flash", budget=50)

# Compare against the true error rate
df = load_predictions("svamp")
pred_matrix, model_names = extract_model_predictions(df)
true_mean = np.mean(pred_matrix[:, model_names.index("gemini25_flash")])

print(f"Estimated error rate: {result.estimates[-1]:.4f}")
print(f"MAE: {result.mae(true_mean):.4f}")

Experiments

Here is an example of how to run the experiments:

python -m experiment.exp_performance_estimation --dataset svamp --n-runs 5
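
To repeat the experiment across several datasets, a simple shell loop over the same flags works. Only "svamp" and the flags shown above are confirmed here; the other dataset keys below are guesses, so check DATASET_CONFIGS for the real ones.

for ds in svamp gsm8k mmlu; do   # dataset keys other than svamp are assumptions
  python -m experiment.exp_performance_estimation --dataset "$ds" --n-runs 5
done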

You can find comprehensive experiment details and dataset settings here.

Citation

If this work helps your research or project, please cite our tech report. Thank you!

@article{huang2026proeval,
  title={{{ProEval}: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation}},
  author={Huang, Yizheng and Zeng, Wenjun and Kumaresan, Aditi and Wang, Zi},
  journal={arXiv preprint arXiv:2604.23099 [cs.LG]},
  year={2026},
  url={https://arxiv.org/abs/2604.23099}
}