Devoured - April 30, 2026
DeepMind ProEval for GenAI Evaluation (GitHub Repo)

DeepMind's ProEval framework cuts generative AI evaluation costs by up to 100x, using surrogate models to estimate performance from roughly 1% of the usual benchmark samples.

What: ProEval is an open-source evaluation framework that uses Gaussian Process surrogate models and transfer learning to estimate LLM performance metrics and discover failure patterns while requiring only a fraction of the usual evaluation samples.
Why it matters: Evaluating large language models on comprehensive benchmarks is computationally expensive and time-consuming, especially when testing multiple model variants or conducting safety assessments, making cost-effective evaluation critical for iterative development.
Takeaway: Install via pip and test it on your models using pre-configured benchmarks like GSM8K and MMLU with the BQPriorSampler class.
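
To see which benchmarks ship pre-configured, the DATASET_CONFIGS mapping used in the repo's Quick Start can presumably be inspected directly. A two-line sketch, assuming it is a plain dict keyed by benchmark name (the README does not document its type):

from proeval import DATASET_CONFIGS
print(sorted(DATASET_CONFIGS))   # assumption: a plain dict keyed by benchmark name
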
Deep dive
  • Framework achieves ±1% accuracy in error rate estimation using only ~1% of benchmark samples compared to full evaluation
  • Uses Bayesian Quadrature with Gaussian Process surrogates (BQ-SF, BQ-RPF variants) to model patterns in model performance (see the sketch after this list)
  • Surrogate models transfer knowledge across benchmarks, generalizing to new models without retraining from scratch
  • Proactively discovers diverse failure modes and edge cases under strict evaluation budgets rather than just estimating aggregate metrics
  • Validated on multiple benchmark types including reasoning tasks (GSM8K, MMLU, StrategyQA), safety (Jigsaw), and classification
  • Designed to integrate into existing GenAI evaluation pipelines across modalities through a simple API
  • Includes pre-trained models and dataset configurations for common benchmarks to enable immediate use
  • Released under Apache 2.0 license with accompanying arXiv paper (2604.23099) from April 2026
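
The estimation idea behind these numbers can be sketched in a few lines. The following is a toy illustration with synthetic data and scikit-learn, not ProEval's implementation: fit a GP surrogate on a ~1% subset of item-level outcomes, then average the posterior mean over the whole benchmark, which is the Bayesian Quadrature estimate of the aggregate error rate under uniform weighting of items.

# Toy sketch (not ProEval code): BQ-style estimation of a mean error rate
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
N = 2000                                                   # benchmark size
x = rng.normal(size=(N, 8))                                # stand-in item features (e.g. embeddings)
true_err = (x[:, 0] + 0.3 * x[:, 1] > 0.5).astype(float)   # synthetic per-item 0/1 errors

budget = N // 100                                          # evaluate only ~1% of items
idx = rng.choice(N, size=budget, replace=False)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0), alpha=0.1)
gp.fit(x[idx], true_err[idx])                              # surrogate over item features

# Averaging the posterior mean over all items gives the BQ estimate of the
# aggregate metric under a uniform measure over the benchmark.
estimate = gp.predict(x).mean()
print(f"estimate={estimate:.3f}  true={true_err.mean():.3f}")
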
Decoder
  • Surrogate models: Statistical models that approximate expensive-to-evaluate functions, allowing predictions without running full evaluations
  • Gaussian Process (GP): A probabilistic model that provides uncertainty estimates along with predictions, useful for deciding which samples to evaluate next (illustrated in the sketch after this list)
  • Bayesian Quadrature (BQ): A method that uses Bayesian inference to estimate integrals like average performance efficiently with minimal samples
  • BQ-SF, BQ-RPF: Specific variants of Bayesian Quadrature with different prior formulations used in ProEval
  • Transfer learning: Applying knowledge learned from evaluating previous models to estimate new model performance faster
  • MAE: Mean Absolute Error, measuring the average difference between estimated and true values
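
Two of these ideas are easy to see in miniature. The toy below (again not ProEval's code: synthetic data, scikit-learn's GaussianProcessRegressor) uses the GP's posterior uncertainty to pick which item to evaluate next under a fixed budget, then reports the absolute error of the final estimate, the per-run quantity that MAE averages over repeated runs.

# Toy sketch (not ProEval code): GP-guided sample selection under a budget
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(500, 2))                      # stand-in item features
y = (np.sin(x[:, 0]) + x[:, 1] > 1.0).astype(float)        # synthetic per-item 0/1 errors

evaluated = list(rng.choice(500, size=5, replace=False))   # small random seed set
for _ in range(25):                                        # spend 25 more evaluations
    gp = GaussianProcessRegressor(alpha=0.1).fit(x[evaluated], y[evaluated])
    mean, std = gp.predict(x, return_std=True)
    std[evaluated] = -np.inf                               # never re-evaluate known items
    evaluated.append(int(np.argmax(std)))                  # most uncertain item next

abs_err = abs(mean.mean() - y.mean())                      # one run's |estimate - truth|
print(f"estimated={mean.mean():.3f}  true={y.mean():.3f}  abs error={abs_err:.3f}")
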
Original article

ProEval

Slash GenAI evaluation costs by up to 100x while actively discovering model failure patterns to guide better AI development.

  1. 💰 Cut GenAI eval costs up to 100× — achieve ±1% accuracy with a fraction of the samples
  2. 🔍 Discover failure cases — proactively surface diverse bugs under strict evaluation budgets
  3. 🧠 Transfer learning over benchmarks — pre-trained GP surrogates generalize to new models instantly
  4. 🧩 Easy integration — plugs into existing GenAI evaluation systems across different modalities
  5. Validated on reasoning, safety & classification — GSM8K, MMLU, StrategyQA, Jigsaw, and more

Installation

pip install -r requirements.txt

Quick Start

from proeval import BQPriorSampler, LLMPredictor, DATASET_CONFIGS
from proeval.sampler import load_predictions, extract_model_predictions
import numpy as np

# Estimate a model's error rate with ~1% of the data
sampler = BQPriorSampler(noise_variance=0.3)
result = sampler.sample(predictions="svamp", target_model="gemini25_flash", budget=50)

# Compare against the true error rate
df = load_predictions("svamp")
pred_matrix, model_names = extract_model_predictions(df)
true_mean = np.mean(pred_matrix[:, model_names.index("gemini25_flash")])

print(f"Estimated error rate: {result.estimates[-1]:.4f}")
print(f"MAE: {result.mae(true_mean):.4f}")

Experiments

Here is an example of how to run the experiments:

python -m experiment.exp_performance_estimation --dataset svamp --n-runs 5
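
To repeat the experiment across several datasets, a simple shell loop over the same flags works. Only "svamp" and the flags shown above are confirmed here; the other dataset keys below are guesses, so check DATASET_CONFIGS for the real ones.

for ds in svamp gsm8k mmlu; do   # dataset keys other than svamp are assumptions
  python -m experiment.exp_performance_estimation --dataset "$ds" --n-runs 5
done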

You can find comprehensive experiment details and dataset settings here.

Citation

If this work helps your research or project, please cite our tech report. Thank you!

@article{huang2026proeval,
  title={{{ProEval}: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation}},
  author={Huang, Yizheng and Zeng, Wenjun and Kumaresan, Aditi and Wang, Zi},
  journal={arXiv preprint arXiv:2604.23099 [cs.LG]},
  year={2026},
  url={https://arxiv.org/abs/2604.23099}
}