Evaluating self-improving AI agents is fundamentally harder than evaluating static systems. A benchmark must measure not just current performance but the capacity for improvement over time. This page surveys the evaluation landscape — from HyperAgents' own domain-specific metrics to the broader ecosystem of agent benchmarks — and examines the methodological challenges that make evaluation of self-modifying systems an open research problem.
- HyperAgents evaluates across four domains: Polyglot (pass@1), Paper Review (accuracy), Robotics (task score), and IMO Grading (accuracy + MAE)
- The staged evaluation strategy (small → medium → full) balances cost with statistical validity
- improvement@k (imp@k) is a novel metric for measuring transferable self-improvement capability
- Nine major benchmarks span the agent evaluation landscape from code generation to web navigation to safety
- Self-improving systems face unique evaluation challenges including benchmark gaming and the "progress illusion"
HyperAgents' Own Evaluation Domains
The HyperAgents framework, introduced by Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina at Meta in March 2026 (arXiv:2603.19461), evaluates self-improvement across four deliberately diverse domains. The diversity is essential: self-improving systems must demonstrate that their improvements transfer across tasks, not just optimize for a single benchmark.
Polyglot (Code Generation)
The Polyglot domain uses 225 problems from the Exercism platform spanning six programming languages. The primary metric is pass@1 — whether the generated code passes all test cases on the first attempt. Pass@2 serves as a secondary metric capturing near-miss performance. This domain tests the agent's ability to improve its code generation strategies across multiple language paradigms, from imperative (Python, Java) to functional (Haskell) to systems (Rust). Each language has distinct idioms, type systems, and error patterns, making cross-language improvement non-trivial.
Paper Review (Academic Prediction)
The Paper Review domain evaluates the agent's ability to predict whether an academic paper would be accepted or rejected at a top venue. The metric is accuracy of the binary prediction. This domain is qualitatively different from code generation — it requires reasoning about research quality, novelty, methodology, and presentation. Improvements in this domain demonstrate that the agent can refine its analytical and evaluative capabilities, not just its code output.
Robotics (Reward Design)
For the Robotics domain, the agent generates reward functions for robot locomotion tasks in the Genesis physics simulator. The metric is a task score reflecting how well the robot performs the target behavior (e.g., walking) when trained using the generated reward function. This domain involves multi-step reasoning: understanding the physics of locomotion, translating desired behavior into a mathematical reward signal, and anticipating how a reinforcement learning algorithm will optimize against that reward.
IMO Grading (Mathematical Judging)
The IMO Grading domain tests the agent's ability to score mathematical olympiad solutions on the same rubric used by human graders. Two metrics apply: accuracy (exact score match) and MAE (mean absolute error for partial credit). With 1,000 samples paired with human scores, this domain evaluates "judge" capability — the ability to assess the quality and correctness of mathematical reasoning, which is distinct from the ability to produce mathematical reasoning.
The Staged Evaluation Strategy
Running full evaluations on every candidate modification would be prohibitively expensive. HyperAgents implements a three-stage pipeline that balances computational cost with evaluation rigor.
Stage 1: Small Sample Stage 2: Medium Sample Stage 3: Full Evaluation
(~10 tasks) (~50 tasks) (all tasks)
| | |
v v v
Quick filter Confirmation check Definitive score
(reject clearly (reject borderline (threshold = 0.4)
failed mods) poor mods)
| | |
+-- ~70% rejected +-- ~40% rejected +-- Final ranking
The threshold of 0.4 at the final stage means a modification must demonstrate meaningful improvement (above 40th percentile relative performance) to be accepted into the lineage. This threshold represents a trade-off: too low and the system accepts noisy or marginal improvements that may not transfer; too high and the system becomes too conservative, potentially rejecting modifications that are genuinely beneficial but show high variance on the evaluation set.
Staged evaluation introduces selection bias: modifications that happen to perform well on small samples advance, while modifications that would perform well on the full set but poorly on small samples are discarded. This is an accepted trade-off for computational efficiency, but it means the system's evolutionary path is partly shaped by the random sample draws.
The Broader Benchmark Landscape
To contextualize HyperAgents' evaluation approach, the following table surveys the major benchmarks used across the agent evaluation ecosystem as of early 2026.
| Benchmark | Tasks | Metric | Key Strength | Key Limitation |
|---|---|---|---|---|
| SWE-bench | GitHub issues → patches | Resolved % | Real-world engineering tasks | Complex setup, Python-only |
| Polyglot / Aider | 225 Exercism, 6 languages | pass@1, pass@2 | Multilingual, clean eval | Contamination from pretraining |
| GAIA (ICLR 2024) | 450+ real-world problems | Short-answer accuracy | Web + file processing | Humans >> GPT-4 (still) |
| WebArena | 4 web environments | End-to-end success rate | Self-hostable, high fidelity | Limited to web tasks |
| Mind2Web | 2,350 tasks, 137 websites | Trajectory/action matching | Cross-site generalization | Static snapshots, not live |
| OSWorld | 369 computer tasks | Execution-based success | Most realistic (full desktop) | Expensive, slow |
| AgentBench | 8 environment types | Cross-environment scores | Panoramic coverage | Hard to summarize in one score |
| IMO-GradingBench | 1,000 scored samples | Accuracy + MAE | Tests "judge" capability | Narrow domain (math) |
| Agent-SafetyBench | 349 envs, 2,000 test cases | 8 risk categories | Safety as first-class metric | New, limited adoption |
SWE-bench: Real-World Software Engineering
SWE-bench presents agents with real GitHub issues from popular open-source Python projects and measures whether the agent can produce a patch that resolves the issue and passes the project's test suite. The metric is resolved percentage. Since its introduction, the state-of-the-art has risen from approximately 20% to over 50% resolved, driven by advances in both model capability and agent scaffolding. SWE-bench is the closest benchmark to real-world software engineering, but its complexity makes it expensive to run and its Python-only focus limits generalizability. The DGM-H architecture described in arXiv:2603.19461 is well-suited to SWE-bench because the agent can improve its patching strategies across generations.
GAIA: General AI Assistants
Presented at ICLR 2024, GAIA evaluates general-purpose AI assistants on 450+ real-world problems requiring web browsing, file processing, and multi-step reasoning. The metric is short-answer accuracy — the answer is either correct or incorrect, with no partial credit. The landmark finding at launch was that human evaluators significantly outperformed GPT-4, establishing a clear capability gap. GAIA's emphasis on multi-modal, multi-tool problem solving makes it relevant for evaluating whether HyperAgents' cross-domain transfer actually produces more capable general agents.
WebArena and Mind2Web: Web Navigation
WebArena provides four high-fidelity self-hostable web environments (e-commerce, forums, developer tools, CMS) and measures end-to-end task success rate. Mind2Web takes a broader approach with 2,350 tasks across 137 real websites in 31 domains, measuring trajectory and action matching against human demonstrations. Together they represent the web navigation evaluation landscape: WebArena for depth and fidelity, Mind2Web for breadth and cross-site generalization. The Mind2Web follow-up work included an important warning about the "progress illusion" — where improvements on the benchmark do not necessarily translate to improved performance in deployment.
OSWorld: Full Computer Environments
OSWorld is the most realistic agent benchmark, presenting 369 tasks in full desktop computer environments with multimodal inputs (screenshots, text). The metric is execution-based success — did the agent actually accomplish the task, verified by checking the computer's state afterward. This realism comes at a cost: evaluations are slow, expensive, and difficult to reproduce across different hardware configurations.
AgentBench: Panoramic Evaluation
AgentBench evaluates agents across 8 diverse environment types, providing cross-environment failure analysis. Its panoramic scope makes it valuable for understanding where agents fail, but the diversity makes it difficult to summarize performance in a single score. The cross-environment analysis is particularly relevant for HyperAgents' claim of cross-domain transfer — AgentBench's methodology for comparing performance across environments could inform how transfer is measured.
Agent-SafetyBench: Safety as a First-Class Metric
With 349 environments and 2,000 test cases spanning 8 risk categories, Agent-SafetyBench is the first comprehensive safety evaluation for LLM-based agents. Its central finding — that defensive prompts alone are insufficient — has direct implications for self-improving systems like HyperAgents. See the HyperAgents safety and governance page for a detailed discussion of Agent-SafetyBench's relevance to self-modifying systems.
improvement@k: A Novel Metric for Self-Improvement
Traditional benchmarks measure static performance: how well does the agent perform right now? For self-improving systems, the more important question is: how effectively does the agent improve over time?
The HyperAgents framework introduces improvement@k (imp@k) as a metric specifically designed to capture transferable self-improvement capability. Rather than measuring absolute performance on a fixed benchmark, imp@k measures the magnitude of improvement achieved over k generations of self-modification. Critically, imp@k evaluates improvement on held-out tasks, not just the tasks used for self-modification selection, ensuring that measured improvements reflect genuine capability gains rather than overfitting to the evaluation set.
# Conceptual definition of improvement@k
def improvement_at_k(agent, domain, k):
# Measure baseline performance on held-out test set
baseline = evaluate(agent.generation_0, domain.test_set)
# Run k generations of self-improvement
evolved = agent.self_improve(generations=k, domain=domain)
# Measure post-improvement on SAME held-out test set
improved = evaluate(evolved, domain.test_set)
return improved - baseline
The imp@k metric is valuable because it captures the process of improvement, not just the outcome. Two systems could achieve the same final performance, but the one that reaches it in fewer generations (lower k) demonstrates more effective self-improvement. Additionally, imp@k across different domains reveals whether the system achieves genuine cross-domain transfer or whether improvement in one domain comes at the expense of another.
Methodological Challenges
Evaluating self-improving systems introduces methodological challenges that do not exist for static agent evaluation.
Statistical Validity of Staged Evaluation
The staged evaluation strategy described earlier introduces selection bias. Modifications are filtered based on small samples before being evaluated on the full set. This means the system's evolutionary trajectory is partly determined by which modifications happen to perform well on small random samples — a form of sampling noise that compounds over generations. Rigorous evaluation should report confidence intervals that account for this selection effect, but this is computationally expensive because it requires running multiple independent evolutionary trajectories.
Benchmark Gaming in Self-Improving Systems
When a system can modify its own code, the boundary between "legitimate optimization" and "gaming the benchmark" becomes blurry. A self-improving agent might discover that a particular output format consistently receives higher scores from the evaluation harness, not because it reflects better reasoning but because it happens to align with parsing assumptions. In the Polyglot domain, an agent might learn to generate code that passes test cases through hardcoded outputs rather than general-purpose algorithms. The reset protection on the domains/ directory prevents the agent from modifying the evaluation harness directly, but it cannot prevent indirect gaming.
The "Progress Illusion"
Follow-up work on Mind2Web highlighted a critical concern: improvements on benchmark scores can be illusory, reflecting overfitting to the benchmark distribution rather than genuine capability improvement. For self-improving systems, this risk is amplified because the system explicitly optimizes for benchmark performance across generations. A system might show steadily increasing imp@k values while its real-world capability remains flat or even degrades on out-of-distribution tasks. Addressing this requires evaluation on tasks that are withheld not just from training but from the entire self-improvement loop.
Cost and Reproducibility Barriers
Running HyperAgents' evaluation pipeline requires substantial compute: GPU resources for the Genesis simulation, multiple LLM API calls per generation across OpenAI, Anthropic, and Google endpoints, and multi-hour runtimes for full evaluation cycles. This creates a reproducibility barrier — independent researchers may not be able to verify results without equivalent compute budgets. The use of commercial API models (GPT-4, Claude, Gemini) adds another reproducibility challenge: model behavior changes across API versions, making exact reproduction of results time-sensitive.
Evaluating Self-Improving Systems Differently
The fundamental insight is that self-improving systems need fundamentally different evaluation approaches than static agents. A static agent is a fixed function: evaluate it on a test set and you have characterized its capability. A self-improving agent is a trajectory: you need to characterize not just where it is but where it is going and how fast it gets there.
Metrics that matter for self-improving systems include the rate of improvement (how quickly does performance increase per generation?), transfer breadth (how many domains benefit from self-improvement in one domain?), ceiling behavior (does the system plateau, and at what level?), stability (does performance ever regress across generations?), and robustness (do improvements persist when evaluated on out-of-distribution tasks?). The imp@k metric captures some of these dimensions, but a complete evaluation framework for self-improving systems remains an open research challenge.
Automated Evaluation vs. Real-World Validity
There is an inherent tension between evaluation that is scalable (automated, fast, cheap) and evaluation that is valid (reflecting real-world usefulness). Automated benchmarks like Polyglot pass@1 can be run in seconds per task, enabling evaluation at scale. But they measure narrow capabilities in controlled settings. Real-world evaluation — like having the agent actually fix bugs in production codebases or generate reward functions for physical robots — is valid but expensive and slow.
HyperAgents' multi-domain evaluation strategy represents a pragmatic middle ground: the domains are automated enough for rapid evaluation but diverse enough to resist narrow optimization. The Polyglot domain tests code generation, Paper Review tests analytical reasoning, Robotics tests physical simulation understanding, and IMO Grading tests mathematical judgment. An agent that improves across all four is more likely to have developed genuinely transferable capabilities than one that improves on only one.
As the field of self-improving agents matures, evaluation methodologies will need to evolve alongside the systems they measure. The work of Zhang, Clune, Jiang and collaborators at Meta on the HyperAgents framework provides a foundation, but the evaluation landscape remains one of the most active and unsettled areas of agent research.
← Back to HyperAgents overview
Frequently Asked Questions
What benchmarks does HyperAgents use for evaluation?
HyperAgents evaluates across four domains: Polyglot (225 Exercism programming problems across 6 languages, measured by pass@1), Paper Review (academic paper accept/reject prediction, measured by accuracy), Robotics (Genesis simulation task scores for robot locomotion), and IMO Grading (mathematical olympiad solution scoring, measured by accuracy and MAE). Each domain uses a staged evaluation strategy progressing from small samples to full evaluation sets.
What is improvement@k and why does it matter?
Improvement@k (imp@k) is a novel metric introduced by the HyperAgents framework for measuring transferable improvement capability. Unlike traditional metrics that measure absolute performance on a benchmark, imp@k measures whether the system's self-modifications produce genuine, transferable improvements across k generations. This captures the self-improving property directly rather than just snapshot performance.
How does the staged evaluation strategy work in HyperAgents?
HyperAgents uses a three-stage evaluation pipeline: first a small sample (~10 tasks) for rapid feedback, then a medium sample (~50 tasks) for confirmation, and finally a full evaluation set if the modification passes a threshold of 0.4. This reduces computational cost by filtering out clearly unsuccessful modifications early while reserving expensive full evaluations for promising candidates.
What is the difference between SWE-bench and HyperAgents' Polyglot domain?
SWE-bench evaluates agents on real GitHub issues from popular Python repositories, measuring whether the agent can generate patches that pass the project's test suite. HyperAgents' Polyglot domain uses 225 Exercism problems across 6 programming languages, measuring pass@1 code generation accuracy. SWE-bench tests real-world software engineering in complex codebases; Polyglot tests multilingual algorithmic problem-solving with cleaner evaluation but higher contamination risk from pretraining data.
Why is evaluating self-improving systems harder than evaluating static agents?
Self-improving systems require evaluation not just of current performance but of the improvement trajectory over time. They face unique challenges: benchmark gaming (optimizing for the metric rather than the capability), selection bias from staged evaluation, the "progress illusion" where apparent improvements do not transfer to unseen tasks, and the fundamental problem that the system can potentially modify its own evaluation criteria through its self-improvement mechanism.