One of the most striking results from the HyperAgents paper is the demonstration that self-improvement capability can transfer across domains. A meta-agent trained on paper review and robotics achieves an imp@50 of 0.630 on unseen IMO grading, while a hand-crafted DGM meta-mechanism scores 0.0. This proves that the "ability to improve" is domain-general, not domain-specific.
- imp@k metric separates task competence from self-improvement capability
- Transferred DGM-H meta-agent achieves imp@50 = 0.630 on unseen domain
- Hand-crafted DGM meta achieves imp@50 = 0.0 when transferred
- Domain-general improvement emerges from editable meta-mechanisms
- Open-ended archive enables transferable meta-strategies across tasks
The imp@k Metric: Measuring the Ability to Improve
Traditional AI evaluation asks a straightforward question: how well does this agent perform on a given task? Accuracy, F1 score, pass rate, and similar metrics answer that question. But the HyperAgents framework, introduced by Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina at Meta Research in March 2026 (arXiv:2603.19461), poses a fundamentally different question: how good is this agent at getting better?
This is where the imp@k (improvement at k) metric comes in. The metric is defined as the performance gain achieved by a fixed meta-agent after exactly k modification steps, relative to the initial agent. By "fixed" meta-agent, the researchers mean that the meta-mechanism is frozen at evaluation time and no longer evolving. The only thing changing is the task-level agent, under the guidance of the frozen meta-agent.
This design cleanly separates two distinct capabilities. First, there is task competence, which asks how well the current version of the agent solves the task at hand. Second, there is improvement competence, which asks how effectively the meta-mechanism drives performance gains across successive modification steps. A meta-agent might start with a relatively weak task agent but demonstrate exceptional improvement competence by rapidly lifting the task agent's performance over 50 steps.
The imp@k metric is evaluated at fixed intervals, typically at k = 10, 20, 30, 40, and 50 steps. Each step represents one full modification cycle: the meta-agent analyzes current performance, generates improvement instructions, modifies the task agent's code, and runs validation. The resulting performance curve shows how rapidly and how far the meta-agent can push task performance from its initial baseline.
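The metric's definition can be sketched in a few lines. This is an illustrative reconstruction, not the paper's released code: `modify` stands in for one full modification cycle by the frozen meta-agent, and `evaluate` for the domain's validation metric.

```python
# Illustrative sketch of imp@k evaluation. `modify` and `evaluate` are
# assumed stand-ins: one frozen-meta-agent modification cycle, and the
# domain's validation metric (a normalized score), respectively.

def imp_at_k(agent, modify, evaluate, checkpoints=(10, 20, 30, 40, 50)):
    """Return {k: score_after_k_steps - baseline} at each checkpoint."""
    baseline = evaluate(agent)
    gains = {}
    for step in range(1, max(checkpoints) + 1):
        agent = modify(agent)  # one full modification cycle
        if step in checkpoints:
            gains[step] = evaluate(agent) - baseline
    return gains
```

With a toy agent whose score rises by 0.01 per step, this yields a gain of about 0.5 at k = 50, regardless of where the baseline started. That baseline-independence is what makes the metric comparable across agents with very different starting performance.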
Why imp@k Matters for Transfer
Without a metric like imp@k, it would be impossible to evaluate cross-domain transfer of improvement capability. Standard task metrics are domain-specific: accuracy on paper review counts correctly predicted accept/reject decisions, while success rate in robotics measures whether the simulated robot completes its task. These metrics are incomparable across domains. But imp@k normalizes across all of them by measuring relative improvement, which makes it possible to ask whether improvement capability learned in one domain transfers to another.
Transfer Experiment Design
The transfer experiment described in the HyperAgents paper follows a three-phase protocol designed to establish whether improvement capability is domain-general or domain-specific.
Phase 1: Multi-Domain Training
The DGM-H hyperagent is trained across two source domains simultaneously: Paper Review (predicting accept or reject decisions on academic papers, measured by accuracy) and Robotics (designing reward functions for simulated robots using the Genesis physics simulator, measured by success rate). During training, the meta-agent is free to modify any part of the codebase, including its own meta-mechanisms. The open-ended archive maintains a population of diverse agent variants, and selection pressure drives improvement in both domains.
Phase 2: Meta-Agent Freezing
After training, the meta-agent's modification procedures are frozen. This means the meta-level code that generates improvement instructions, analyzes performance, and decides what to modify is locked in place. It can no longer evolve. This freezing step is critical because it ensures that any improvement observed in the target domain comes from the meta-strategies learned during training, not from continued adaptation to the new domain.
Phase 3: Target Domain Evaluation
The frozen meta-agent is then applied to a completely unseen domain: IMO Grading, which involves scoring olympiad-level mathematics solutions. This domain has different input formats (mathematical proofs rather than code patches or academic papers), different output requirements (numerical scores rather than binary decisions or reward functions), and different evaluation criteria. The meta-agent has never seen this domain during training.
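The three phases above can be summarized as a short sketch. Everything here is a hypothetical stand-in for the paper's pipeline: `MetaAgent`, `learn_from`, and the string encoding of strategies are illustrative, not the released code's API.

```python
from dataclasses import dataclass, field

# Hedged sketch of the three-phase transfer protocol. `MetaAgent` and its
# methods are illustrative stand-ins, not the paper's actual implementation.

@dataclass
class MetaAgent:
    strategies: list = field(default_factory=list)
    frozen: bool = False

    def learn_from(self, domain):
        # Meta-level evolution happens only while unfrozen (Phase 1)
        if not self.frozen:
            self.strategies.append(f"meta_strategy:{domain}")

def transfer_protocol(source_domains, target_domain, steps=50):
    meta = MetaAgent()
    # Phase 1: multi-domain training across the sources
    for _ in range(steps):
        for domain in source_domains:
            meta.learn_from(domain)
    # Phase 2: freeze the evolved meta-mechanism
    meta.frozen = True
    # Phase 3: apply to the unseen target; only the task agent may change now
    meta.learn_from(target_domain)  # no-op: the meta-level is frozen
    return meta
```

The key property the sketch preserves is that no target-domain information can leak into the meta-level: after Phase 2, the strategy list never changes again.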
Results: Domain-General Improvement
The results are unambiguous and represent one of the strongest findings in the HyperAgents paper.
| Meta-Mechanism | Training Domains | Target Domain | imp@50 |
|---|---|---|---|
| Hand-crafted DGM meta | Paper Review + Robotics | IMO Grading | 0.0 |
| DGM-H hyperagent meta | Paper Review + Robotics | IMO Grading | 0.630 |
The hand-crafted DGM meta-mechanism, despite being carefully designed by human engineers, achieves zero improvement when transferred to the IMO grading domain. After 50 modification steps, the task agent is no better than when it started. In stark contrast, the DGM-H hyperagent meta-mechanism, which evolved its own improvement strategies, achieves an imp@50 of 0.630. This means the transferred meta-agent drives substantial performance gains on a domain it has never encountered.
Interpreting the Gap
The gap between 0.0 and 0.630 is not merely a quantitative difference. It represents a qualitative shift in what the meta-mechanism has learned. The hand-crafted DGM meta has learned domain-specific heuristics for paper review and robotics. It knows how to suggest improvements to code that processes academic papers and how to refine reward functions for robotic simulation. These strategies are brittle and domain-locked.
The DGM-H meta-mechanism, by contrast, has evolved domain-general improvement strategies. These strategies likely include general-purpose patterns such as structured error analysis, iterative hypothesis testing, incremental code organization, and systematic exploration of the solution space. These patterns work regardless of whether the task involves grading math proofs, reviewing papers, or designing robot rewards.
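One way to picture such a domain-general strategy is a validation-gated improvement loop: analyze errors, propose a hypothesis, apply it, and keep the change only if validation improves. The function below is a hypothetical sketch of that pattern; none of these names come from the paper, and each callable would be supplied per domain.

```python
# Hedged sketch of a domain-general improvement step: structured error
# analysis, hypothesis generation, and a validation gate. All callables
# are assumed stand-ins configured per domain.

def improvement_step(agent, analyze, propose, apply_patch, validate):
    errors = analyze(agent)           # structured error analysis
    hypothesis = propose(errors)      # what to change, and why
    candidate = apply_patch(agent, hypothesis)
    # Keep the modification only if validation performance improves
    return candidate if validate(candidate) > validate(agent) else agent
```

Because nothing in the loop references paper review, robotics, or math grading, the same control flow applies unchanged across domains, which is the intuition behind the transfer result.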
DGM-H overturns the assumption that task performance and self-improvement capability must be domain-aligned. The ability to improve is a transferable skill, not a domain-specific artifact.
Multi-Domain Evaluation Design
The broader evaluation framework in the HyperAgents paper spans four domains, each chosen to stress-test different aspects of agent capability. Jenny Zhang et al. selected these domains specifically because they span different input modalities, output formats, and evaluation metrics.
| Domain | Task | Metric | Train / Val / Test | Input Type |
|---|---|---|---|---|
| Polyglot | Code patching across languages | pass@1 | 300 / 100 / 100 | Code diffs |
| Paper Review | Accept/reject prediction | accuracy | 200 / 50 / 100 | PDF papers |
| Robotics | Reward function design | success rate | 12 / 4 / 4 | Task descriptions |
| IMO Grading | Olympiad math scoring | grading acc. | 60 / 20 / 20 | Math solutions |
This diversity is intentional. Polyglot tests code-level reasoning with structured inputs. Paper Review requires understanding complex academic arguments from unstructured PDF documents. Robotics involves creative design of reward signals for physics simulation. IMO Grading requires deep mathematical reasoning about proof correctness and partial credit. No single set of domain-specific strategies could address all four, which is precisely the point.
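For reference, the four-domain setup can be captured as a simple config structure, transcribed directly from the table above (the dictionary keys and variable names are illustrative, not from the paper's code):

```python
# The four evaluation domains, their metrics, and train/val/test splits,
# transcribed from the table above. Key names are illustrative.
DOMAINS = {
    "polyglot":     {"metric": "pass@1",           "splits": (300, 100, 100)},
    "paper_review": {"metric": "accuracy",         "splits": (200, 50, 100)},
    "robotics":     {"metric": "success_rate",     "splits": (12, 4, 4)},
    "imo_grading":  {"metric": "grading_accuracy", "splits": (60, 20, 20)},
}

# The transfer experiment holds IMO Grading out as the unseen target
SOURCE_DOMAINS = ["paper_review", "robotics"]
TARGET_DOMAIN = "imo_grading"
```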
Staged Evaluation for Cost Control
Running self-improvement experiments at scale is expensive. Each modification step requires LLM inference for the meta-agent's analysis, LLM inference for code generation, and execution of the modified agent on validation data. Over 50 steps across multiple domains, costs accumulate rapidly. The researchers address this with a staged evaluation protocol.
In staged evaluation, initial generations are evaluated on a smaller subset of the validation data. Only agent variants that show promise in early stages are promoted to full evaluation. This reduces the total number of expensive full evaluations while maintaining the ability to discover strong variants. However, staged evaluation introduces a methodological trade-off: promising variants that start slowly might be pruned too early, potentially biasing the search toward strategies that show immediate gains over slow-building approaches that would eventually outperform them.
The researchers acknowledge this trade-off explicitly and argue that the cost savings (roughly a 3-4x reduction in total compute) justify the risk. They also note that the strongest meta-mechanisms tend to produce improvements visible within the first 10 steps, making early pruning relatively safe in practice.
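A minimal sketch of the staged protocol, under the assumption that a cheap subset evaluation screens all variants before a costly full evaluation; the promotion fraction is an illustrative choice, not a value from the paper:

```python
# Hedged sketch of staged evaluation: screen every variant on a cheap
# validation subset, then run the expensive full evaluation only on the
# top-scoring fraction. `promote_frac` is an illustrative parameter.

def staged_evaluate(variants, cheap_eval, full_eval, promote_frac=0.25):
    ranked = sorted(variants, key=cheap_eval, reverse=True)
    n_promote = max(1, int(len(ranked) * promote_frac))
    finalists = ranked[:n_promote]
    # Only finalists incur the cost of a full validation pass
    return {v: full_eval(v) for v in finalists}
```

The early-pruning risk discussed above shows up directly in this sketch: a variant whose `cheap_eval` score lags its eventual `full_eval` score never reaches the finalist set.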
The Role of Open-Ended Archives
A critical enabler of cross-domain transfer is the open-ended archive, a core component of the HyperAgents architecture. The archive maintains a diverse population of agent variants, each representing a different approach to both the task and the meta-level improvement strategy. Unlike a simple "keep the best" approach, the archive explicitly maintains diversity through novelty-based selection pressure.
During multi-domain training, the archive accumulates not just task-level innovations (like a better paper parsing strategy or a more effective reward function template) but also meta-level innovations (like a more systematic approach to error diagnosis or a better strategy for prioritizing which parts of the codebase to modify). These meta-level innovations are the ones that transfer across domains.
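Novelty-based selection can be sketched as an archive that admits a variant only when its behavior descriptor is sufficiently far from every existing member. The numeric descriptor and the distance threshold are illustrative assumptions; the paper's archive is over agent codebases, not vectors.

```python
import math

# Hedged sketch of novelty-based archive admission. `behavior` is an
# assumed numeric descriptor of a variant's strategy; the threshold
# is an illustrative value.

class NoveltyArchive:
    def __init__(self, threshold=0.5):
        self.members = []        # list of (behavior, variant) pairs
        self.threshold = threshold

    def maybe_add(self, behavior, variant):
        """Admit the variant only if it is novel relative to the archive."""
        if all(math.dist(behavior, b) >= self.threshold
               for b, _ in self.members):
            self.members.append((behavior, variant))
            return True
        return False
```

Because admission rewards being different rather than being best, the archive retains stepping stones: variants that are weak now but occupy regions of strategy space that later improvements can build on.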
The Stepping Stones Principle
The open-ended archive enables what the researchers call the "stepping stones" principle. Improvements in one domain create knowledge artifacts, such as code patterns, analysis strategies, and organizational techniques, that serve as stepping stones for improvement in other domains. A meta-strategy that evolves to handle the complexity of paper review (processing long documents, multi-criteria evaluation, structured output formatting) may produce general-purpose techniques for information extraction and structured reasoning that also benefit IMO grading.
This stepping stones dynamic is why multi-domain training produces better transfer than single-domain training. The diversity of challenges forces the meta-mechanism to develop more general strategies rather than overfitting to the idiosyncrasies of a single domain.
Connection to Transfer Learning Research
The HyperAgents cross-domain transfer result connects to a long lineage of research in transfer learning, but with a crucial distinction. Traditional transfer learning transfers learned representations or model parameters from a source task to a target task. The transferred knowledge is about the data distributions, feature structures, and task-specific patterns. Transfer in the HyperAgents context operates at a higher level of abstraction: the transferred knowledge is about how to improve, not about the tasks themselves.
This is analogous to the difference between transferring knowledge of French vocabulary to help learn Spanish (traditional transfer learning) and transferring knowledge of how to study languages effectively to help learn any new language (meta-transfer learning). The second kind of transfer is more powerful and more general because it operates on the learning process itself rather than on the learned content.
The connection to meta-learning research (Finn et al., 2017; Hospedales et al., 2021) is clear but also distinct. Meta-learning typically optimizes model initialization or learning rules for fast adaptation. HyperAgents goes further by allowing the entire improvement procedure, including the code that implements analysis, decision-making, and modification, to be evolved and transferred. This provides a richer and more expressive space for meta-strategies than parameter-space meta-learning alone.
Limitations and Open Questions
The researchers are transparent about the limitations of the current transfer results. The primary limitation is scope: transfer was demonstrated from two source domains (paper review and robotics) to one target domain (IMO grading). Several open questions remain.
- Does transfer scale with source domain count? Training on three source domains rather than two might produce stronger transfer, but it might also dilute the meta-strategies if the domains are too diverse. The optimal number and diversity of source domains remains unknown.
- How far can transfer stretch? Paper review, robotics, and IMO grading are all cognitive tasks that benefit from structured reasoning. Would transfer work to domains that require fundamentally different capabilities, such as real-time control or social interaction?
- What is the minimum training budget for transfer? The experiments used 50 modification steps per domain. Whether meaningful meta-strategies emerge with fewer steps, and how many source-domain steps are needed for effective transfer, are practical questions for deployment.
- Are the transferred strategies interpretable? The paper documents emergent engineering capabilities but does not provide a systematic analysis of which specific meta-strategies transfer most effectively across domains.
Despite these limitations, the core finding stands: the DGM-H architecture enables something that hand-crafted meta-mechanisms cannot achieve. The ability to improve is demonstrably transferable when the meta-mechanism is itself evolved rather than engineered by hand.
The cross-domain transfer experiments are described in Section 5 of the HyperAgents paper (arXiv:2603.19461) by Jenny Zhang et al. at Meta Research, published March 2026. The source code is available on GitHub under the CC BY-NC-SA 4.0 license.
Frequently Asked Questions
What is the imp@k metric in HyperAgents?
The imp@k (improvement at k) metric measures the performance gain achieved by a fixed meta-agent after exactly k modification steps, relative to the initial agent. It separates "how good at the task" from "how good at getting better," providing a standardized way to evaluate self-improvement capability independently of baseline task performance. It is typically evaluated at k = 10, 20, 30, 40, and 50.
How does cross-domain transfer work in HyperAgents?
Cross-domain transfer works by training the meta-mechanisms on source domains (paper review and robotics), then freezing the evolved meta-agent and applying it to an unseen target domain (IMO grading). The DGM-H hyperagent achieves imp@50 = 0.630 on the target domain, compared to 0.0 for hand-crafted DGM meta-mechanisms, proving that the ability to improve is domain-general when the meta-mechanism is itself evolved.
Why does hand-crafted DGM fail at cross-domain transfer?
Hand-crafted DGM meta-mechanisms encode domain-specific assumptions about what constitutes improvement. When transferred to an unseen domain, these fixed strategies achieve imp@50 = 0.0 because they cannot generalize. DGM-H overcomes this by allowing the meta-mechanism to evolve general-purpose improvement strategies through open-ended exploration, which naturally generalizes across task boundaries.
What domains were used in the HyperAgents evaluation?
The full evaluation spans four domains: Polyglot (code patching, pass@1), Paper Review (accept/reject prediction, accuracy), Robotics (reward function design for the Genesis simulator, success rate), and IMO Grading (olympiad math solution scoring, grading accuracy). Each domain features different input types, output formats, and evaluation metrics. The transfer experiment specifically used Paper Review and Robotics as source domains, with IMO Grading as the held-out target.
What are the known limitations of HyperAgents cross-domain transfer?
The primary limitation is scope: transfer was demonstrated from two source domains to one target domain. Broader transfer across more diverse domains remains to be validated. Additional open questions include optimal source domain count and diversity, the minimum training budget required for effective transfer, and whether the transferred meta-strategies are interpretable. The staged evaluation protocol also introduces trade-offs around early pruning of slow-starting strategies.