The paper review domain in HyperAgents tests whether self-improving AI agents can predict accept/reject decisions for real academic papers. Using historical decisions from ICLR and NeurIPS, this domain produced one of the most dramatic improvement stories in the HyperAgents research: an initial agent that scored 0% due to parsing failures evolved into a structured decision pipeline with significant test accuracy, entirely through automated self-improvement.
- Real conference data — ICLR 2024/2025 and NeurIPS 2023/2024 acceptance decisions
- 300 papers — 100 training, 100 validation, 100 test samples
- Default model — gpt-4o for generating paper review predictions
- 0% to functional — Agent overcame format parsing failures through self-improvement
- Emergent engineering — System autonomously developed multi-criteria scoring and decision logic
Domain Overview: Paper Text to Accept/Reject
The paper review domain within the HyperAgents framework presents a challenging and somewhat unconventional task for AI self-improvement. Given the full text of an academic paper, the agent must predict whether the paper was accepted or rejected at a top machine learning conference. The evaluation metric is straightforward: accuracy against the historical review decision.
This domain was designed by Jenny Zhang and colleagues at Meta to test self-improvement in a setting where the feedback signal is clear (correct or incorrect prediction) but the underlying task is inherently subjective and multifaceted. Unlike code generation where tests provide objective ground truth, paper acceptance decisions reflect the collective judgment of human reviewers who weigh novelty, clarity, significance, experimental rigor, and many other factors. The research paper (arXiv:2603.19461, March 2026) presents this domain as a test of whether self-improvement can work even when the target function is complex and noisy.
Data Source and Scale
The dataset draws from real acceptance decisions at four major conference venues: ICLR 2024, ICLR 2025, NeurIPS 2023, and NeurIPS 2024. These are among the most prestigious machine learning conferences, with rigorous multi-reviewer processes and acceptance rates typically between 20% and 30%. The data is stored in a dataset.csv file approximately 50MB in size, containing the full paper text for each sample.
The dataset is split into three partitions:
- Training set (100 papers) — Used during the self-improvement loop for the meta agent to evaluate modifications
- Validation set (100 papers) — Used for intermediate evaluation and early stopping decisions
- Test set (100 papers) — Held out for final evaluation of the self-improved agent
The relatively small dataset size (compared to, say, the polyglot domain's 225 tasks) reflects the computational cost of processing full paper text through large language models. Each evaluation requires sending a complete paper, often 8-15 pages of dense technical content, to the model for analysis.
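The three-way split described above can be sketched in a few lines. The row schema here ("paper_text", "decision") is an assumption for illustration; the repository's exact dataset.csv columns are not reproduced in this article:

```python
import random

def split_dataset(rows, seed=0):
    """Shuffle and partition 300 samples into 100/100/100 train/val/test.

    `rows` is a list of dicts with at least 'paper_text' and 'decision'
    keys; the actual HyperAgents schema is not documented here, so this
    is an illustrative sketch rather than the project's own code.
    """
    rng = random.Random(seed)
    rows = rows[:]          # copy so the caller's list is untouched
    rng.shuffle(rows)       # deterministic permutation for a fixed seed
    return rows[:100], rows[100:200], rows[200:300]
```

A fixed seed keeps the partitions stable across runs, which matters when the meta agent compares candidate modifications against the same training set.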
Input Format and Pipeline
The input format, defined in utils.py, structures each evaluation instance as a dictionary:
```python
# Input format for paper review evaluation
{
    "domain": "paper_review",
    "paper_text": row['paper_text'],  # Full paper content
}
```

The agent receives the complete paper text and must return a binary accept/reject prediction. The default evaluation model is gpt-4o, chosen for its strong performance on long-context reasoning tasks. The agent's code determines how this paper text is processed: what prompting strategy is used, how the model's output is parsed, and how the final decision is made.
Subset curation is handled by curate_subsets.py, which creates evaluation subsets for staged evaluation. This is particularly important in the paper review domain because full evaluations are expensive, both in terms of API costs (processing 100 full papers through gpt-4o) and wall-clock time.
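The actual selection logic in curate_subsets.py is not shown in the source material, but the idea, drawing a small fixed subset for cheap pre-evaluation, can be sketched as:

```python
import random

def curate_subset(tasks, k=10, seed=0):
    """Pick a small, fixed subset of evaluation tasks for cheap pre-evaluation.

    A seeded random sample stands in for curate_subsets.py's real strategy,
    which may instead stratify by venue, year, or label balance.
    """
    rng = random.Random(seed)
    return rng.sample(tasks, k)
```

Seeding makes the subset identical across self-improvement iterations, so pre-evaluation scores from different candidate agents remain comparable.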
Staged Evaluation for Cost Control
The paper review domain implements a staged evaluation strategy to manage computational costs. Before running the full 100-task evaluation, the system performs a 10-task pre-evaluation. The agent must achieve at least one successful prediction in this pre-evaluation to justify expanding to the full 100-task test set.
This gating mechanism serves two purposes. First, it prevents wasting expensive API calls on agents that are fundamentally broken (as the initial agent was, scoring 0% due to parsing failures). Second, it provides rapid feedback to the meta agent during the self-improvement loop, allowing it to quickly discard modifications that break basic functionality before investing in full evaluation.
The staged approach is especially valuable during early iterations of self-improvement, when the meta agent is still exploring the modification space and many candidate changes may be counterproductive. By failing fast on the 10-task subset, the system avoids the computational cost of running 100 full paper evaluations on agents that cannot even produce valid output.
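A minimal sketch of the gating mechanism described above (function and field names are illustrative, not taken from the HyperAgents codebase):

```python
def staged_evaluate(agent, pre_tasks, full_tasks, min_successes=1):
    """Run the cheap 10-task pre-evaluation first; only escalate to the
    full 100-task evaluation if the agent produces at least one correct
    prediction. `agent` maps a task dict to a predicted label.
    """
    pre_hits = sum(1 for t in pre_tasks if agent(t) == t["label"])
    if pre_hits < min_successes:
        # Fail fast: a fundamentally broken agent never reaches the
        # expensive full evaluation.
        return {"stage": "pre", "accuracy": pre_hits / len(pre_tasks)}
    full_hits = sum(1 for t in full_tasks if agent(t) == t["label"])
    return {"stage": "full", "accuracy": full_hits / len(full_tasks)}
```

With this gate, an agent like the initial 0%-scoring one costs only 10 API calls per candidate modification instead of 100.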
The Dramatic Improvement Story: From 0% to Functional
Perhaps the most compelling narrative from the HyperAgents paper review domain is the improvement trajectory of the initial agent. When the task agent was first deployed with a basic prompt-and-parse approach, it scored 0% accuracy: not because the underlying language model could not reason about paper quality, but because the agent's code failed to correctly parse the model's output into a valid accept/reject decision.
This is a crucial observation. The bottleneck was not model intelligence but engineering quality: output formatting, parsing robustness, and decision extraction. Through the DGM-H self-improvement loop, the meta agent identified this failure mode and evolved the task agent's code to include:
- Structured output formatting — Explicit instructions in the prompt requiring the model to output decisions in a parseable format
- Robust parsing logic — Multiple fallback parsing strategies to handle variations in model output
- Multi-criteria scoring — Instead of a single holistic judgment, the evolved agent scores papers on multiple dimensions (novelty, methodology, clarity, significance) and aggregates these into a final decision
- Decision threshold calibration — The system learned to set appropriate thresholds for the accept/reject boundary based on training data performance
This evolution happened entirely through automated self-improvement, without any human engineer manually debugging the parsing code or designing the multi-criteria scoring system. It represents what the research team calls emergent engineering: the self-improving system independently discovered and implemented software engineering best practices.
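The parsing and scoring behaviors listed above can be illustrated with a sketch. The evolved agent's actual code is not reproduced in the source material, so the structured format ("DECISION: ..."), the per-criterion score pattern, and the 6/10 acceptance threshold are all assumptions standing in for whatever the system actually learned:

```python
import re

# Dimensions named in the text; score patterns and threshold are assumed.
CRITERIA = ["novelty", "methodology", "clarity", "significance"]

def parse_decision(model_output: str):
    """Extract an accept/reject decision with layered fallbacks."""
    # 1. Strict structured line, e.g. "DECISION: ACCEPT"
    m = re.search(r"DECISION:\s*(ACCEPT|REJECT)", model_output, re.IGNORECASE)
    if m:
        return m.group(1).lower()
    # 2. Per-criterion scores, e.g. "novelty: 7/10", aggregated by mean
    scores = []
    for c in CRITERIA:
        m = re.search(rf"{c}:\s*(\d+)\s*/\s*10", model_output, re.IGNORECASE)
        if m:
            scores.append(int(m.group(1)))
    if scores:
        return "accept" if sum(scores) / len(scores) >= 6 else "reject"
    # 3. Last resort: unambiguous keyword presence
    text = model_output.lower()
    if "accept" in text and "reject" not in text:
        return "accept"
    if "reject" in text and "accept" not in text:
        return "reject"
    return None  # unparseable output counts as a failed prediction
```

The layered structure is the point: each fallback recovers decisions the stricter layer would drop, which is exactly the kind of parsing robustness the initial 0%-scoring agent lacked.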
The paper review domain provides key evidence of emergent engineering in HyperAgents. The system developed its own parsing, multi-criteria scoring, and decision logic without human guidance. This suggests that self-improvement can address not just algorithmic quality but also the software engineering aspects of AI agent design.
Interpreting the Results: Historical Patterns, Not Objective Truth
An important caveat emphasized by Jenny Zhang and co-authors in the paper is that this domain evaluates prediction of historical acceptance decisions, not objective paper quality. Conference acceptance decisions are inherently noisy: the same paper may receive different outcomes at different conferences or even from different reviewer panels at the same conference. Studies have shown substantial disagreement among reviewers, and acceptance decisions reflect not just paper quality but also reviewer expertise, reviewer workload, conference quotas, and other factors.
The HyperAgents paper review domain therefore tests whether an agent can learn patterns in historical decisions, such as which types of contributions tend to be accepted at specific venues. This is fundamentally different from building a system that can make objective quality judgments about scientific work. The authors explicitly state in arXiv:2603.19461 that they do not intend to change peer review processes with this work.
This distinction matters for interpreting the self-improvement results. When the agent improves its accuracy on paper review prediction, it is learning to better match historical patterns, not necessarily learning to be a better judge of scientific quality. The domain's value lies in demonstrating self-improvement capability on a complex, subjective task rather than in the practical application of automated paper review.
Connection to AI Scientist-v2
The HyperAgents paper review domain has an interesting connection to other AI review research. The AI Scientist-v2 project includes a reviewer agent designed to evaluate scientific papers. In the HyperAgents evaluation, this AI Scientist-v2 reviewer agent was used as a static baseline for the paper review domain, providing a reference point against which to measure self-improvement.
The comparison is informative because it highlights the difference between a hand-engineered review agent (AI Scientist-v2's reviewer, designed with explicit review criteria and evaluation rubrics) and a self-improved agent (DGM-H's evolved task agent). The self-improved agent had to discover its own evaluation strategy through trial and error, without access to the carefully designed prompts and evaluation logic of the static baseline. That it achieved competitive or superior performance demonstrates the power of automated self-improvement over manual engineering for complex, subjective tasks.
Academic Fairness and Ethical Considerations
Automating accept/reject predictions for academic papers raises significant ethical concerns. Such systems could perpetuate biases present in historical review data, disadvantage certain research communities, or be misused to game the review process. The HyperAgents research team frames this domain as a research benchmark, not a deployment-ready tool.
The paper review domain touches on sensitive issues in the academic community. Peer review is a cornerstone of scientific quality control, and concerns about its fairness, consistency, and workload are longstanding. While AI-assisted review tools could potentially help manage reviewer workload, deploying automated accept/reject systems raises serious concerns:
- Bias amplification — Historical acceptance decisions contain biases related to author prestige, institutional affiliation, topic popularity, and reviewer preferences. An agent trained on these decisions may learn and amplify these biases.
- Gaming potential — If researchers know the evaluation criteria of an automated system, they may optimize their papers for machine readability rather than scientific substance.
- Evaluation limitations — Paper quality depends on factors that may not be fully captured in the text: novelty relative to concurrent unpublished work, practical implications, and contributions to specific research communities.
- Transparency — Automated review decisions lack the explanatory depth of human reviews, which provide specific feedback on strengths, weaknesses, and suggestions for improvement.
The HyperAgents team positions this domain purely as a research benchmark for self-improvement capability, not as a step toward automated peer review. This framing is responsible and appropriate given the sensitivity of the application area.
Implementation Details
The paper review domain implementation follows the standard HyperAgents evaluation structure. The core evaluation logic processes each paper through the task agent, which sends the paper text to gpt-4o with its evolved prompting strategy, parses the response, and returns a prediction. The evaluation harness then compares predictions against ground truth labels and computes accuracy.
Key implementation files include the dataset management code that loads and processes the 50MB dataset.csv, the evaluation pipeline that handles API calls and retry logic, and curate_subsets.py for creating the staged evaluation subsets. The input formatting in utils.py ensures a consistent interface between the HyperAgents framework and the domain-specific evaluation logic.
The evaluation pipeline is designed to be robust against API failures and rate limiting, as processing 100 papers through gpt-4o requires significant API throughput. The staged evaluation approach helps manage both cost and latency during the iterative self-improvement process.
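Retry handling of the sort described can be sketched generically. This is not HyperAgents' actual retry code, just the common exponential-backoff-with-jitter pattern used to survive rate limiting:

```python
import random
import time

def call_with_retries(call, max_retries=5, base_delay=1.0):
    """Retry a flaky API call with exponential backoff and jitter.

    `call` is any zero-argument function that raises on failure
    (e.g. a wrapped chat-completion request).
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the harness
            # Back off exponentially, with jitter so parallel workers
            # don't retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```

Wrapping every model call this way keeps a transient rate-limit error from aborting a 100-paper evaluation run.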
Frequently Asked Questions
What data does HyperAgents use for paper review evaluation?
HyperAgents uses real acceptance decisions from ICLR 2024/2025 and NeurIPS 2023/2024 conferences. The dataset contains 300 papers split into 100 training, 100 validation, and 100 test samples, stored in a 50MB dataset.csv file with full paper text. These are among the most competitive machine learning conferences, with rigorous multi-reviewer acceptance processes.
How did the HyperAgents paper review agent go from 0% to working?
The initial agent scored 0% due to format parsing failures: the underlying language model could reason about paper quality, but the agent's code could not correctly extract accept/reject decisions from the model's output. Through DGM-H self-improvement, the system autonomously evolved structured decision pipelines with robust parsing, multi-criteria scoring, and explicit decision logic. This transformation happened entirely without human intervention, demonstrating emergent engineering capability.
Does HyperAgents aim to replace human peer review?
No. The authors of arXiv:2603.19461, including Jenny Zhang and colleagues at Meta, explicitly state they do not intend to change peer review processes. The paper review domain evaluates whether an agent can predict historical acceptance decisions, which tests pattern matching against past decisions, not objective quality assessment. The domain serves purely as a research benchmark for self-improvement capability.
What model does HyperAgents use for paper review evaluation?
The default model for paper review evaluation is gpt-4o, chosen for its strong long-context reasoning capabilities. Processing full academic papers (typically 8-15 pages of technical content) requires a model that can maintain coherence across large input contexts. The agent's evolved prompting strategy and parsing logic determine how the model's output is translated into accept/reject predictions.