HyperAgents for Olympiad Math Grading (IMO)

Based on arXiv:2603.19461

The IMO grading domain is the primary transfer target in the HyperAgents framework: it is the domain used to demonstrate that self-improvement ability transfers across completely different tasks. Meta agents trained on paper review and robotics were frozen and applied to IMO grading, an unseen domain, producing the headline result: imp@50 = 0.630 for the transferred DGM-H versus 0.0 for the hand-crafted DGM. This result is central to the HyperAgents thesis that "improvement ability" is domain-independent.

Domain Overview: Grading Mathematical Proofs

The International Mathematical Olympiad (IMO) is the most prestigious mathematics competition for pre-university students, featuring problems that demand deep mathematical reasoning, creative proof construction, and rigorous argumentation. Grading IMO solutions is itself a challenging intellectual task: expert graders must evaluate whether a student's proof is logically correct, identify partial progress, and assign a score that reflects the quality and completeness of the solution.

In the HyperAgents framework, the IMO grading domain formalizes this as an AI evaluation task. The agent receives four inputs: a mathematical problem, a student's solution attempt, a reference solution, and grading guidelines. It must output a discrete score from the set {0, 1, 6, 7}, following IMO grading conventions:

| Score | Meaning | Description |
|-------|---------|-------------|
| 0 | No progress | The solution does not demonstrate meaningful progress toward the answer |
| 1 | Minor progress | Some relevant observations or partial steps, but far from a complete solution |
| 6 | Nearly complete | Substantial progress with only minor gaps or errors in the argument |
| 7 | Perfect | Complete, correct solution with rigorous argumentation |

This scoring system, with its gap between 1 and 6, reflects IMO grading practice where intermediate scores (2-5) exist but are less common. The four-value discrete output makes this a challenging classification problem: the agent must distinguish between "no progress" and "minor progress" at the low end, and between "nearly complete" and "perfect" at the high end, requiring nuanced understanding of mathematical proof structure.
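Because the output set skips the intermediate scores, any raw integer an agent produces has to be mapped onto the four allowed values. A minimal sketch of such a helper (the function name and snapping policy are hypothetical, not taken from the HyperAgents codebase):

```python
# Allowed IMO-convention scores used by the HyperAgents grading domain.
VALID_SCORES = {0, 1, 6, 7}

def snap_to_valid(raw: int) -> int:
    """Map an arbitrary integer score onto the nearest allowed value.

    Intermediate scores (2-5) exist in real IMO grading but are not part
    of this domain's output set, so they are snapped to the closest
    allowed score, with ties broken toward the lower score.
    """
    return min(sorted(VALID_SCORES), key=lambda v: abs(v - raw))

print(snap_to_valid(7))  # -> 7 (already valid)
print(snap_to_valid(3))  # -> 1 (closer to "minor progress" than to 6)
```

An alternative policy would be to reject out-of-set scores outright and re-query the model; snapping is just the simplest deterministic choice.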

The dataset comes from IMO-GradingBench, a benchmark containing 1000 human-scored IMO solution attempts. Jenny Zhang and the research team at Meta selected 200 samples for the HyperAgents evaluation, split into 100 training and 100 test instances. The default evaluation model is gpt-o4-mini-genai, as described in arXiv:2603.19461 (March 2026).

Input Format and Processing

The input format, defined in grading_utils.py, structures each evaluation instance as a dictionary containing all four components needed for grading:

```python
# Input format for IMO grading evaluation
{
    "domain": "imo_grading",
    "problem": "Let n be a positive integer. Find all functions...",
    "solution": "[Reference solution text]",
    "grading_guidelines": "[Rubric with scoring criteria]",
    "student_answer": "[Student's attempted proof]",
}
```

The grading_utils.py module provides the format_input_dict function that constructs this dictionary from raw dataset rows, along with constants that define the domain configuration. The grading guidelines provide rubric-specific criteria that an expert grader would use, including common approaches, partial credit criteria, and typical errors.

This rich input structure distinguishes IMO grading from simpler classification tasks. The agent has access to all the information a human grader would use, including the reference solution for comparison and explicit grading guidelines. The challenge lies in correctly applying these guidelines to evaluate the student's specific solution attempt.
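The construction described above can be sketched as follows. The output field names follow the example dictionary, but the raw-row column names and the exact signature of `format_input_dict` in `grading_utils.py` are assumptions:

```python
DOMAIN = "imo_grading"

def format_input_dict(row: dict) -> dict:
    """Build the four-component grading input from a raw dataset row.

    `row` is assumed to carry the problem statement, reference solution,
    rubric, and student attempt under these (hypothetical) column names.
    """
    return {
        "domain": DOMAIN,
        "problem": row["problem"],
        "solution": row["reference_solution"],
        "grading_guidelines": row["rubric"],
        "student_answer": row["student_attempt"],
    }

row = {
    "problem": "Let n be a positive integer. Find all functions...",
    "reference_solution": "[Reference solution text]",
    "rubric": "[Rubric with scoring criteria]",
    "student_attempt": "[Student's attempted proof]",
}
inst = format_input_dict(row)
print(inst["domain"])  # -> imo_grading
```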

The Transfer Experiment

The IMO grading domain's most important role in the HyperAgents research is as the primary transfer target. This experiment is the centerpiece of the paper's cross-domain generalization claims. The experimental design is as follows:

Step 1: Train Meta Agents on Source Domains

Meta agents were trained using the DGM-H (Darwin Gödel Machine - HyperAgents) architecture on two source domains: paper review and robotics reward design. During this training, the meta agents learned to modify task agents to improve performance on these specific domains. Crucially, the IMO grading domain was completely withheld: the meta agents never observed any IMO grading data, tasks, or evaluation results during training.

Step 2: Freeze and Transfer

After training, the meta agents were frozen: their code was fixed and no further modifications were allowed. These frozen meta agents were then applied directly to the IMO grading domain. The meta agent's learned strategies for improving task agents, developed in the context of paper review and robotics, were tested on the completely unseen mathematical grading task.

Step 3: Compare Against Hand-Crafted Baseline

The transferred DGM-H meta agent was compared against a hand-crafted DGM meta agent, which is the original Darwin Gödel Machine meta agent designed by human researchers. This baseline represents the best that manual meta-agent engineering can achieve without the self-referential self-improvement of DGM-H.

The Results

Transferred DGM-H Meta

imp@50 = 0.630 — After 50 iterations, the transferred meta agent achieved a 63% improvement rate on IMO grading tasks, despite never having been trained on this domain.

Hand-Crafted DGM Meta

imp@50 = 0.0 — The manually designed meta agent achieved zero improvement on IMO grading, unable to discover effective optimization strategies for this unfamiliar domain.

This stark contrast, 0.630 versus 0.0, is the strongest evidence in the HyperAgents paper for the claim that self-referential self-improvement produces transferable optimization capabilities. The hand-crafted DGM meta agent, while effective on its training domains, could not generalize its improvement strategies to an unseen task. The DGM-H meta agent, which had learned to improve itself during training, developed more general improvement strategies that transferred to a completely different evaluation domain.
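This section does not spell out how imp@50 is computed. One plausible reading (an assumption, not confirmed by the source) is the fraction of optimization runs in which the task agent's score improves over its starting point within the first k iterations:

```python
def imp_at_k(score_traces, k):
    """Fraction of runs whose best score within the first k iterations
    exceeds that run's starting score.

    This is one plausible reading of the imp@k metric; the exact
    definition in the HyperAgents paper may differ.
    """
    improved = 0
    for trace in score_traces:
        baseline = trace[0]  # score before any meta-agent iteration
        if any(score > baseline for score in trace[1:k + 1]):
            improved += 1
    return improved / len(score_traces)

traces = [
    [0.50, 0.52, 0.55],  # this run improves over its baseline
    [0.50, 0.48, 0.50],  # this run never beats its baseline
]
print(imp_at_k(traces, 50))  # -> 0.5
```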

Why This Result Is Significant

The transfer result has profound implications for the field of AI self-improvement. If improvement ability were domain-specific, so that an agent which learns to improve code generation could only improve code generation, then self-improvement would be useful but limited. Each new domain would require training a new meta agent from scratch.

The IMO grading transfer result suggests something more powerful: "improvement ability" is at least partially domain-independent. The meta agent learned general strategies for modifying task agents that work across domains, from predicting paper acceptance to designing reward functions to grading mathematical proofs.

These strategies are fundamentally software engineering skills, not domain-specific knowledge. The DGM-H meta agent appears to have learned general principles of agent engineering that apply regardless of the specific task domain.

Evaluation Metrics: Accuracy and MAE

The IMO grading domain uses two complementary evaluation metrics:

Accuracy measures exact match between the predicted score and the human ground truth. A prediction of 6 when the true score is 7 counts as incorrect, even though it is close. This is a strict metric that rewards precise calibration of the scoring model.

MAE (Mean Absolute Error) measures the average magnitude of scoring errors. A prediction of 6 when the true score is 7 has an MAE contribution of 1, while a prediction of 0 when the true score is 7 has an MAE contribution of 7. This metric captures how "wrong" incorrect predictions are, penalizing large errors more than small ones.

Using both metrics provides a more complete picture of agent performance. An agent with moderate accuracy but low MAE may be making many near-misses (e.g., predicting 6 instead of 7), while an agent with the same accuracy but high MAE may be making fewer but more catastrophic errors (e.g., predicting 0 instead of 7). For practical applications of mathematical grading, low MAE may be more important than high accuracy, as near-miss errors are less consequential than gross misgrading.
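The two metrics above are straightforward to compute from paired predictions and ground-truth scores; a minimal sketch (function names are illustrative, not from the HyperAgents codebase):

```python
def accuracy(preds, truths):
    """Fraction of exact matches between predicted and human scores."""
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)

def mean_absolute_error(preds, truths):
    """Average magnitude of the scoring errors."""
    return sum(abs(p - t) for p, t in zip(preds, truths)) / len(truths)

preds  = [7, 6, 0, 1]
truths = [7, 7, 7, 1]
print(accuracy(preds, truths))             # 2 of 4 exact matches -> 0.5
print(mean_absolute_error(preds, truths))  # (0 + 1 + 7 + 0) / 4 -> 2.0
```

Note how the single catastrophic error (predicting 0 for a true 7) dominates the MAE while counting the same as a near-miss under accuracy, which is exactly why the two metrics are complementary.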

Implementation Details

The IMO grading domain implementation lives in domains/imo/ within the HyperAgents repository. The codebase includes several specialized modules:

Core Modules

grading_utils.py provides the domain constants and the format_input_dict function that constructs the input dictionary from raw dataset rows. This module defines the interface between the HyperAgents evaluation framework and the IMO-specific grading logic.

proof_eval.py implements the proof evaluation logic, which processes the agent's output and extracts the discrete score. This module handles the parsing of LLM responses, validation of scores against the allowed set {0, 1, 6, 7}, and comparison against ground truth.
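The parsing-and-validation step described above might look like the following sketch; this is a hypothetical illustration, not the actual `proof_eval.py` logic:

```python
import re

VALID_SCORES = {0, 1, 6, 7}

def extract_score(llm_response: str):
    """Pull the final numeric score out of a free-form LLM grading response.

    Scans the standalone integers in the text from last to first and
    accepts the first one that belongs to the allowed set; returns None
    on failure so callers can retry or flag the instance.
    """
    for token in reversed(re.findall(r"\b\d+\b", llm_response)):
        score = int(token)
        if score in VALID_SCORES:
            return score
    return None

print(extract_score("The proof has minor gaps. Final score: 6"))  # -> 6
print(extract_score("I would award 4 points"))                    # -> None
```

Scanning from the end favors the model's final verdict over numbers quoted earlier in its reasoning (e.g. problem indices or intermediate point counts).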

setup_proofgrader_repo.py (approximately 11KB) handles the setup of the ProofGrader evaluation environment. This is the largest module in the IMO domain and manages the initialization and configuration of the proof grading pipeline, including model loading, evaluation environment setup, and result serialization.

Data Management

curate_subsets.py creates evaluation subsets for staged evaluation, similar to other HyperAgents domains. This allows rapid iteration during self-improvement without running the full 100-task evaluation at every step.
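A staged-evaluation subset can be as simple as a seeded random draw; a minimal sketch of what `curate_subsets.py` might do (the real selection may stratify by score distribution or problem type):

```python
import random

def curate_subset(task_ids, size, seed=0):
    """Draw a fixed, reproducible evaluation subset for staged evaluation.

    Cheap subsets let the self-improvement loop iterate quickly, with the
    full 100-task set reserved for final scoring. The fixed seed makes
    every candidate task agent face the same subset.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(list(task_ids), size))

subset = curate_subset(range(100), 20)
print(len(subset))  # -> 20
```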

analyze_imo_proof.py provides proof analysis utilities for debugging and understanding agent behavior. This module helps researchers examine how the agent processes specific proofs and where its grading decisions go wrong.

Challenges in Mathematical Grading

Mathematical proof grading is inherently challenging for several reasons that make this domain a demanding test of self-improvement:

Rubric Interpretation

IMO grading guidelines provide criteria for awarding partial credit, but applying these criteria requires judgment. A rubric might state "award 1 point for correctly identifying the key algebraic identity," but determining whether a student's work constitutes "correctly identifying" the identity requires understanding the mathematical content at a deep level. The agent must interpret rubric language in the context of specific mathematical arguments.

Proof Structure Analysis

Evaluating mathematical proofs requires understanding logical structure: premises, deductions, lemmas, and conclusions. A proof may be essentially correct but poorly organized, or it may appear well-structured but contain a subtle logical gap. The agent must assess not just surface-level features of the solution text but the underlying logical validity of the argument.

Calibration Sensitivity

The gap between scores 1 and 6 in the IMO scoring system means that calibration is critical. Misclassifying a "minor progress" solution (score 1) as "nearly complete" (score 6) is a major error, yet the distinction can be subtle. The agent must develop calibration that matches human graders' standards, which may vary somewhat between graders and across different problem types.

Bias Risks

Mathematical grading systems can exhibit biases related to proof style (formal vs. informal), notation conventions, language quality (many IMO participants write in their second or third language), and solution approach (standard vs. creative). An agent trained on human-scored data may inherit these biases, potentially disadvantaging certain solution styles.

Deployment Considerations

The discrete scoring system {0, 1, 6, 7} amplifies the impact of borderline decisions, and grading outcomes are sensitive to calibration and rubric interpretation. While the HyperAgents IMO domain provides a valuable benchmark for self-improvement research, any practical deployment of automated mathematical grading would require extensive validation against human grading standards and careful consideration of fairness implications.

Connection to IMO-GradingBench

The HyperAgents IMO domain draws its data from IMO-GradingBench, a benchmark specifically designed for evaluating automated mathematical proof grading. IMO-GradingBench contains 1000 human-scored solution attempts spanning multiple years of IMO problems. The scores were assigned by experienced IMO graders following official rubrics, providing high-quality ground truth labels.

The HyperAgents team selected 200 samples from this benchmark (100 training, 100 test), choosing a subset that provides good coverage across problem types, difficulty levels, and score distributions. The default evaluation model, gpt-o4-mini-genai, was chosen to balance evaluation quality against computational cost, as processing mathematical proofs requires capable but not necessarily maximal language models.

IMO-GradingBench itself represents ongoing research into AI-assisted mathematical evaluation, and the HyperAgents results add a new dimension to this work: demonstrating that self-improving agents can develop mathematical grading capabilities through cross-domain transfer, without explicit training on mathematical tasks.

Implications for Cross-Domain Self-Improvement

The IMO grading transfer experiment is the strongest evidence in the HyperAgents research for the existence of transferable improvement skills. The result suggests a research direction where meta agents could be trained on computationally cheap or easily evaluated domains and then transferred to expensive or difficult-to-evaluate domains, dramatically reducing the cost of developing self-improving agents for new applications.

However, the result also raises questions. Is the transfer specific to the source domain combination (paper review + robotics), or would other source domains work equally well? Does transfer quality degrade as the target domain becomes more dissimilar from the source domains? Can the transferred meta agent continue to improve on the target domain, or does it plateau after the initial transfer? These questions point to important future work in the HyperAgents research program.

Frequently Asked Questions

What is the IMO grading task in HyperAgents?

The IMO grading task takes four inputs: a math problem, a student's solution attempt, a reference solution, and grading guidelines. It must output a discrete score from {0, 1, 6, 7}, following IMO conventions where 0 means no progress, 1 means minor progress, 6 means nearly complete, and 7 means a perfect solution. The data comes from IMO-GradingBench, which contains 1000 human-scored samples from International Mathematical Olympiad competitions.

Why is IMO grading the primary transfer target for HyperAgents?

IMO grading was chosen as the transfer target because it was completely withheld during meta agent training. Meta agents were trained on paper review and robotics domains, then frozen and applied to IMO grading without any adaptation. This experimental design directly tests whether "improvement ability" is domain-independent, which is the central thesis of the HyperAgents research by Jenny Zhang et al. at Meta.

What does imp@50 = 0.630 mean in HyperAgents results?

The imp@50 metric measures improvement after 50 iterations of meta agent optimization. A value of 0.630 means the transferred DGM-H meta agent achieved a 63% improvement rate on IMO grading tasks after 50 iterations, despite never having been trained on mathematical grading. In contrast, the hand-crafted DGM meta agent achieved imp@50 = 0.0 (zero improvement), demonstrating that self-referential self-improvement discovers transferable optimization strategies that hand-designed approaches cannot match.

What evaluation metrics does HyperAgents use for IMO grading?

HyperAgents uses two metrics: accuracy (exact match between predicted and ground-truth scores) and MAE (Mean Absolute Error, measuring average magnitude of scoring errors). Both are computed against human-scored ground truth from IMO-GradingBench. Using both metrics captures both precision (accuracy) and severity of errors (MAE), providing a complete picture of grading quality.
