The task agent is the domain-facing component of HyperAgents — the part that actually solves problems. While the meta agent handles self-modification, the task agent handles everything from generating code patches to reviewing academic papers to grading mathematical proofs to writing robotic reward functions. This guide covers the TaskAgent class, its domain-specific behavior, the LLM integration layer, tool-augmented interfaces, the ensemble mechanism, and how the task agent evolves through meta-agent-driven modification.
- TaskAgent extends AgentSystem with a forward(inputs) method taking a dict with a 'domain' key
- Four domains: polyglot coding, paper review, IMO grading, and robotics
- LiteLLM integration supports Claude, GPT, Gemini, and o3-mini (default)
- The task agent evolves as the meta agent modifies task_agent.py across generations
- Ensemble mechanism aggregates predictions from the best archive agents
The TaskAgent Class
The TaskAgent class is defined in agent/task_agent.py and extends the abstract AgentSystem base class, the same base class used by MetaAgent. The key method is forward(inputs), which receives a dictionary containing a 'domain' key and domain-specific data, processes the input using LLM-powered reasoning, and returns a JSON response with a "response" key.
```python
class TaskAgent(AgentSystem):
    """Domain-specific problem-solving agent."""

    def forward(self, inputs: dict) -> dict:
        domain = inputs['domain']
        # Build domain-specific instruction
        instruction = self.build_instruction(domain, inputs)
        # Run agent-tool interaction loop
        response = self.chat_with_agent(
            instruction=instruction,
            tools_available=self.domain_tools(domain)
        )
        # Extract structured JSON from response
        result = extract_jsons(response)
        return {"response": result}
```
The forward() method follows a consistent pattern regardless of domain: build an instruction string, run the chat_with_agent loop with domain-appropriate tools, and extract structured output from the LLM's response. The extract_jsons utility (from the utils module) handles multiple extraction patterns, including JSON code blocks, markdown-fenced JSON, and bare JSON objects. This robustness is important because LLMs do not always format their JSON output consistently.
The output format is always a dictionary with a "response" key. The content of the response varies by domain — a code patch for polyglot, a classification for paper review, a score for IMO grading, a Python function for robotics — but the wrapping structure is consistent. This uniform interface allows the evaluation pipeline to handle all domains through the same orchestration code.
The run_task_agent.py CLI
The run_task_agent.py script provides a command-line interface for running the task agent outside of the full generate_loop pipeline. This is useful for testing individual agent variants against specific problems, debugging domain-specific behavior, and running standalone evaluations.
The CLI accepts the following arguments:
- --problem_statement — The problem to solve, formatted according to the domain's conventions
- --git_dir — Path to the repository (used primarily for polyglot tasks where the agent modifies code)
- --base_commit — The git commit to use as the starting point for changes
- --chat_history_file — Optional path to a file containing previous interaction history for multi-turn tasks
- --test_description — Description of how the task output will be evaluated
- --language — Programming language for polyglot tasks
- --model — LLM model to use (default: o3-mini)
```shell
# Example: Run task agent on a polyglot coding problem
python run_task_agent.py \
    --problem_statement "Add error handling to the parse_config function" \
    --git_dir /repo \
    --base_commit abc123 \
    --test_description "Tests check that invalid config files raise ConfigError" \
    --language python \
    --model o3-mini
```
The default model choice of o3-mini reflects a practical trade-off between capability and cost. For the task agent, which runs many times across many problems, a smaller and faster model reduces the compute budget consumed per generation. The meta agent, which runs once per generation and makes high-stakes decisions, can afford a more capable (and expensive) model.
Domain-Specific Behavior
The task agent's behavior varies significantly across the four domains evaluated in the HyperAgents paper by Jenny Zhang et al. (arXiv:2603.19461). Each domain has its own input format, output format, evaluation criteria, and tool requirements. Understanding these domain-specific behaviors is essential for understanding how HyperAgents achieves cross-domain generalization.
Polyglot Coding
In the polyglot coding domain, the task agent receives a repository and an instruction describing a code change to make. The agent reads the relevant source files, reasons about the required modification, and produces a code patch. The patch is applied to the repository and evaluated by running the project's test suite. The score is based on the percentage of tests that pass.
This domain exercises the task agent's ability to understand codebases, reason about program semantics, and produce syntactically valid code across multiple programming languages. The "polyglot" aspect means the agent must handle Python, JavaScript, TypeScript, Java, Go, Rust, and other languages, requiring broad knowledge of syntax and idioms.
The tool set for polyglot tasks includes file reading, file writing, shell execution (for running tests), and git operations (for creating patches). The agent typically reads the relevant files, understands the existing code structure, writes modifications, and runs tests to verify its changes before producing the final patch.
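The per-domain tool assignment described above can be sketched as a simple mapping. The names below (DOMAIN_TOOLS, domain_tools, and the individual tool strings) are hypothetical illustrations, not the actual registry in agent/task_agent.py:

```python
# Hypothetical domain -> tool-set mapping; real tool names may differ.
DOMAIN_TOOLS = {
    "polyglot": ["read_file", "write_file", "run_shell", "git_diff"],
    "paper_review": [],   # pure text reasoning, no tools needed
    "imo_grading": [],    # likewise operates only on the input text
    "robotics": ["write_file", "run_shell"],
}

def domain_tools(domain: str) -> list[str]:
    """Return the tool names available for a given domain."""
    return DOMAIN_TOOLS.get(domain, [])
```

A lookup like this keeps the tool surface small for text-only domains, which reduces both prompt length and the chance of spurious tool calls.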
Paper Review
In the paper review domain, the task agent receives the full text of an academic paper and must predict whether it would be accepted or rejected at a top-tier venue. The input is a dictionary with 'domain': 'paper_review' and 'paper_text' containing the paper content. The output is a classification: accept or reject.
This domain tests the task agent's ability to evaluate research quality, identify methodological strengths and weaknesses, assess novelty, and predict human reviewer decisions. It is a high-level reasoning task that requires understanding of research norms, statistical methodology, and domain-specific conventions.
Scoring is straightforward: the agent's prediction is compared against the ground-truth outcome (actual accept/reject decision). The task agent is evaluated on accuracy across a corpus of papers with known outcomes. No tool use is required for this domain — the agent operates purely on the input text.
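The accuracy computation is simple enough to sketch directly; review_accuracy below is an illustrative helper, not a function from the codebase:

```python
def review_accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of accept/reject predictions that match known outcomes."""
    assert len(predictions) == len(ground_truth)
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)
```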
IMO Grading
In the IMO (International Mathematical Olympiad) grading domain, the task agent receives a mathematical problem, a student's written solution, and a scoring rubric. It must assign a discrete score from the set {0, 1, 6, 7}, which represents the standard IMO scoring scale where 0 means no progress, 1 means minor progress, 6 means nearly complete solution with minor gaps, and 7 means a complete and correct solution.
This domain tests mathematical reasoning and the ability to evaluate proof quality against a rubric. The task agent must understand the problem, follow the student's reasoning, identify errors or gaps, and map its assessment to the discrete scoring scale. The challenge is that mathematical proofs can be correct but poorly written, or elegantly written but subtly flawed.
Scoring is based on agreement with human expert graders. The task agent receives the ground-truth score and is evaluated on how often its assigned score matches the expert assessment.
Robotics
In the robotics domain, the task agent receives a natural language task description (e.g., "Make the robot pick up the red block and place it on the blue platform") and must generate a Python reward function that, when used to train a reinforcement learning agent in simulation, produces the desired behavior.
This domain tests the agent's ability to translate high-level task specifications into precise mathematical reward signals. Writing good reward functions is notoriously difficult — a poorly designed reward function can lead to reward hacking, where the RL agent finds unexpected ways to maximize the reward without actually completing the intended task.
The output is a Python function that takes simulation state as input and returns a scalar reward. Evaluation runs the RL training pipeline with the generated reward function and scores the resulting behavior against the task specification. This is the most computationally expensive domain and benefits from the GPU support provided by the Docker execution environment (nvidia/cuda:13.0.0-devel-ubuntu22.04 with EGL rendering).
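A generated reward function might look like the following sketch. It assumes a hypothetical state layout in which object positions are exposed as (x, y, z) tuples under the keys shown; real simulator state formats vary by environment:

```python
import math

def reward_fn(state: dict) -> float:
    """Illustrative reward for 'place the red block on the blue platform'.

    The state keys 'red_block_pos' and 'blue_platform_pos' are assumptions
    for this sketch, not a real simulator API.
    """
    bx, by, bz = state["red_block_pos"]
    px, py, pz = state["blue_platform_pos"]
    dist = math.sqrt((bx - px) ** 2 + (by - py) ** 2 + (bz - pz) ** 2)
    # Dense shaping term: the closer the block is to the platform, the better.
    reward = -dist
    # Sparse bonus once the block is effectively on the platform.
    if dist < 0.05:
        reward += 10.0
    return reward
```

Combining a dense shaping term with a sparse completion bonus is a common pattern: the dense term gives the RL agent a learning gradient, while the bonus anchors the reward to actual task completion and makes reward hacking harder.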
| Domain | Input | Output | Evaluation |
|---|---|---|---|
| Polyglot | Repo + instruction | Code patch | Test suite pass rate |
| Paper Review | Paper text | Accept / reject | Accuracy vs. ground truth |
| IMO Grading | Problem + solution + rubric | Score (0/1/6/7) | Agreement with human graders |
| Robotics | Task description | Python reward function | RL training outcome |
The LLM Integration Layer
Both the task agent and the meta agent communicate with LLMs through the llm.py integration layer. This module provides a unified interface to multiple model providers using LiteLLM, an open-source library that abstracts away provider-specific API differences.
The supported models include:
- Claude — sonnet-4.5 from Anthropic, known for strong reasoning and careful output
- GPT — gpt-5.2 and gpt-5 from OpenAI, the latest frontier models
- Gemini — gemini-3-pro-preview from Google, with strong multimodal capabilities
- o3-mini — OpenAI's efficient reasoning model, used as the default for the task agent due to its cost-efficiency
The llm.py module includes exponential backoff retry logic for handling transient API failures. When a model call fails due to rate limiting, network issues, or server errors, the system waits an exponentially increasing amount of time before retrying. This is essential for production reliability, especially during long evolutionary runs where thousands of API calls are made across many concurrent agent variants.
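The backoff pattern described here can be sketched as a small wrapper. call_with_backoff and the stand-in exception types below are illustrative, not llm.py's actual code:

```python
import random
import time

# Stand-ins for provider-specific transient errors (rate limits, timeouts).
RETRYABLE = (TimeoutError, ConnectionError)

def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry fn() with exponential backoff plus jitter on transient failures.

    Non-retryable exceptions propagate immediately; retryable ones trigger
    waits of base_delay * 2**attempt, capped at max_delay.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RETRYABLE:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the failure
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

The jitter term spreads out retries from concurrent agent variants so they do not all hammer the provider at the same instant after a rate-limit event.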
Model selection affects the quality of both task solving and self-modification. More capable models produce better initial task solutions and more insightful self-modifications. However, they also cost more per call, which matters when running population-based search with many concurrent evaluations. The choice of model is a hyperparameter that researchers can tune based on their compute budget and quality requirements.
Tool-Augmented Interface
The llm_withtools.py module extends the basic LLM interface with tool-calling capabilities. When the task agent or meta agent needs to interact with external systems (file system, shell, git), it uses this tool-augmented interface.
The module handles three key responsibilities:
- Tool use prompt generation — It constructs a system prompt that describes the available tools, their parameters, and their expected formats. This prompt is prepended to the conversation so the LLM knows what tools are available and how to invoke them.
- JSON tool call parsing — When the LLM outputs a tool call in the <json>{"tool_name": "...", "tool_input": {...}}</json> format, the module parses the JSON, validates the tool name and parameters, and dispatches the call to the appropriate handler.
- Multi-turn interaction management — The module maintains the conversation history across multiple turns, appending tool call results to the history and re-prompting the LLM until it produces a final answer without tool calls.
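The parse-and-dispatch step for the <json> tool call format can be sketched as follows. parse_tool_call and the handler-dict dispatch are assumptions about the design, not llm_withtools.py's actual implementation:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<json>(.*?)</json>", re.DOTALL)

def parse_tool_call(llm_output: str, handlers: dict):
    """Parse a <json>...</json> tool call and dispatch it to a handler.

    Returns None when no tool call is present, signalling that the
    output should be treated as the final answer.
    """
    match = TOOL_CALL_RE.search(llm_output)
    if match is None:
        return None
    call = json.loads(match.group(1))
    name, args = call["tool_name"], call.get("tool_input", {})
    if name not in handlers:
        raise ValueError(f"unknown tool: {name}")
    return handlers[name](**args)
```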
The extract_jsons utility function, used to parse the task agent's final output, employs multiple pattern matching strategies. It tries to extract JSON from dedicated JSON code blocks first, then falls back to markdown-fenced code blocks, and finally attempts to parse bare JSON objects from the response text. This multi-strategy approach ensures robust output parsing even when the LLM's formatting is inconsistent — a common challenge when working with diverse model providers.
How the Task Agent Evolves
The task agent is not a static component. This is one of the defining features of the HyperAgents DGM-H architecture: the meta agent modifies task_agent.py across generations, creating new versions with improved strategies. Understanding how this evolution works is key to understanding why HyperAgents outperform fixed agents over time.
When the meta agent runs, it examines the task agent's recent performance across all domains. It reads task_agent.py, understands its current strategy, identifies weaknesses, and writes modifications. These modifications can range from minor prompt adjustments to major architectural changes.
Common types of modifications observed in the HyperAgents research include:
- Prompting improvements — Adding chain-of-thought instructions, few-shot examples, or domain-specific heuristics to the task agent's prompts
- Tool usage patterns — Teaching the task agent to use tools more effectively, such as reading test files before writing code or running intermediate checks during problem solving
- Output formatting — Improving the JSON output structure to reduce parsing failures, which directly impacts evaluation scores
- Domain routing — Adding conditional logic that applies different strategies for different domains, rather than using a one-size-fits-all approach
- Error recovery — Adding retry logic, fallback strategies, and error handling to make the task agent more robust against edge cases
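The domain-routing pattern in particular can be sketched as a dispatch inside forward(); the per-domain method names below are hypothetical, illustrating the shape of the logic rather than any actual evolved code:

```python
def forward_with_routing(self, inputs: dict) -> dict:
    """Route each input to a domain-specific strategy (names hypothetical)."""
    domain = inputs["domain"]
    if domain == "polyglot":
        return self.solve_coding(inputs)    # read files, patch, run tests
    elif domain == "paper_review":
        return self.classify_paper(inputs)  # pure-text accept/reject
    elif domain == "imo_grading":
        return self.grade_proof(inputs)     # rubric-guided scoring
    elif domain == "robotics":
        return self.write_reward(inputs)    # emit a reward function
    raise ValueError(f"unknown domain: {domain}")
```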
Over many generations, these accumulated modifications transform the task agent from a simple prompt-and-respond system into a sophisticated problem-solving pipeline with domain-specific strategies, error recovery, and self-monitoring capabilities. The meta agent's modifications are preserved in the archive, so successful innovations are carried forward and built upon by future generations.
Jenny Zhang et al. report in arXiv:2603.19461 that after dozens of generations, the task agent's code bore little resemblance to the original hand-written version. The meta agent had added utility functions, restructured the control flow, introduced domain-specific preprocessing, and developed novel prompting strategies — all through autonomous self-modification.
The Ensemble Mechanism
The ensemble.py module provides a mechanism for aggregating predictions from the best-performing agent variants in the archive. Rather than relying on a single agent variant for predictions, the ensemble mechanism identifies the top performers and combines their outputs.
The ensemble works by scanning the archive for the highest-scoring agents on each domain. For a given problem, the ensemble runs the task agent from each selected variant and aggregates the results. For classification tasks (like paper review), this could be a majority vote. For scoring tasks (like IMO grading), it could be an average or median. For generation tasks (like polyglot coding), the ensemble may select the output from the variant with the highest historical score on that domain.
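The per-domain aggregation rules mentioned above are straightforward to sketch; these helpers are illustrative, not ensemble.py's actual code:

```python
import statistics
from collections import Counter

def aggregate_votes(labels: list[str]) -> str:
    """Majority vote for classification domains such as paper review."""
    return Counter(labels).most_common(1)[0][0]

def aggregate_scores(scores: list[float]) -> float:
    """Median for scoring domains such as IMO grading; robust to a
    single outlier variant in a way that the mean is not."""
    return statistics.median(scores)
```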
This mechanism provides several benefits. First, it reduces variance — a single agent variant might have a bad response to a particular problem, but the ensemble smooths out individual failures. Second, it leverages diversity — different agent variants may have different strengths, and the ensemble captures the best of each. Third, it provides a natural baseline — the ensemble's performance represents the collective knowledge of the entire evolutionary history.
The ensemble mechanism is implemented in agent/ensemble.py and is used primarily during final evaluation, after the evolutionary run is complete. During the evolutionary process itself, individual agent variants are evaluated independently to maintain clean selection pressure.
Error Handling and Robustness
Robustness is critical for a system that runs thousands of evaluations across diverse domains with varying LLM responses. The task agent incorporates several layers of error handling to ensure that transient failures do not corrupt the evolutionary process.
The extract_jsons utility uses multiple pattern matching strategies to parse the task agent's JSON output. It first looks for JSON blocks delimited by the <json> tag format. If that fails, it tries markdown code blocks (triple backticks with a json language specifier). If that fails, it attempts to find bare JSON objects using brace matching. This cascade of fallback strategies handles the wide variety of output formats that different LLMs produce.
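That cascade can be sketched as follows; this is an illustrative reconstruction of the strategy described, not the actual utility from the utils module:

```python
import json
import re

def extract_jsons(text: str) -> list:
    """Extract JSON objects via a cascade of fallback strategies."""
    # 1. Dedicated <json>...</json> blocks.
    candidates = re.findall(r"<json>(.*?)</json>", text, re.DOTALL)
    # 2. Markdown-fenced ```json blocks.
    if not candidates:
        candidates = re.findall(r"```json\s*(.*?)```", text, re.DOTALL)
    # 3. Bare objects found by brace matching.
    if not candidates:
        candidates = _brace_matched_objects(text)
    results = []
    for c in candidates:
        try:
            results.append(json.loads(c))
        except json.JSONDecodeError:
            continue  # skip malformed candidates rather than crash
    return results

def _brace_matched_objects(text: str) -> list:
    """Collect top-level {...} spans by counting brace depth.

    A simplification: braces inside string literals would confuse this,
    which a production version would need to handle.
    """
    spans, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i + 1])
    return spans
```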
At the LLM integration level, llm.py implements exponential backoff retry for API failures. The retry logic distinguishes between retryable errors (rate limits, server errors, network timeouts) and non-retryable errors (invalid API keys, malformed requests). This prevents the system from wasting time on requests that will never succeed while being resilient to transient infrastructure issues.
At the evaluation level, each domain's scoring function handles edge cases: empty responses default to the lowest score, malformed outputs are penalized but do not crash the pipeline, and timeout protections prevent runaway evaluations (particularly important for the robotics domain where RL training can potentially run indefinitely).
Connection to the Broader HyperAgents System
The task agent does not operate in isolation. It is one component of the larger DGM-H architecture, working alongside the meta agent and the open-ended exploration mechanism. The task agent solves problems; the meta agent improves how the task agent solves problems; and the archive and selection mechanisms ensure that the best improvements are preserved and built upon.
This three-way interaction creates the self-referential loop that is the hallmark of the HyperAgents system. The task agent produces scores. The scores inform the meta agent's modification decisions. The meta agent modifies the task agent. The modified task agent produces different scores. The cycle continues, generation after generation, with the archive preserving successful innovations as stepping stones for future discovery.
For researchers implementing or extending HyperAgents, the task agent is the most frequently modified component. Both human developers and the meta agent tend to focus their efforts here, because changes to task-solving logic have the most direct impact on evaluation scores. The meta agent is modified less frequently but with greater impact — changes to the improvement strategy affect all future generations.
Frequently Asked Questions
What is the task agent in HyperAgents?
The task agent is the domain-specific problem-solving component of HyperAgents, implemented as the TaskAgent class in agent/task_agent.py. It extends AgentSystem, receives inputs with a 'domain' key and domain-specific data, and returns JSON responses. It handles polyglot coding (producing code patches), paper review (accept/reject predictions), IMO math grading (discrete scores), and robotics (Python reward functions).
How does the task agent differ across domains?
In polyglot coding, the task agent receives a repository and instruction, producing a code patch evaluated by test suites. In paper review, it processes paper text and predicts accept or reject. In IMO grading, it evaluates a problem-solution-rubric triple and assigns a discrete score from {0, 1, 6, 7}. In robotics, it generates a Python reward function from a task description. Each domain uses different tools and evaluation criteria.
What LLM models does HyperAgents support?
HyperAgents uses LiteLLM for multi-model support. Supported models include Claude (sonnet-4.5), GPT (gpt-5.2, gpt-5), Gemini (gemini-3-pro-preview), and o3-mini (the default for the task agent CLI). The llm.py integration layer handles model routing and includes exponential backoff retry logic for reliability during extended evolutionary runs.
How does the task agent evolve over time?
The meta agent modifies task_agent.py across generations, creating new versions with improved strategies. Modifications include better prompting, new tool usage patterns, domain-specific heuristics, improved output formatting, and error recovery. The ensemble.py mechanism identifies the best-performing variant from the archive for each domain, aggregating predictions from top performers.