The robotics domain in HyperAgents tests self-improving agents on one of the most challenging problems in robot learning: designing reward functions. Using the Genesis physics simulator, agents generate Python reward functions that train quadruped locomotion policies via PPO. The training task is "walk forward," while the test demands zero-shot generalization to a fundamentally different objective: maximizing torso height. This domain bridges LLM-based self-improvement with physical simulation, connecting the HyperAgents framework to the broader reward engineering research landscape.
- Genesis simulator — Quadruped Go2 robot with 4096 parallel environments
- Task format — Natural language description → Python reward function → PPO training → task score
- Training task — "walk forward" (go2walking); test task — "maximize torso height" (zero-shot)
- Evaluation — 20-second sim, 4-second episodes, 0-1 fitness score, early termination on falls
- GPU-accelerated — NVIDIA CUDA Docker support with EGL headless rendering
Domain Overview: From Language to Locomotion
Reward function design sits at a critical bottleneck in reinforcement learning. While RL algorithms like PPO can train policies to optimize any reward signal, specifying the right reward function for a desired behavior remains a manual, iterative, and often frustrating process. The HyperAgents robotics domain reframes this as an AI self-improvement challenge: can a self-modifying agent learn to design better reward functions through automated experimentation?
The pipeline works as follows. The agent receives a natural language task description (e.g., "make the robot walk forward"). It generates a Python reward function that maps the robot's state, including joint positions, velocities, body orientation, and foot contacts, to a scalar reward signal. This reward function is then used to train a PPO policy in the Genesis physics simulator. Finally, the trained policy is evaluated on the task, producing a fitness score between 0 and 1.
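The four stages above can be sketched as a simple driver loop. This is an illustrative sketch only: every helper here is a stand-in stub (the real system makes an LLM call, trains PPO in Genesis, and runs the 20-second evaluation), and none of the names are taken from the HyperAgents codebase.

```python
# Illustrative sketch of the language -> reward -> PPO -> score pipeline.
# All helpers are stand-in stubs, not the HyperAgents API.

def generate_reward_code(task: str) -> str:
    # In the real system this is an LLM call; here, a fixed template.
    return "def reward(forward_vel):\n    return forward_vel"

def compile_reward(source: str):
    namespace = {}
    exec(source, namespace)          # materialize the generated function
    return namespace["reward"]

def train_and_evaluate(reward_fn) -> float:
    # Stand-in for PPO training plus the 20 s evaluation: score the
    # reward over a fake rollout and clamp into the documented [0, 1] range.
    rollout_velocities = [0.2, 0.4, 0.6]
    mean_r = sum(reward_fn(v) for v in rollout_velocities) / len(rollout_velocities)
    return max(0.0, min(1.0, mean_r))

def run_pipeline(task: str) -> float:
    code = generate_reward_code(task)      # language -> code
    reward_fn = compile_reward(code)       # code -> callable
    return train_and_evaluate(reward_fn)   # simulation -> fitness

score = run_pipeline("make the robot walk forward")
print(score)  # a fitness score in [0, 1]
```

The key structural point the sketch captures is the indirection: the agent's output (code) is two expensive stages removed from the signal it is judged on (the fitness score).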
This multi-stage pipeline, from language to code to simulation to evaluation, makes the robotics domain uniquely challenging among the HyperAgents evaluation domains. Each evaluation is computationally expensive (requiring GPU-accelerated physics simulation and RL training), the feedback is delayed (the quality of a reward function only becomes apparent after training), and the search space is vast (there are infinitely many Python functions that could serve as reward functions).
Jenny Zhang, Bingchen Zhao, and the broader research team at Meta describe this domain in detail in their paper (arXiv:2603.19461, March 2026), positioning it as a test of whether self-improvement can operate effectively through indirect, delayed feedback signals.
The Genesis Physics Simulator
Genesis is a differentiable physics simulator designed for robot learning research. The HyperAgents robotics domain uses Genesis to simulate a Go2 quadruped robot, a four-legged platform commonly used in legged locomotion research. Genesis provides several advantages for this evaluation:
- Massively parallel simulation — 4096 environments run simultaneously on a single GPU, enabling rapid policy training
- Accurate rigid-body physics — Realistic contact dynamics, joint limits, and motor models for the Go2 platform
- Differentiable dynamics — While not directly used by HyperAgents, Genesis's differentiability supports gradient-based reward optimization in related work
- Fast iteration — GPU acceleration enables training a full locomotion policy in minutes rather than hours
The simulation setup uses a timestep of dt=0.02 seconds with 4-second episodes (200 simulation steps per episode). Full evaluation runs for 20 seconds of simulated time, during which the trained policy is tested under varying conditions. Speed commands are randomly sampled per episode, testing whether the policy can handle different locomotion targets rather than just one fixed velocity.
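The timing constants above fit together as a quick consistency check, using only the numbers stated in this section:

```python
# Timing arithmetic for the Genesis evaluation setup described above.
DT = 0.02              # simulation timestep, seconds
EPISODE_SECONDS = 4.0  # one episode
EVAL_SECONDS = 20.0    # full evaluation window

steps_per_episode = int(EPISODE_SECONDS / DT)
episodes_per_eval = int(EVAL_SECONDS / EPISODE_SECONDS)

print(steps_per_episode)  # 200, matching the "200 simulation steps" figure
print(episodes_per_eval)  # 5
```

So a full evaluation implies five 4-second episodes, each with a freshly sampled speed command.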
Training and Test Tasks
The robotics domain uses a carefully designed train/test split that probes genuine generalization of reward engineering skill rather than memorization.
Training Task: Walk Forward
The training task is the go2walking domain: design a reward function that makes the Go2 quadruped walk forward effectively. Forward locomotion is one of the most studied problems in legged robotics, and there is substantial prior work on reward function design for this task. The meta agent uses this task during self-improvement to evaluate modifications: does a change to the task agent's reward-generation code produce better walking policies?
A good walking reward function must balance multiple objectives: forward velocity (primary goal), energy efficiency (minimizing joint torques), stability (maintaining upright posture), smoothness (avoiding jerky movements), and foot contact patterns (proper gait timing). The agent must discover appropriate reward terms and weightings for these objectives through its generated Python code.
Test Task: Maximize Torso Height
The test task is fundamentally different from the training task: generate a reward function that maximizes the robot's torso height. This is a zero-shot transfer test. The agent has never been evaluated on this objective during training. It must apply the reward engineering skills learned from the walking task to a completely different locomotion objective.
Maximizing torso height requires a different set of reward terms than walking. Instead of forward velocity, the agent must reward upward displacement. Stability becomes even more critical, as the robot needs to maintain an upright posture while reaching its maximum height, potentially by rearing up on its hind legs. This zero-shot generalization test reveals whether the self-improved agent has learned general reward engineering principles or merely memorized walking-specific reward templates.
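A reward function for this objective might look like the following sketch. The specific terms, weights, and the 0.55 m target height are illustrative assumptions, not values from the paper, and the function operates on a single state for clarity where the real pipeline is batched over 4096 environments:

```python
# Illustrative (not from the paper) reward sketch for "maximize torso height".

def height_reward(torso_height, roll, pitch, target_height=0.55):
    # Primary term: reward upward displacement toward an assumed target
    # height (0.55 m is a hypothetical value, not from the paper).
    height_term = min(torso_height / target_height, 1.0)
    # Stability term: penalize tilt, since exceeding 10 degrees of roll
    # or pitch terminates the episode and zeroes the remaining reward.
    stability_term = -(abs(roll) + abs(pitch))
    return 1.0 * height_term + 0.5 * stability_term

print(height_reward(0.55, 0.0, 0.0))  # 1.0: at target height, level torso
```

Note how the primary term swaps from forward velocity to height while the stability term survives unchanged; that shared structure is exactly the kind of transferable reward engineering knowledge the zero-shot test probes for.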
Evaluation Metrics and Early Termination
The fitness score ranges from 0 to 1, with higher values indicating better task performance. The score is computed from the trained policy's behavior during the 20-second evaluation simulation. For the walking task, this primarily measures forward velocity; for the torso height task, it measures the achieved height.
A critical aspect of the evaluation is early termination on robot falls. If the robot's roll or pitch exceeds 10 degrees, the episode terminates immediately and the remaining timesteps receive zero reward. This termination condition serves dual purposes:
- Safety proxy — In real robotics, a fallen robot represents a catastrophic failure. Penalizing falls encourages the agent to design reward functions that prioritize stability.
- Evaluation efficiency — Terminating failed episodes early saves computation during the evaluation of poorly performing reward functions.
The 10-degree threshold is relatively strict, requiring policies that maintain tight postural control throughout the evaluation. This prevents reward functions that exploit high-risk, high-reward strategies where the robot achieves brief moments of good performance interspersed with frequent falls.
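The termination rule reduces to a small predicate. The 10-degree threshold is from this section; the function itself is an illustrative sketch, not the repository's code:

```python
FALL_THRESHOLD_DEG = 10.0  # roll or pitch beyond this ends the episode

def should_terminate(roll_deg: float, pitch_deg: float) -> bool:
    # Early termination: treat any tilt past the threshold as a fall,
    # after which the remaining timesteps receive zero reward.
    return abs(roll_deg) > FALL_THRESHOLD_DEG or abs(pitch_deg) > FALL_THRESHOLD_DEG

print(should_terminate(3.0, -4.0))   # False: within postural limits
print(should_terminate(12.5, 0.0))   # True: roll exceeds 10 degrees
```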
Implementation Structure
The robotics domain implementation lives in domains/genesis/ within the HyperAgents repository. The codebase is organized into several specialized modules:
Core Evaluation
evaluator.py (approximately 24KB) implements the EvaluatorManager class, which orchestrates multi-agent evaluation. This is the largest module in the robotics domain, handling parallel evaluation of multiple reward functions, resource allocation across GPUs, and result aggregation. The EvaluatorManager coordinates the full pipeline from reward function generation through PPO training to policy evaluation.
eval.py provides Hydra-based configuration management through an AgentFactory pattern. Hydra configuration allows researchers to easily swap between different agent configurations, evaluation parameters, and simulation settings without modifying code. This is particularly useful during self-improvement experiments where different meta agent modifications may require different evaluation configurations.
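A Hydra-style configuration for this setup might look like the following fragment. The file layout and key names are hypothetical, not taken from the repository; only the numeric values (4096 environments, dt=0.02, 4-second episodes, 20-second evaluation, 10-degree fall threshold) come from this page:

```yaml
# conf/eval.yaml (hypothetical layout and key names)
defaults:
  - agent: reward_designer     # swap agent variants without code changes

simulation:
  num_envs: 4096
  dt: 0.02
  episode_seconds: 4.0
  eval_seconds: 20.0

termination:
  fall_threshold_deg: 10.0
```

Hydra's defaults list is what makes the "swap configurations without modifying code" workflow possible: a different agent variant is selected with a command-line override rather than an edit.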
Simulation Utilities
genesis_utils.py provides simulation utility functions that interface between the HyperAgents evaluation code and the Genesis simulator API. This includes state normalization, observation construction, and reward computation wrappers.
gpu_selector.py handles GPU resource management, which is critical for the robotics domain. Unlike the other HyperAgents evaluation domains (polyglot, paper review, IMO grading) that primarily use CPU and API calls, robotics evaluation requires dedicated GPU resources for physics simulation and policy training. The GPU selector ensures that evaluation tasks are distributed across available GPUs without contention.
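The core of such a selector can be sketched as a pure function: given per-GPU free memory and a set of devices already claimed by running evaluations, pick the least-loaded free device. This is illustrative logic, not the actual gpu_selector.py implementation:

```python
# Illustrative GPU selection (not the actual gpu_selector.py code):
# pick the device with the most free memory, skipping busy ones, so
# concurrent evaluations land on different GPUs without contention.

def select_gpu(free_mem_mb, in_use):
    """free_mem_mb: list of free MB per GPU index; in_use: set of busy GPU ids."""
    candidates = [(mem, idx) for idx, mem in enumerate(free_mem_mb)
                  if idx not in in_use]
    if not candidates:
        raise RuntimeError("no free GPU available")
    # Prefer the most free memory for the 4096-environment simulation.
    return max(candidates)[1]

print(select_gpu([8000, 24000, 16000], in_use={1}))  # 2: GPU 1 busy, GPU 2 next best
```

In practice the free-memory figures would be queried from the driver (e.g. via NVML or `nvidia-smi`); the selection policy itself stays this simple.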
Training and Evaluation Pipelines
The genesis_eval/ and genesis_train/ directories contain the evaluation and training pipeline code respectively. The training pipeline implements PPO with the Genesis-specific environment interface, while the evaluation pipeline tests trained policies under the standardized evaluation conditions (20-second sim, random speed commands, early termination).
Reward Function Management
The reward/ directory contains reward function templates and management code. This includes base templates that the agent modifies, utilities for loading and validating generated reward functions, and the interface between generated Python code and the Genesis reward computation pipeline.
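Loading LLM-generated code safely requires at least a syntax and interface check before it reaches the training pipeline. The following is a sketch of what such validation might involve, using the `(state, action, env_config)` signature from the example below; it is not the actual code in reward/:

```python
# Illustrative validation for agent-generated reward code (a sketch,
# not the actual reward/ utilities): check that the source parses,
# defines the expected entry point, and has the expected arity.

import inspect

def load_reward_function(source: str, entry_point: str = "compute_reward"):
    compile(source, "<generated>", "exec")     # raises SyntaxError if invalid
    namespace = {}
    exec(source, namespace)
    fn = namespace.get(entry_point)
    if fn is None or not callable(fn):
        raise ValueError(f"generated code must define {entry_point}()")
    params = inspect.signature(fn).parameters
    if len(params) != 3:                       # (state, action, env_config)
        raise ValueError("expected signature (state, action, env_config)")
    return fn

good = "def compute_reward(state, action, env_config):\n    return 0.0"
fn = load_reward_function(good)
print(fn(None, None, None))  # 0.0
```

Validation like this catches syntactically broken or mis-shaped generations cheaply, before any GPU time is spent training against them.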
The environments/ directory defines the simulation environment configurations, including the Go2 robot model, terrain parameters, and physics properties.
# Simplified reward function structure
import torch

def compute_reward(state, action, env_config):
    # State includes: joint positions, velocities,
    # body orientation, foot contacts, torso height

    # Agent-generated reward terms
    forward_vel = state.base_lin_vel[:, 0]
    stability = -torch.abs(state.base_euler[:, :2]).sum(dim=1)
    energy = -torch.sum(action ** 2, dim=1)

    # Weighted combination (weights evolved by self-improvement)
    reward = 1.0 * forward_vel + 0.5 * stability + 0.01 * energy
    return reward
Docker GPU Support
The robotics evaluation requires GPU access inside Docker containers, which adds infrastructure complexity beyond the other HyperAgents domains. The Docker configuration uses nvidia/cuda as the base image, with specific CUDA library paths configured for Genesis compatibility. EGL (a rendering API) is configured for headless rendering, enabling simulation visualization and debugging without a display server, which is essential for automated evaluation on headless compute nodes.
The Docker setup ensures that the NVIDIA runtime is properly configured, CUDA libraries are accessible at known paths, and the Genesis simulator can access GPU resources for parallel simulation. This infrastructure code, while not intellectually glamorous, is essential for reproducible robotics evaluation and represents a significant engineering effort by the HyperAgents team.
Reward Hacking and Alignment Risks
When agents design their own reward functions, there is an inherent risk of reward hacking: the agent may discover reward functions that achieve high fitness scores through unintended behaviors that do not align with the true task intent. The evaluation metrics may not perfectly capture what constitutes "good" locomotion.
Reward hacking is a well-documented challenge in reinforcement learning. When the agent generates reward functions optimized for a fitness score, it may discover degenerate solutions. For example, a reward function for "walk forward" might produce a policy that drags itself along the ground, technically achieving forward motion but not the intended walking behavior. Similarly, "maximize torso height" could be achieved by the robot launching itself upward and immediately falling, producing a brief height peak before termination.
The HyperAgents evaluation mitigates this through the early termination condition (10-degree fall threshold) and the multi-episode evaluation with random speed commands. However, these mitigations are imperfect, and the robotics domain serves as a useful case study for the broader challenge of aligning self-improving AI systems with human intent. As the meta agent evolves increasingly sophisticated reward functions, the gap between optimizing the fitness metric and achieving the intended behavior may widen in unexpected ways.
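Why early termination blunts the "launch and fall" exploit can be seen with simple arithmetic over per-step scores. The numbers below are illustrative, not from the paper; the mechanism (zero reward after termination) is the one described above:

```python
# Once the robot falls, all remaining timesteps score zero, so a brief
# height spike averages out poorly against a stable, moderate height.
# Per-step scores here are illustrative, not from the paper.

STEPS = 200  # one 4-second episode at dt = 0.02

# Exploit: near-max score for 10 steps, then a fall ends the episode.
spike = [1.0] * 10 + [0.0] * (STEPS - 10)

# Stable policy: moderate score held for the full episode.
stable = [0.6] * STEPS

print(sum(spike) / STEPS)   # 0.05
print(sum(stable) / STEPS)  # 0.6
```

Averaged over the full episode, the exploit scores an order of magnitude worse than steady, stable behavior, which is exactly the incentive the 10-degree threshold is meant to create.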
This connects to the broader reward engineering research landscape, including RLHF (Reinforcement Learning from Human Feedback), reward modeling, and inverse reinforcement learning. The HyperAgents robotics domain demonstrates that self-improving agents can participate in reward design, but the alignment challenge, ensuring that optimized rewards correspond to desired behaviors, remains an open problem.
Connection to Broader Reward Design Research
The HyperAgents approach to reward design through LLM-based code generation represents a distinct paradigm in the reward engineering literature. Traditional approaches include manual reward engineering (expert-designed reward functions), inverse reinforcement learning (inferring rewards from demonstrations), and RLHF (learning rewards from human preference comparisons). The HyperAgents approach, generating reward functions as Python code from natural language descriptions, adds a new dimension: automated reward engineering through code generation and self-improvement.
This approach has several advantages over traditional methods. It produces interpretable reward functions (human-readable Python code), supports rapid iteration (new reward functions can be generated in seconds), and can leverage the vast programming knowledge embedded in LLMs. The self-improvement loop adds the ability to automatically refine reward engineering strategies based on empirical results, potentially discovering design patterns that human engineers would not consider.
However, it also inherits limitations of LLM-based code generation: sensitivity to prompting, potential for syntactically valid but semantically incorrect code, and dependence on the model's training data for reward design intuitions. The self-improvement process must navigate these challenges while exploring the vast space of possible reward functions.
Frequently Asked Questions
What physics simulator does HyperAgents use for robotics evaluation?
HyperAgents uses the Genesis physics simulator, which provides GPU-accelerated rigid-body simulation for robot learning. The robotics domain simulates a Go2 quadruped robot with 4096 parallel environments running simultaneously on a single GPU. Genesis was chosen for its speed (enabling rapid policy training during evaluation), accurate contact dynamics, and support for massively parallel simulation needed for PPO training.
How does HyperAgents evaluate reward functions for robotics?
Evaluation runs a 20-second simulation using a PPO-trained policy. Each episode lasts 4 seconds (200 steps at dt=0.02). Fitness scores range from 0 to 1, with early termination if the robot's roll or pitch exceeds 10 degrees (indicating a fall). Speed commands are randomly sampled per episode to test policy robustness across different locomotion targets. The full evaluation pipeline, from reward function generation through PPO training to policy testing, is orchestrated by the EvaluatorManager class.
What is the training vs test task for the robotics domain?
The training task is "walk forward" using the go2walking domain configuration. The meta agent optimizes the task agent's reward generation code based on walking performance. The test task is fundamentally different: zero-shot generation of reward functions to maximize torso height. This tests whether the agent has learned general reward engineering skills that transfer across objectives, rather than memorizing walking-specific reward templates.
How does HyperAgents handle GPU resources for robotics simulation?
The implementation includes a dedicated gpu_selector.py module for GPU resource management. Docker containers run with NVIDIA CUDA support using the nvidia/cuda base image, with configured CUDA library paths and EGL for headless rendering. This enables physics simulation and policy training on headless compute nodes without a display server, which is essential for automated evaluation during self-improvement iterations.