HyperAgents on Polyglot Programming Tasks

Based on arXiv:2603.19461

The polyglot programming domain is one of four evaluation domains in the HyperAgents framework. It tests whether self-improving AI agents can generate correct code patches across six programming languages, evaluated under strict pass@1 conditions with no test feedback visible to the agent. This domain offers the strongest objective feedback signal of all HyperAgents evaluation domains, making it an ideal testbed for measuring genuine self-improvement in code generation.

Overview of the Polyglot Code Domain

Within the HyperAgents research framework developed by Jenny Zhang and colleagues at Meta, the polyglot programming domain provides a rigorous testbed for evaluating AI self-improvement on code generation. The task format is straightforward: given a code repository and a natural language instruction, the agent must produce a code patch that passes an associated test suite. Unlike interactive coding assistants that receive iterative feedback, the HyperAgents polyglot evaluation provides no ground-truth test output to the agent during generation, requiring the system to internalize programming patterns and testing conventions across multiple languages.

This domain is particularly significant because code generation provides an objective, binary feedback signal: either the tests pass or they do not. This clarity of evaluation makes it an excellent substrate for self-improvement, as the meta agent can directly correlate its modifications with measurable performance changes. The polyglot aspect adds further complexity, demanding that agents generalize across language-specific idioms, build systems, and testing frameworks.

The research team, led by Jenny Zhang, Bingchen Zhao, Wannan Yang, and others at Meta, selected Exercism repositories as the task source. Exercism is a well-known programming practice platform that provides structured exercises with test suites in dozens of languages. For the HyperAgents benchmark, the team curated tasks from six languages, ensuring a diverse and representative evaluation surface. The full details are documented in the paper published as arXiv:2603.19461 (March 2026).

Task Scale and Structure

The polyglot benchmark is split into 60 training tasks and 165 test tasks. Training tasks are used during the self-improvement loop: the meta agent observes how its modifications to the task agent affect performance on these tasks and uses that feedback to guide further optimization. The 165 test tasks are held out for final evaluation, measuring whether improvements generalize beyond the training distribution.

Each task consists of a repository snapshot containing source files, configuration files (such as Cargo.toml for Rust or package.json for JavaScript), and a test suite. The agent receives the repository contents along with a natural language instruction describing what the code should do. It must then produce a patch, typically modifying or creating source files, that causes all tests to pass.
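The task structure above can be pictured as a simple record. This is an illustrative sketch only: the class and field names below are assumptions, not the repository's actual data model.

```python
from dataclasses import dataclass

@dataclass
class PolyglotTask:
    """Hypothetical shape of one benchmark task; field names are illustrative."""
    task_id: str       # e.g. "rust/acronym"
    language: str      # one of: python, rust, go, js, cpp, java
    instruction: str   # natural language description of the required behavior
    repo_files: dict   # path -> file contents for the repository snapshot
    test_command: str  # language-specific test invocation

task = PolyglotTask(
    task_id="rust/acronym",
    language="rust",
    instruction="Convert a phrase to its acronym.",
    repo_files={"Cargo.toml": '[package]\nname = "acronym"', "src/lib.rs": ""},
    test_command="cargo test -- --include-ignored",
)
```

The agent sees the instruction and the repository snapshot, but never the test suite's output.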

To manage computational costs during development and experimentation, the benchmark provides staged evaluation subsets defined in subsets/small.json and subsets/medium.json. These subsets allow researchers to run quick validation on a smaller set of tasks before committing to a full evaluation run, which can be time-intensive given the Docker-based execution pipeline.
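Loading such a subset might look like the sketch below. The schema is an assumption: we treat each subset file as a flat JSON list of task identifiers, which may differ from the repository's actual format.

```python
import json
import tempfile
from pathlib import Path

def load_subset(path: str) -> list:
    """Load a staged-evaluation subset, assumed to be a flat JSON list
    of task identifiers (the actual schema is not published)."""
    return json.loads(Path(path).read_text())

# Demonstration with an in-memory stand-in for subsets/small.json:
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(["python/hamming", "go/leap"], f)

subset = load_subset(f.name)
```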

Language-Specific Test Commands

Each supported language uses its native testing infrastructure. The HyperAgents implementation defines specific test commands for each language, carefully configured to capture comprehensive results:

| Language | Test Command | Notes |
| --- | --- | --- |
| Python | `pytest --tb=short -rA` | Full reports with all test outcomes |
| Rust | `cargo test -- --include-ignored` | Includes tests marked `#[ignore]` |
| Go | `go test ./...` | Recursive test discovery across packages |
| JavaScript | `npm test` | With xtest replacement to enable skipped tests |
| C++ | `cmake . && make && ctest` | CMake build pipeline followed by test execution |
| Java | `gradle test` | Standard Gradle test task execution |

A notable detail is the JavaScript handling: the evaluation harness replaces xtest calls (which Jest treats as skipped tests) with test calls, ensuring that all test cases actually execute. Similarly, Rust tests marked with #[ignore] are explicitly included via --include-ignored. These decisions reflect the research team's goal of comprehensive, no-shortcuts evaluation: the agent's code must satisfy every test case, not just the non-skipped subset.
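The xtest replacement is conceptually a simple rewrite of the spec files before execution. The sketch below shows one minimal way to do it; the real harness's implementation is not published and may handle more variants.

```python
import re

def enable_skipped_jest_tests(source: str) -> str:
    """Replace Jest's xtest (skipped test) calls with test calls so that
    every case actually executes. A minimal sketch, not the harness's
    actual implementation."""
    return re.sub(r"\bxtest\(", "test(", source)

spec = "xtest('handles empty input', () => { expect(f('')).toBe(''); });"
enabled = enable_skipped_jest_tests(spec)
```

The word boundary in the pattern avoids rewriting identifiers that merely end in `xtest`.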

The Evaluation Metric: Strict pass@1

The HyperAgents polyglot domain uses pass@1 as its primary evaluation metric. This means the agent gets exactly one attempt to produce a correct patch. If the generated code does not pass all tests on the first try, the task is marked as failed. Critically, the agent does not receive any test output or feedback during generation. It cannot iteratively debug against test failures, as some interactive coding benchmarks allow.

This is notably stricter than the pass@2 metric used on some public coding leaderboards, where agents may generate two candidate solutions and succeed if either one passes. The pass@1 constraint better reflects real-world deployment scenarios where a self-improving system must produce reliable code modifications without human-in-the-loop debugging. It also provides a cleaner signal for measuring self-improvement: gains in pass@1 directly indicate that the system has learned to generate more reliable code on the first attempt.
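Because every task allows exactly one attempt, the metric reduces to a simple fraction of single-attempt successes:

```python
def pass_at_1(results: list) -> float:
    """pass@1 over single-attempt outcomes: the fraction of tasks whose
    one generated patch passed the full test suite."""
    if not results:
        return 0.0
    return sum(results) / len(results)

# e.g. 3 of 10 tasks solved on the first (and only) attempt:
score = pass_at_1([True, False, True, False, False,
                   True, False, False, False, False])  # 0.3
```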

Implementation Architecture

The polyglot domain implementation lives in the polyglot/ directory of the HyperAgents repository. The codebase is organized around several key modules that handle different aspects of the benchmarking pipeline:

Core Modules

benchmark.py (approximately 32KB) contains the core benchmarking logic. This is the largest module in the polyglot domain and orchestrates the end-to-end evaluation pipeline: loading task specifications, invoking the agent, collecting generated patches, and running tests inside Docker containers. It handles retry logic, timeout management, and result aggregation across all six languages.

harness.py implements the evaluation harness that wraps around the benchmarking logic, providing a standardized interface for the HyperAgents framework to invoke polyglot evaluations. It translates between the framework's generic evaluation API and the language-specific test execution paths.

run_evaluation.py is the main entry point for running evaluations. It parses command-line arguments, configures the evaluation environment, and orchestrates the full evaluation run including result serialization.

Docker Infrastructure

docker_build.py builds Docker images for each programming language. The build process is staged into three layers:
- a base layer with the operating system and shared tooling,
- an environment layer with the language toolchain and pre-installed dependencies, and
- an instance layer containing the individual task's repository snapshot.

This layered approach minimizes rebuild times: modifying a single task only requires rebuilding the instance layer, while the base and environment layers are cached. docker_utils.py provides container management utilities including lifecycle management, output capture, and resource cleanup.
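The caching benefit can be sketched as follows. The tag names and Dockerfile contents here are illustrative assumptions, not the repository's actual naming scheme: the point is that the instance layer is a thin `FROM` on top of a cached environment image.

```python
def instance_build_plan(task_id: str, language: str, context_dir: str):
    """Sketch of the third (instance) layer build: the base and environment
    images are assumed to exist already and be cached, so only this thin
    task-specific layer needs rebuilding when a task changes."""
    env_tag = f"polyglot-env-{language}"                          # cached layer 2
    instance_tag = "polyglot-inst-" + task_id.replace("/", "-")   # layer 3
    dockerfile = f"FROM {env_tag}\nCOPY . /workspace\nWORKDIR /workspace\n"
    command = ["docker", "build", "-t", instance_tag, "-f", "-", context_dir]
    return dockerfile, command

dockerfile, command = instance_build_plan("go/leap", "go", "./tasks/go/leap")
```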

Test Specifications

test_spec.py defines the test specifications for each task, mapping task identifiers to their repository structure, test commands, and expected behavior. constants.py contains the crucial MAP_REPO_VERSION_TO_SPECS mapping, which associates each Exercism repository version with its specific build configuration, test command, and language-specific setup requirements.
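An illustrative shape for that mapping is sketched below. The key format and field names are assumptions based on the description above; only the test commands themselves come from the benchmark's documented configuration.

```python
# Hypothetical shape of the MAP_REPO_VERSION_TO_SPECS constant;
# keys and field names are illustrative, test commands are the real ones.
MAP_REPO_VERSION_TO_SPECS = {
    ("python", "1.0"): {"test_cmd": "pytest --tb=short -rA", "setup": []},
    ("rust", "1.0"): {"test_cmd": "cargo test -- --include-ignored", "setup": []},
    ("go", "1.0"): {"test_cmd": "go test ./...", "setup": []},
}

def spec_for(language: str, version: str) -> dict:
    """Look up the build/test configuration for one repository version."""
    return MAP_REPO_VERSION_TO_SPECS[(language, version)]
```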

```python
# Simplified view of the evaluation flow
from polyglot.benchmark import run_benchmark
from polyglot.docker_build import build_images
from polyglot.test_spec import load_specs

# Build Docker images for all languages
build_images(languages=["python", "rust", "go", "js", "cpp", "java"])

# Load task specifications
specs = load_specs("subsets/medium.json")

# Run evaluation with pass@1
results = run_benchmark(specs, attempts=1, show_test_output=False)
```

Docker-Based Isolation

Security and reproducibility are first-class concerns in the polyglot evaluation pipeline. Since the agent generates arbitrary code that gets executed, the system must prevent generated code from escaping its sandbox, accessing the host filesystem, or interfering with other evaluation tasks running concurrently.

Each evaluation task runs inside its own Docker container with restricted network access, limited filesystem mounts, and constrained resource allocations. The three-layer image architecture (base, environment, instance) ensures that task dependencies are pre-installed and version-locked, eliminating non-determinism from dependency resolution during test execution.
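Expressed as `docker run` flags, those restrictions might look like the sketch below. The exact flags HyperAgents uses are not published; these are standard Docker options that match the constraints described above.

```python
def sandboxed_run_command(image: str, test_cmd: str) -> list:
    """Build a docker run invocation with the kinds of restrictions
    described above: no network, capped memory/CPU, read-only root
    filesystem with a writable tmpfs workspace. Flag choices are
    illustrative, not the framework's actual configuration."""
    return [
        "docker", "run", "--rm",
        "--network=none",              # restricted network access
        "--memory=2g", "--cpus=2",     # constrained resource allocations
        "--read-only",                 # limited filesystem mounts
        "--tmpfs", "/workspace:rw,exec",
        image, "sh", "-c", test_cmd,
    ]

cmd = sandboxed_run_command("polyglot-inst-go-leap", "go test ./...")
```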

This isolation model also enables parallel evaluation: multiple tasks across different languages can run simultaneously in separate containers, significantly reducing total evaluation time. The docker_utils.py module manages container lifecycle, including graceful shutdown and cleanup to prevent resource leaks during long evaluation runs.
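Because each container is independent, the dispatch logic can be as simple as a thread pool (threads suffice when the work is I/O-bound waiting on Docker). The sketch below uses a dummy evaluator so it runs standalone; the function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_task(task_id: str):
    """Stand-in for launching one container and running its test suite;
    a dummy pass/fail rule keeps the sketch runnable."""
    return task_id, task_id.endswith("leap")

def evaluate_parallel(task_ids, workers: int = 8) -> dict:
    # Containers are isolated, so tasks across languages can run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(evaluate_task, task_ids))

results = evaluate_parallel(["go/leap", "rust/acronym", "python/leap"])
```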

Connection to DGM and Self-Improvement

The polyglot domain has a direct lineage from the original Darwin Gödel Machine (DGM) research. In the DGM paper, the system improved from 14.2% to 30.7% on polyglot programming tasks, demonstrating that automated self-improvement could meaningfully boost code generation capabilities. DGM-H, the HyperAgents variant, extends this approach by enabling the meta agent to modify not only the task agent but also itself, creating a fully self-referential improvement loop.

The code domain is particularly well-suited for self-improvement because of its strong, objective feedback signal. Unlike domains such as paper review (where "correctness" is subjective), code either passes its tests or it does not. This binary outcome allows the meta agent to make confident assessments about whether its modifications helped or hurt, enabling more aggressive optimization strategies.

The connection to SWE-bench further contextualizes these results. On SWE-bench, DGM improved from 20% to 50% resolve rate, demonstrating that self-improvement techniques transfer to real-world software engineering tasks beyond the structured Exercism exercises. The HyperAgents framework generalizes these findings across its four evaluation domains, using polyglot as the primary code-domain benchmark.

Staged Evaluation and Subsets

Full evaluation of 225 tasks across six languages is computationally expensive, especially when running multiple self-improvement iterations. To address this, the benchmark provides subset configurations:
- subsets/small.json: a small subset for quick validation runs
- subsets/medium.json: a larger subset for intermediate checks before a full run

These subsets are curated using curate_subsets.py to ensure representative coverage across languages and difficulty levels. During self-improvement, the meta agent typically evaluates on a subset and only performs full-scale evaluation for final reporting.
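One plausible curation strategy is stratified sampling per language, sketched below. This is an assumption about how curate_subsets.py might achieve representative coverage, not its actual algorithm; task IDs are assumed to have the form "language/exercise".

```python
import random

def curate_subset(task_ids, per_language: int, seed: int = 0) -> list:
    """Sample a fixed number of tasks per language so a subset covers all
    six languages. Illustrative only; the real script may also balance
    by difficulty."""
    rng = random.Random(seed)  # fixed seed for reproducible subsets
    by_lang = {}
    for tid in task_ids:
        by_lang.setdefault(tid.split("/")[0], []).append(tid)
    subset = []
    for lang in sorted(by_lang):
        pool = by_lang[lang]
        subset.extend(rng.sample(pool, min(per_language, len(pool))))
    return subset

subset = curate_subset(["go/a", "go/b", "rust/a", "rust/b", "python/a"],
                       per_language=1)
```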

Data Contamination Considerations

Important Caveat

Exercism repositories are publicly available and may appear in LLM pretraining data. This means that some performance on polyglot tasks could reflect memorization rather than genuine code generation ability. The HyperAgents team acknowledges this risk, and the evaluation design (withholding test output, using pass@1) partially mitigates it by requiring the agent to produce correct solutions without iterative debugging against known tests.

The potential for training data contamination is a well-known challenge in code generation benchmarks. Because Exercism exercises are open source and widely used in programming education, large language models may have encountered solutions during pretraining. This does not invalidate the self-improvement results, as the key measurement is the delta between the initial agent and the self-improved agent, but it does affect the interpretation of absolute performance numbers.

Researchers using the HyperAgents polyglot benchmark should be aware of this limitation and consider supplementary evaluations on private or newly created test sets when making claims about generalization capability.

How Self-Improvement Works in the Code Domain

The self-improvement loop in the polyglot domain follows the general HyperAgents architecture. The meta agent observes the task agent's performance on training tasks, proposes modifications to the task agent's code (or its own code, in the DGM-H variant), executes the modified agent on the same or different tasks, and retains modifications that improve performance.
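The retain-if-better loop described above can be sketched as a simple hill climb. This is a deliberately simplified single-lineage view (the actual framework maintains an archive of agent variants rather than one incumbent), with toy stand-ins so the sketch runs.

```python
import random

def self_improvement_loop(agent, propose, evaluate, iterations: int = 10):
    """Propose a modification, evaluate it on training tasks, and retain
    it only if the score improves. A simplified sketch of the loop, not
    the framework's actual (archive-based) algorithm."""
    best_score = evaluate(agent)
    for _ in range(iterations):
        candidate = propose(agent)       # meta agent proposes a modification
        score = evaluate(candidate)      # e.g. pass@1 on the training tasks
        if score > best_score:           # retain only measurable improvements
            agent, best_score = candidate, score
    return agent, best_score

# Toy stand-ins: an "agent" is a single number, proposals perturb it,
# and the score peaks when it reaches 1.0.
rng = random.Random(0)
agent, score = self_improvement_loop(
    0.2,
    propose=lambda a: a + rng.uniform(-0.1, 0.2),
    evaluate=lambda a: 1 - abs(1 - a),
)
```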

In the code domain, this process benefits from several properties unique to programming tasks:
- An objective, binary feedback signal: tests either pass or fail, with no subjective judgment required.
- Fully automated evaluation: Docker-isolated test runs need no human in the loop.
- Parallelizable scoring across containers, which keeps each self-improvement iteration affordable.

The research team reports that the meta agent discovers strategies such as improved prompt formatting, better code structuring conventions, and more effective use of language-specific idioms, all without explicit human guidance on these techniques.

Frequently Asked Questions

What programming languages does the HyperAgents polyglot benchmark cover?

The benchmark covers six languages: Python (evaluated with pytest), Rust (cargo test with ignored tests included), Go (go test ./...), JavaScript (npm test with xtest replacement), C++ (CMake + make build pipeline), and Java (Gradle test). All tasks are sourced from Exercism repositories, providing a standardized and well-maintained exercise format across languages.

How does HyperAgents evaluate code correctness in the polyglot domain?

HyperAgents uses a strict pass@1 metric. The agent generates a single code patch, and it must pass all tests on the first attempt. Crucially, no ground-truth test feedback is visible to the agent during generation, making this stricter than the leaderboard pass@2 metric used in some other code benchmarks. This design choice better reflects real-world deployment where self-improving agents must produce reliable code without iterative debugging.

Why does HyperAgents use Docker for polyglot code evaluation?

Docker provides sandboxed, reproducible evaluation environments essential for safe execution of agent-generated code. Each language has its own three-layer image stack (base, environment, instance), ensuring dependency isolation, consistent toolchain versions, and protection against generated code escaping the sandbox. This also enables parallel evaluation across languages and tasks.

How did DGM-H improve over DGM on polyglot programming tasks?

The original DGM achieved improvements from 14.2% to 30.7% on polyglot tasks. DGM-H, the HyperAgents architecture, extends this by enabling fully self-referential modification where the meta agent can optimize both the task agent and itself. This creates a deeper optimization loop that discovers novel code generation strategies beyond what hand-crafted meta agents can achieve. The paper (arXiv:2603.19461) by Jenny Zhang et al. provides full quantitative results.
