terminal-bench

Adapters

How to create a new adapter for a new benchmark.

The Terminal-Bench framework is quite flexible: most tasks that can be expressed as an instruction, a Docker environment, and a test script can be adapted to it. Our team is working to adapt many popular benchmarks to the Terminal-Bench framework, such as SWE-Bench, AppWorld, EvoEval, DevEval, and more. If you have a benchmark or a dataset of tasks that you want to adapt (e.g., to evaluate using Terminal-Bench's tb run harness), please follow the steps below to develop your adapter and get it merged into Terminal-Bench.

We'd love to help you adapt your benchmark!

If you are thinking about adapting your benchmark, please reach out to Lin Shi at ls2282@cornell.edu, who can guide you through the process, or join the #adapters-spam channel in our Discord.

Overview

Adapting a benchmark to Terminal-Bench is a straightforward process designed to ensure consistency and quality. This guide will walk you through everything you need to know. Here's a quick look at the steps:

  1. Fork and Prepare the Original Benchmark: First, you'll get familiar with the original benchmark, fork its repository, and run baseline experiments to establish reference performance metrics.
  2. Create the Terminal-Bench Adapter: Next, you'll write the adapter code that translates the original benchmark's tasks into the Terminal-Bench format. This is the core of the process and involves several substeps.
  3. Document and Submit: Finally, you'll document your adapter's usage and parity results in a README.md and submit your work through a pull request.

We'll break down each step in detail below. Let's get started!

The Adapter Development Workflow

Creating a high-quality adapter involves several key steps. Following this workflow ensures that the adapted benchmark is a faithful and reliable implementation of the original.

1. Fork and Prepare the Original Benchmark

Before writing any adapter code, it's crucial to deeply understand the original benchmark. Then you will need to fork the original benchmark's repository and write scripts that enable experiments comparable to Terminal-Bench's. Specifically, you will need to:

  • Fork the original benchmark's repository: Create a new branch for your adaptation work (e.g., terminal-bench-adapter).

  • Implement agents and models: The ultimate goal for this step is to allow fair comparisons between the original benchmark's harness and that of Terminal-Bench. Therefore, you may either:

    • Implement one or more CLI agents and models that are supported by Terminal-Bench (Codex, Claude-Code, Gemini-CLI, etc.) in the original benchmark, if none exist yet, or
    • Skip this step and instead implement a custom agent in Terminal-Bench, following the agent tutorial.

    There are no restrictions on how many or which agents you choose. However, the comparison results should be sufficient to show that the Terminal-Bench adapter faithfully reproduces the original benchmark's harness.

  • Establish a baseline: Run the original benchmark with the agents and models that you want to compare to Terminal-Bench. We recommend at least five runs for each to get stable average scores and standard deviations. For ease of comparison with Terminal-Bench, we recommend using the resolved rate as the scoring metric, i.e., the number of tasks resolved correctly divided by the total number of tasks (a short sketch of this calculation follows this list).

  • Write scripts and documentation: Write the scripts needed to run these Terminal-Bench-comparable experiments, and document the process in a README.md file.
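
As a concrete illustration of the resolved-rate metric and the run-to-run statistics above, here is a minimal Python sketch. The per-run outcomes are hypothetical placeholders; substitute whatever your benchmark's harness actually reports.

from statistics import mean, stdev

# Hypothetical per-run outcomes: True means a task was resolved correctly.
# Replace these with the outcomes reported by the original benchmark's harness.
runs = [
    [True, False, True, True],   # run 1
    [True, True, True, False],   # run 2
    [True, False, True, False],  # run 3
    [True, True, True, True],    # run 4
    [True, False, False, True],  # run 5
]

# Resolved rate per run: tasks resolved correctly / total tasks.
resolved_rates = [sum(outcomes) / len(outcomes) for outcomes in runs]

print(f"mean resolved rate: {mean(resolved_rates):.3f}")
print(f"standard deviation: {stdev(resolved_rates):.3f}")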

Keep a link to the forked repo, and you are good to go!

2. Create the Terminal-Bench Adapter

With a solid baseline, you can now create the adapter itself within the terminal-bench repository. Here are the main steps:

2.1 Fork the Terminal-Bench repository

Fork the Terminal-Bench repository and create a new branch for your adapter (e.g., {adapter-name}-adapter).

git clone https://github.com/{your-github-username}/terminal-bench.git
cd terminal-bench
git checkout -b {your-adapter-name}-adapter

2.2 Develop the adapter code

Develop the adapter under /adapters/{adapter-name}. You may refer to the existing adapters in the adapters/ directory and follow their patterns. The adapter's primary job is to parse the original benchmark's data and generate task directories in the standard Terminal-Bench format. Here is an example layout of a task directory:

task.yaml (Task description and configuration)
Dockerfile (Container environment for the task)
docker-compose.yaml (Docker compose configuration; usually use the default one)
run-tests.sh (Script to run the tests)
solution.sh (Oracle solution)
test_outputs.py (Evaluation script to allow pytest run)

Task templates and default files can be found in template-tasks. You may want to copy and modify them for your adapter.
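
After generation, a quick structural check helps catch task directories that are missing one of the files above. The sketch below is illustrative only: the check_task_dirs helper and the tasks/my-adapter path are hypothetical, and your tasks may legitimately contain additional files.

from pathlib import Path

# Files each generated task directory is expected to contain (per the layout above).
REQUIRED_FILES = [
    "task.yaml",
    "Dockerfile",
    "docker-compose.yaml",
    "run-tests.sh",
    "solution.sh",
    "test_outputs.py",
]

def check_task_dirs(output_root: str) -> None:
    """Print any generated task directory that is missing a required file."""
    for task_dir in sorted(Path(output_root).iterdir()):
        if not task_dir.is_dir():
            continue
        missing = [name for name in REQUIRED_FILES if not (task_dir / name).exists()]
        if missing:
            print(f"{task_dir.name}: missing {', '.join(missing)}")

# Example usage with a hypothetical output directory:
# check_task_dirs("tasks/my-adapter")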

2.3 Requirements and Tips for the Adapter Code

Requirements and Recommendations:

  • Your adapter code is used to generate task directories. A recommended structure of the adapter code is as follows:

    • adapter.py: The main adapter code that handles task generation and other adapter-specific logic.
    • run_adapter.py: The entry point of the adapter to create the task directories.
    • templates/: A directory to store the template files for the tasks.
    • parity_experiment.json: A file to store the parity experiment results, i.e., comparison between the original benchmark and the Terminal-Bench adapter. More details are provided in the Recording Parity Results section.
    • README.md: This is the last thing you would work on before the PR submission. More details are provided in the Document and Submit section.
  • By default, your adapter should create tasks in /tasks/<adapter-name>, but you should also allow users to specify a custom output path.

  • The adapter code should also support the following (a minimal run_adapter.py sketch illustrating both modes follows this list):

    • Temporarily cloning the source benchmark, preparing the tasks, and cleaning up the temporary clone.
    • Generating tasks from an existing, already-cloned benchmark repository without deleting it.
  • The templates directory stores the template files required for the tasks. For reference, we recommend including all of the files listed above (or found in template-tasks) in the templates directory; your adapter code then uses these templates to generate the actual task directories.

  • You need to ensure that the oracle solution passes with a 100% resolved rate.
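
To make these requirements concrete, here is a minimal sketch of what a run_adapter.py entry point might look like. Everything in it is illustrative: the BenchmarkAdapter class, the --source-path flag, and the source repository URL are hypothetical placeholders rather than an existing Terminal-Bench API; only --output-path mirrors the example commands in Step 2.4.

"""Hypothetical entry point that generates Terminal-Bench task directories."""

import argparse
import shutil
import subprocess
import tempfile
from pathlib import Path

# Hypothetical adapter class; in a real adapter this would live in adapter.py.
from adapter import BenchmarkAdapter

SOURCE_REPO_URL = "https://github.com/example-org/example-benchmark.git"  # placeholder


def main() -> None:
    parser = argparse.ArgumentParser(description="Generate Terminal-Bench task directories.")
    parser.add_argument(
        "--output-path",
        type=Path,
        default=Path("tasks/my-adapter"),  # default location; users can override it
        help="Directory in which to write the generated task directories.",
    )
    parser.add_argument(
        "--source-path",
        type=Path,
        default=None,
        help="Path to an already-cloned benchmark repo (skips the temporary clone).",
    )
    args = parser.parse_args()

    if args.source_path is not None:
        # Generate tasks from an existing clone without deleting it.
        adapter = BenchmarkAdapter(source_dir=args.source_path, templates_dir=Path("templates"))
        adapter.generate_tasks(args.output_path)
    else:
        # Temporarily clone the source benchmark, generate tasks, then clean up.
        tmp_dir = Path(tempfile.mkdtemp(prefix="benchmark-src-"))
        try:
            subprocess.run(["git", "clone", "--depth", "1", SOURCE_REPO_URL, str(tmp_dir)], check=True)
            adapter = BenchmarkAdapter(source_dir=tmp_dir, templates_dir=Path("templates"))
            adapter.generate_tasks(args.output_path)
        finally:
            shutil.rmtree(tmp_dir, ignore_errors=True)


if __name__ == "__main__":
    main()

The actual generation logic (parsing the source benchmark and rendering the files in templates/ into each task directory) would live in adapter.py, per the structure above.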

2.4 Register the Dataset

Once your adapter correctly generates tasks and you have verified the oracle solutions, you should add them to the official dataset repository.

  • Fork and clone the dataset repository:
    git clone https://github.com/{your-github-username}/terminal-bench-datasets.git
  • Add your tasks: Place the generated task directories under datasets/<your-adapter-name>/. If you followed the adapter development instructions above, you should be able to run the following commands to add your tasks to the dataset repository:
    cd terminal-bench/adapters/<your-adapter-name>
    
    # Use uv
    uv run run_adapter.py --output-path /path/to/terminal-bench-datasets/datasets/<your-adapter-name>
    
    # Use Python
    python run_adapter.py --output-path /path/to/terminal-bench-datasets/datasets/<your-adapter-name>
  • Pull Request: Create a pull request to the terminal-bench-datasets repository. It's recommended to link the original benchmark's GitHub repository in your PR.

Then you should navigate to the terminal-bench repository (not the dataset repo!) and add a new entry to the registry.json file in the root. For example:

[
    // existing entries...
    {
        "name": "<dataset-name>", // The same as your adapter name
        "version": "head", // use "head" for the latest version
        "description": "<description>", // A brief description. Original benchmark: [URL]. Dataset PR: [URL]
        "terminal_bench_version": ">=0.2.4", // We recommend pinning to ">=0.2.4" to ensure compatibility with recent features.
        "github_url": "https://github.com/laude-institute/terminal-bench-datasets.git",
        "dataset_path": "datasets/<dataset-name>", // The same as your adapter name
        "branch": "main",
        "commit_hash": "head", // Use "head" for the latest commit
        "task_id_subset": null // Use null for the entire dataset
    }
]

You can optionally specify the subset of task IDs. For instance:

"task_id_subset": [
  "task-1", 
  "task-2", 
  "task-3",
  ...
]

Run the following command to ensure your registry is valid:

uv run tb datasets list --local-registry-path registry.json

Then, you may either create a new PR for the registry update, or wait until the adapter PR is created and include the registry update together with the adapter code and scripts.

2.5 Run Harness and Parity Experiments

After you have developed the adapter code, you can run the harness and parity experiments to verify the adapter.

There are two ways to run the Terminal-Bench harness on your adapter's tasks:

  • Run the adapter code for task preparation; then run the harness by specifying --dataset-path using tb run. For example:
    tb run \
      --dataset-path /path/to/your/adapter/tasks \
      --agent oracle
    This is how you would run it locally while developing the adapter.
  • Directly run the harness by specifying --dataset using tb run. For example:
    tb run \
      --dataset name==x.y.z \
      --agent claude-code \
      --model claude-opus-4-20250514
    This is how you would run it when the dataset has been added to the registry.

You should include the instructions for running both ways in the README.md for your adapter.

Parity Experiments: You need to run parity experiments to verify the adapter. Use the Terminal-Bench harness (tb run) to run the same agents and models you used in Step 1 against your adapter tasks. Run them five times each to gather average scores and standard deviations. These results should be comparable to the baseline (i.e., the scores you got in Step 1 using the original benchmark's harness).

2.6 Record Parity Results

To formally store and track the performance parity between the original benchmark and your adapter, create a parity_experiment.json file in your adapter's directory. A typical file would look like this:

[
  {
    "adapter_name": <adapter-name>,
    "agent": <agent-name>,
    "model": <model-name>,
    "date": <date>,
    "notes": <notes>, // e.g., "n tasks; averaged over 5 trials"
    "forked_repo": <forked-repo-link>,
    "adapter_pr": <adapter-pr-link>, // You can input the link AFTER the adapter PR is created.
    "dataset_pr": <dataset-pr-link>,
    "metrics": [
      {
        "benchmark_name": <original-benchmark-name>,
        "metric_name": <metric-name>,
        "value": <value>,
        "std_error": <std_error>
      },
      {
        "benchmark_name": "Terminal-Bench Adapter",
        "metric_name": <metric-name>,
        "value": <value>,
        "std_error": <std_error>
      }
    ]
  },
  ...
]
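
If you prefer to generate this file programmatically rather than write it by hand, a sketch like the following works; every value in it is a placeholder (agent, model, date, scores, and links), and the field names simply mirror the schema above.

import json

# Placeholder values throughout; substitute the numbers and links from your own experiments.
entry = {
    "adapter_name": "my-adapter",
    "agent": "claude-code",
    "model": "claude-opus-4-20250514",
    "date": "2025-01-01",
    "notes": "50 tasks; averaged over 5 trials",
    "forked_repo": "https://github.com/your-github-username/original-benchmark",
    "adapter_pr": None,  # fill in after the adapter PR is created
    "dataset_pr": "https://github.com/laude-institute/terminal-bench-datasets/pull/...",
    "metrics": [
        {
            "benchmark_name": "Original Benchmark",
            "metric_name": "resolved_rate",
            "value": 0.62,
            "std_error": 0.03,
        },
        {
            "benchmark_name": "Terminal-Bench Adapter",
            "metric_name": "resolved_rate",
            "value": 0.60,
            "std_error": 0.04,
        },
    ],
}

with open("parity_experiment.json", "w") as f:
    json.dump([entry], f, indent=2)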

You will also need to include the parity experiment results in the README.md of your adapter. For example, you can add a table like the following:

| Agent | Model | Original Benchmark Performance | Terminal-Bench Adapter Performance |
|-------|-------|--------------------------------|------------------------------------|
| claude-code | claude-4-opus | Score ± Std | Score ± Std |
| codex | o4-mini | Score ± Std | Score ± Std |
| ... | ... | ... | ... |

and then include the following links:

  • The link to the original benchmark's GitHub repository.
  • The link to the forked repo in Step 1.
  • The link to the dataset PR in Step 2.4.
  • [Optional] The link to the adapter PR, which you can add with another commit AFTER the adapter PR is created.

3. Document and Submit

In the README.md of your adapter, you should include at least the following information:

  • Adapter overview: A brief description of the adapter, including a link to and introduction of the original benchmark, the adapter structure, and a link to the dataset.
  • Parity experiment: The parity experiment results, showing comparable scores between the original benchmark and the Terminal-Bench harness (Step 2.6). Link to the forked repo in Step 1 to show how to reproduce/run experiments using the original harness.
  • Harness using terminal-bench: Instructions on how to run the harness using tb run, as specified in Step 2.5.
  • Usage of the adapter: Instructions on how to use the adapter to create tasks, as specified in Step 2.3.
  • Adaptation details: Relevant processes and notes on the adaptation (e.g., prompt modifications, evaluation metrics, file copies).

Finally, draft a PR in the main terminal-bench repository that includes your adapter code, template files, parity_experiment.json, documentation, and the updated registry.json. You can create the PR with placeholder links first, and then update the README.md and parity_experiment.json files with the actual PR link in a subsequent commit.

4. Other Useful Resources

  • The task creation tutorial walks through creating a simple Terminal-Bench task, which helps you understand the structure of a Terminal-Bench task before creating your adapters.
  • The datasets guide provides instructions on how to create a new task set and add it to our registry.
  • The agent tutorial provides instructions on how to create and use your own custom agent in Terminal-Bench.

5. Getting Help

Thank you for your interest in Terminal-Bench and in building an adapter! If you have any questions, please ask in the #adapters-spam channel in our Discord.