
Adapters

How to create a new adapter for a new benchmark.

The Terminal-Bench framework is flexible: most tasks that can be expressed as an instruction, a Docker environment, and a test script can be adapted to it. Our team is working to adapt many popular benchmarks to Terminal-Bench, including SWE-Bench, AppWorld, EvoEval, and DevBench.

An adapter does not have to follow a fixed interface: as long as it runs and adapts the tasks correctly, that is all that matters. Once it does, follow the instructions in our datasets guide to create a PR that adds the task set to our registry.

Adding a new adapter

Creating the adapter

To create an adapter, reference an existing adapter in the adapters/ folder and follow that pattern.

We plan to standardize the adapter interface in the future.
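
Until then, a rough sketch of what an adapter does may help. The example below assumes each source-benchmark item can be reduced to an instruction, an environment definition, and a test script, and writes one folder per task. The `AdaptedTask` class, the `write_task` function, and the output file names are hypothetical placeholders for illustration, not part of Terminal-Bench; copy the actual layout from an existing adapter in adapters/.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AdaptedTask:
    """One source-benchmark item expressed as instruction, environment, and tests."""

    task_id: str
    instruction: str  # natural-language description of what the agent must do
    dockerfile: str   # contents of the environment definition
    test_script: str  # script that decides whether the task passed


def write_task(task: AdaptedTask, output_root: Path) -> Path:
    """Write one task into its own folder and return that folder.

    The file names are placeholders chosen for this sketch (an assumption),
    not the layout Terminal-Bench requires.
    """
    task_dir = output_root / task.task_id
    task_dir.mkdir(parents=True, exist_ok=True)
    (task_dir / "instruction.txt").write_text(task.instruction)
    (task_dir / "Dockerfile").write_text(task.dockerfile)
    (task_dir / "run-tests.sh").write_text(task.test_script)
    return task_dir
```

An adapter built this way would loop over the source benchmark, build one `AdaptedTask` per item, and call `write_task` for each.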

Run the adapter

Run your adapter and confirm that every task passes with its oracle solution. Then evaluate the tasks with agents or models similar to those used in the original benchmark paper, and check that the results are similar to the reported ones.
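
The two checks above can be sketched as a small script. It assumes you have already collected per-task pass/fail results into a dictionary. The function names, the task IDs, the numbers, and the 0.05 tolerance are all illustrative assumptions; the guide does not define how close "similar" has to be.

```python
def resolve_rate(results: dict[str, bool]) -> float:
    """Return the fraction of tasks marked as passed."""
    if not results:
        raise ValueError("no results to score")
    return sum(results.values()) / len(results)


def failing_tasks(results: dict[str, bool]) -> list[str]:
    """Return the IDs of tasks that did not pass, sorted for stable output."""
    return sorted(task_id for task_id, passed in results.items() if not passed)


def is_comparable(original: float, adapted: float, tolerance: float = 0.05) -> bool:
    """Check whether two scores differ by at most `tolerance` (an assumed threshold)."""
    return abs(original - adapted) <= tolerance


# Illustrative placeholder data, not real results.
oracle_run = {"task-1": True, "task-2": True, "task-3": True}
assert failing_tasks(oracle_run) == [], "every task should pass with its oracle solution"

agent_run = {"task-1": True, "task-2": False, "task-3": True}
reported_rate = 0.70  # placeholder for the figure reported in the original paper
adapted_rate = resolve_rate(agent_run)
print(f"adapted: {adapted_rate:.2f}, reported: {reported_rate:.2f}, "
      f"comparable: {is_comparable(reported_rate, adapted_rate)}")
```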

Create a PR to the registry repo

Once you have run the adapter and ensured the tasks pass with the oracle solutions, fork the registry repo and create a PR with the tasks.

git clone https://github.com/laude-institute/terminal-bench-datasets.git

In your PR description, please include the following information:

  • The name of the dataset
  • The link to the original dataset
  • The link to the adapter
  • The performance from the original benchmark paper
  • The performance from the Terminal-Bench framework
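
If you prepare PRs for several datasets, a small helper can make sure none of the five items above is forgotten. This is only a convenience sketch: the `build_pr_description` function and its parameter names are hypothetical, and the output format is an assumption rather than a template the registry requires.

```python
def build_pr_description(
    dataset_name: str,
    original_dataset_url: str,
    adapter_url: str,
    original_performance: str,
    terminal_bench_performance: str,
) -> str:
    """Format the five requested items as a list, rejecting empty values."""
    fields = {
        "The name of the dataset": dataset_name,
        "The link to the original dataset": original_dataset_url,
        "The link to the adapter": adapter_url,
        "The performance from the original benchmark paper": original_performance,
        "The performance from the Terminal-Bench framework": terminal_bench_performance,
    }
    missing = [label for label, value in fields.items() if not value.strip()]
    if missing:
        raise ValueError(f"missing PR description fields: {missing}")
    return "\n".join(f"- {label}: {value}" for label, value in fields.items())
```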

The structure of your dataset folder should be as follows:

Register the dataset in the Terminal-Bench registry

After you create your PR to the registry repo, add one or more entries to the registry.json file in the root of the Terminal-Bench repository (not the registry repo!).

For now, we recommend just adding the latest version of the dataset rather than a specific version.

[
    // existing entries...
    {
        "name": "<dataset-name>",
        "version": "head", // use "head" for the latest version
        "description": "<description>",
        "terminal_bench_version": "latest",
        "github_url": "https://github.com/laude-institute/terminal-bench-datasets.git",
        "dataset_path": "datasets/<dataset-name>",
        "branch": "main",
        "commit_hash": "head", // use "head" for the latest commit
        "task_id_subset": null // use null for the entire dataset
    }
]
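
If you would rather add the entry with a script than by hand, the following sketch builds an entry with the field values shown above and appends it to a local registry.json. The `make_entry` and `add_entry` helpers and the duplicate check are assumptions made for this example, not part of Terminal-Bench. Note that the `//` comments in the snippet above are annotations only: Python's `json` module rejects them, so the sketch assumes the real file contains none.

```python
import json
from pathlib import Path

REGISTRY_REPO_URL = "https://github.com/laude-institute/terminal-bench-datasets.git"


def make_entry(name: str, description: str) -> dict:
    """Build a registry entry that tracks the latest version of a dataset."""
    return {
        "name": name,
        "version": "head",  # "head" means the latest version
        "description": description,
        "terminal_bench_version": "latest",
        "github_url": REGISTRY_REPO_URL,
        "dataset_path": f"datasets/{name}",
        "branch": "main",
        "commit_hash": "head",  # "head" means the latest commit
        "task_id_subset": None,  # serialized as null: the entire dataset
    }


def add_entry(registry_path: Path, entry: dict) -> None:
    """Append an entry to registry.json, refusing duplicate name/version pairs."""
    entries = json.loads(registry_path.read_text())
    for existing in entries:
        if (existing.get("name"), existing.get("version")) == (entry["name"], entry["version"]):
            raise ValueError(f"{entry['name']} ({entry['version']}) is already registered")
    entries.append(entry)
    registry_path.write_text(json.dumps(entries, indent=4) + "\n")
```

For example, `add_entry(Path("registry.json"), make_entry("<dataset-name>", "<description>"))` run from the root of the Terminal-Bench repository would append one entry.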

Make sure to link your dataset PR in the registry PR description.
