Adapters
How to create a new adapter for a new benchmark.
The Terminal-Bench framework is flexible: most tasks that can be expressed as an instruction, a Docker environment, and a test script can be adapted to it. Our team is working to adapt many popular benchmarks to the Terminal-Bench framework, including SWE-Bench, AppWorld, EvoEval, and DevBench.
To create an adapter, reference an existing adapter in the adapters/ folder and follow that pattern. What ultimately matters is that the adapter runs and adapts the tasks correctly. From there, follow the instructions in our datasets guide to create a PR that adds the task set to our registry.
Adding a new adapter
Creating the adapter
To create an adapter, reference an existing adapter in the adapters/ folder and follow that pattern.
We plan to standardize the adapter interface in the future.
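Because the interface is not yet standardized, the following is only a rough sketch of the general shape an adapter tends to take: read records from the source benchmark and write one task directory per record, each holding an instruction, a Docker environment, and a test script. The `SourceRecord` schema, the function names, and the output file names are all hypothetical placeholders, not the Terminal-Bench layout; copy the real layout from an existing adapter in the adapters/ folder.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SourceRecord:
    """One task from the original benchmark (hypothetical schema)."""
    task_id: str
    instruction: str
    dockerfile: str
    test_script: str


def write_task(record: SourceRecord, output_root: Path) -> Path:
    """Write one adapted task directory and return its path."""
    task_dir = output_root / record.task_id
    task_dir.mkdir(parents=True, exist_ok=True)

    # File names below are placeholders for illustration only;
    # mirror the layout used by an existing adapter instead.
    (task_dir / "instruction.txt").write_text(record.instruction)
    (task_dir / "Dockerfile").write_text(record.dockerfile)

    script = task_dir / "run-tests.sh"
    script.write_text(record.test_script)
    script.chmod(0o755)  # the test script must be executable
    return task_dir


def adapt(records, output_root):
    """Adapt every source record and return the created task directories."""
    root = Path(output_root)
    return [write_task(record, root) for record in records]
```

Keeping the per-task conversion in one small function makes it easy to rerun the adapter on a single record when one task fails under the oracle solution.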
Run the adapter
Run your adapter and confirm that the tasks pass with the oracle solutions. Then evaluate the tasks using agents or models similar to those used in the original benchmark paper, and check that the results are similar.
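The two checks above can be sketched as a small script, assuming you have already collected per-task pass/fail results as dictionaries mapping task IDs to booleans. The helper names and the 5-point tolerance are illustrative assumptions, not part of Terminal-Bench; choose a tolerance that suits the variance of the original benchmark.

```python
def pass_rate(results):
    """Fraction of tasks that passed; `results` maps task ID -> bool."""
    if not results:
        raise ValueError("no results to score")
    return sum(results.values()) / len(results)


def failing_oracle_tasks(oracle_results):
    """Task IDs whose oracle solution did not pass (ideally an empty list)."""
    return sorted(task for task, ok in oracle_results.items() if not ok)


def within_tolerance(original_rate, adapted_rate, tolerance=0.05):
    """True if the adapted pass rate is close to the originally reported one.

    The default tolerance is an arbitrary placeholder.
    """
    return abs(original_rate - adapted_rate) <= tolerance


if __name__ == "__main__":
    oracle = {"task-a": True, "task-b": True}
    agent = {"task-a": True, "task-b": False}
    print("oracle failures:", failing_oracle_tasks(oracle))
    print("agent pass rate:", pass_rate(agent))
```

Any task listed by `failing_oracle_tasks` points to a problem in the adaptation itself, so fix those before comparing agent results.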
Create a PR to the registry repo
Once you have run the adapter and ensured the tasks pass with the oracle solutions, fork the registry repo and create a PR with the tasks.
In your PR description, please include the following information:
- The name of the dataset
- The link to the original dataset
- The link to the adapter
- The performance reported in the original benchmark paper
- The performance measured with the Terminal-Bench framework
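As a convenience, the checklist above can be turned into a small helper that refuses to produce a PR description while any item is missing. This is only a sketch: the function name, field labels, and output format are assumptions, and the registry does not require any particular format.

```python
def build_pr_description(name, dataset_url, adapter_url,
                         original_performance, adapted_performance):
    """Return a bulleted PR description covering the five requested items."""
    fields = [
        ("Dataset name", name),
        ("Original dataset", dataset_url),
        ("Adapter", adapter_url),
        ("Performance in the original benchmark paper", original_performance),
        ("Performance in the Terminal-Bench framework", adapted_performance),
    ]

    # Fail early so an incomplete description is never submitted.
    missing = [label for label, value in fields if not str(value).strip()]
    if missing:
        raise ValueError("missing PR fields: " + ", ".join(missing))

    return "\n".join(f"- {label}: {value}" for label, value in fields)
```

Paste the returned text into the PR body and adjust the wording as needed.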
The structure of your dataset folder should be as follows:
Register the dataset in the Terminal-Bench registry
After you create your PR to the registry, add one or more entries to the registry.json file in the root of the Terminal-Bench repository (not the registry repo!).
For now, we recommend just adding the latest version of the dataset rather than a specific version.
Make sure to link your dataset PR in the registry PR description.