Today we’re announcing the Terminal-Bench registry: an easy way to evaluate agents on many agentic benchmarks via the Terminal-Bench framework. For benchmark developers, the registry also offers a unified way to distribute benchmarks to agent developers and run existing agents against them.
Terminal-Bench-Core, our benchmark for evaluating agents in the command line, provides a comprehensive testbed for measuring agents on software engineering, scientific computing, system administration, and more. But benchmarks from other researchers, such as SWE-Bench, AppWorld, and DevEval, help paint a broader picture of agent performance in specific domains.
Until now, evaluating agents on multiple benchmarks has been difficult and time-consuming. Each benchmark comes with its own repository, evaluation harness, dependencies, and Docker images. Testing a new agent or model means cloning each of these repositories and spending hours or days familiarizing yourself with the project's structure and getting it to run.
The Terminal-Bench registry solves this problem. Benchmark developers can now build new benchmarks using—or adapt existing ones to use—the Terminal-Bench harness, which provides logging, live streaming, popular agent integrations, parallelization, distribution via Terminal-Bench CLI, and, coming soon, cloud hosting out of the box.
To accompany the launch of the Terminal-Bench registry, we’ve adapted four popular benchmarks to the Terminal-Bench framework and added them to the registry.
Benchmark | Description |
---|---|
SWE-Bench Verified | A well-known benchmark for evaluating LLMs on real-world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem. |
AppWorld | A suite of natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It focuses on practical, multi-step operations that require interaction with various tools and APIs to achieve a goal. |
DevEval | A benchmark designed to evaluate LLMs across various stages of the repo-level software development lifecycle, including software design, environment setup, implementation, acceptance testing, and unit testing. |
EvoEval | "A program synthesis benchmark suite created by evolving existing benchmarks into different targeted domains for a comprehensive evaluation of LLM coding abilities." |
The Terminal-Bench registry lets agent developers integrate their agent into Terminal-Bench once and then immediately evaluate it on any registered benchmark, or build their own set of domain-specific, in-house evals.
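As a rough sketch of what integrating once and evaluating everywhere could look like, the commands below point the harness at a custom agent and simply swap the dataset between runs. The --agent-import-path flag, the my_agents.cli_agent:MyAgent target, and the appworld dataset name are illustrative assumptions rather than documented interfaces; check the Terminal-Bench docs for the exact custom-agent hook and registered dataset names.
# Hypothetical: evaluate one custom agent on two registered benchmarks.
# --agent-import-path, my_agents.cli_agent:MyAgent, and the appworld dataset
# name are assumptions used for illustration.
tb run \
  --dataset swebench-verified \
  --agent-import-path my_agents.cli_agent:MyAgent \
  --model anthropic/claude-sonnet-4-20250514
tb run \
  --dataset appworld \
  --agent-import-path my_agents.cli_agent:MyAgent \
  --model anthropic/claude-sonnet-4-20250514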
To view the dataset registry, run
tb datasets list # Make sure to pip install terminal-bench first
Then, for example, to run django__django-13658 from SWE-Bench Verified with Terminus and Claude 4 Sonnet, run
tb run \
  --dataset swebench-verified \
  --task-id django__django-13658 \
  --agent terminus \
  --model anthropic/claude-sonnet-4-20250514
which kicks off the run and logs the full agent session.
We plan to regularly add more adapters for frontier benchmarks (suggestions are welcome!) and encourage the community to take advantage of the registry to develop and distribute new benchmarks.
Validating Benchmark Adaptations
Terminal-Bench adapters restructure existing benchmarks to use the Terminal-Bench task framework but do not alter the task contents. From the agent's perspective, a task in the adapted benchmark presents the same instruction and environment as the original, and completion is determined by the same set of unit tests.
To ensure each benchmark has been adapted correctly, we run parity experiments with the same agent, prompt, and model under both the benchmark's original harness and the Terminal-Bench harness.
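Concretely, the Terminal-Bench side of a parity experiment is a dataset-level run with a pinned agent and model, whose resolved rate is then compared against the same agent driven by the benchmark's original harness. The invocation below is a sketch: it reuses the flags shown earlier, omits --task-id to cover the full dataset, and assumes the Opus model identifier follows the same naming scheme as the Sonnet example above.
# Terminal-Bench side of a parity run: evaluate the adapted benchmark with a
# fixed agent and model, then compare resolved rates against the same agent
# run under the benchmark's original harness.
tb run \
  --dataset swebench-verified \
  --agent terminus \
  --model anthropic/claude-opus-4-20250514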
Resolved rates under the Terminal-Bench adapters are comparable to those under the original harnesses:
Agent | Model | Original SWE-Bench Grading Script* | Terminal-Bench Adapter |
---|---|---|---|
Terminus | claude-4-opus | 66% | 66% |

Agent | Model | Original AppWorld** | Terminal-Bench Adapter |
---|---|---|---|
Claude Code | claude-4-opus | 52.1% ± 1.8% | 52.1% ± 1.8% |

Agent | Model | Original DevEval | Terminal-Bench Adapter |
---|---|---|---|
Claude Code | claude-4-opus | 22.8% ± 4.3% | 21.6% ± 3.3% |
Codex | o4-mini | 22.4% ± 4.4% | 23.6% ± 2.3% |

Agent | Model | Original EvoEval | Terminal-Bench Adapter |
---|---|---|---|
Claude Code | claude-4-opus | 65.8% ± 1.7% | 66.4% ± 1.4% |
Codex | o4-mini | 67.1% ± 1.1% | 66.4% ± 2.0% |
Terminal-Bench will continue to require parity experiments for adapted benchmarks so agent developers can evaluate on the benchmarks using the Terminal-Bench harness with full confidence.
*Note that when SWE-bench was released, the task was to generate a patch file given context from the repo in a single model call. There is no first-party evaluation harness, but rather a script to grade patch files. To ensure our adaptation is correct, we ran Terminus using the Terminal-Bench harness, extracted the patches, and evaluated them using the SWE-bench grading script, which produced the same results as our harness.
** After its initial publication, AppWorld released a CLI for terminal-based agents to interact with the environment. With the support of the AppWorld authors, we extracted agent actions on this CLI from our harness runs and evaluated them using the original AppWorld grading script, which produced the same results as our harness.
For Benchmark Developers
The Terminal-Bench framework supports text-based agent benchmarks that can be represented as an instruction, a Docker environment, and a test script.
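To make that structure concrete, the commands below scaffold the three pieces a task boils down to. The file names are illustrative assumptions rather than the exact on-disk schema; the `tb tasks create` command described below is the intended way to create a real task.
# Hypothetical skeleton of a single task -- file names are illustrative
# assumptions, not the exact schema the framework generates.
mkdir -p my-benchmark/example-task/tests
cd my-benchmark/example-task
touch task.yaml              # the natural-language instruction and metadata
touch Dockerfile             # the container environment the agent works in
touch tests/test_outputs.py  # the test script that decides task success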
Interested in making it easier for users to run your benchmark? Consider building an adapter.
Building a new benchmark? Use the Terminal-Bench framework to build (`tb tasks create`), quality check (`tb tasks check`), run (`tb run`), and register your benchmark to take advantage of the harness and logging infra so you can focus on building tasks and running experiments.
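A minimal sketch of that workflow using the subcommands above; the task id and any arguments beyond the bare subcommands are assumptions for illustration, so defer to the CLI help for the exact flags.
pip install terminal-bench   # the Terminal-Bench CLI ships with the package

# Scaffold a new task.
tb tasks create

# Quality-check the task you just wrote (task id and argument form are hypothetical).
tb tasks check my-new-task

# Smoke-test it end to end with an existing agent integration
# (additional flags may be needed to point tb run at a local, unregistered task).
tb run \
  --task-id my-new-task \
  --agent terminus \
  --model anthropic/claude-sonnet-4-20250514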
Planned Adapters
We plan to adapt the following benchmarks into the Terminal-Bench framework: MLE-Bench, SWE-Lancer, RE-Bench, BIX-Bench, The Agent Company, Research Bench, Cybench, and AlgoTune.
If you have a Terminal-Bench compatible benchmark you'd like to see adapted, please open an issue or let us know on Discord.