Mon May 19 2025 • Release

Introducing Terminal-Bench

An evaluation framework and benchmark to quantify agents' ability to complete complex tasks in the terminal.

Today we are releasing the first version of Terminal-Bench, a benchmark to quantify how well AI agents perform complex tasks in the terminal.

Terminal environments such as the Linux command line have decades of history in computing, and the terminal has quickly become a standard tool for both coding agents and general-purpose AI systems. The terminal offers AI agents a versatile, text-based interface for using computers, which suits current agents particularly well because the language models they rely on perform best on text.

However, current agents still fail to exploit the full power of the terminal in ways we outline in this post.

Terminal-Bench provides agent developers with tools and data to evaluate their agent's ability to use the terminal. We hope Terminal-Bench will serve as a compass towards increasingly better terminal usage by agents, and we are excited to further develop Terminal-Bench with the open source community.

Task Resolution Rate on T-Bench-Core-v0

Task resolution rate for various agent-model combinations. Each agent was given exactly one attempt to resolve each task in the T-Bench-Core-v0 dataset. For most agents, the resolution rate was averaged across multiple runs; error bars indicate statistical uncertainty.

Please reach out to us on Discord to get your agent added to the leaderboard.

Motivation

Many popular AI agents (Cursor, Manus, Devin, Goose, OpenHands, Codex) use the terminal to perform actions. This is not surprising, since the terminal is:

  • Text-based. It natively matches the primary training modality of language models.
  • Powerful. Terminal interfaces provide fine-grained control over underlying resources with flexible, (usually) well-documented utilities.
  • Easily sandboxed. Terminals can be deployed in containerized environments to safely limit access to sensitive resources.
  • Useful in the real world. Consider the legendary software engineer who never leaves vim (sorry, emacs).

However, despite their potential, agents in the terminal still frequently fail. They struggle to chain together multiple actions, reason over long contexts, act independently within their limits, and safely execute sensitive tasks without human supervision.

Building better agents for the terminal is a major research challenge. Terminal-Bench provides a starting point to address these limitations by providing a high-quality benchmark and evaluation framework for capable and safe agents in the terminal.

The Benchmark (Terminal-Bench-Core)

Terminal-Bench-Core is a growing dataset of challenging, hand-crafted, and human-verified tasks for agents in the terminal. Each task comes with a dedicated Docker environment, human-verified solution, and set of test cases to check the agent's solution. As of launch we've created 80 tasks, and we're adding more every week. In the coming weeks and months we will launch several hundred more tasks.

Terminal-Bench-Core tasks cover diverse behaviors in the terminal including scientific workflows, configuring networks, playing games, data analysis, calling APIs, and addressing cybersecurity vulnerabilities. Our first release is Terminal-Bench-Core-v0, which is a set of 80 tasks designed to cover a variety of terminal use cases. Browse all of our tasks here. The general structure of a Terminal-Bench-Core task is shown below.

Dockerfile (or docker-compose.yaml)
task.yaml
solution.sh (or solution.yaml)
run-tests.sh
test_outputs.py
other dependencies...
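
To make the layout concrete, below is a minimal sketch of what a test_outputs.py might contain for a simple task. The specific checks (an output file at /app/output.txt containing "hello world") are hypothetical and for illustration only; they are not taken from an actual Terminal-Bench-Core task.

# test_outputs.py -- illustrative sketch; the path and expected contents below
# are hypothetical, not from a real Terminal-Bench-Core task.
from pathlib import Path

OUTPUT_FILE = Path("/app/output.txt")  # hypothetical file the agent must create


def test_output_file_exists():
    # The agent's work is verified purely by inspecting the final container state.
    assert OUTPUT_FILE.exists(), "agent never wrote the expected output file"


def test_output_file_contents():
    # Hypothetical check: the task asked the agent to write 'hello world'.
    assert OUTPUT_FILE.read_text().strip() == "hello world"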

Task Examples

New Encrypt Command

In this (easy) task the agent is tested on its ability to use new tools. It's provided with a custom encryption CLI utility and must use the provided manual to understand how to accomplish its task. The agent successfully uses the --help flag to learn the tool's API and achieves its goal.

View the task details. Thank you to Jeffrey Li for contributing this task.

Build Linux Kernel with QEMU

Here, Terminus (with GPT-4.1) is asked to build a Linux kernel from source. It decides on its own to install and use nano for the task, but ultimately fails to drive the interactive session properly and gives up rather than starting over with a different tool.

View the task details. Thank you to Nicholas Carlini for contributing this task.

Raman Fitting

In this task, Terminus is asked to carry out a standard scientific computing task. The agent needs to fit two known peaks in a Raman spectrum (a light intensity spectrum), but the values in the file are in wavelengths (nm) rather than the usual wavenumbers (cm⁻¹). Despite this, the agent still tries to fit the peaks at the expected positions for wavenumbers. The fit succeeds, but the numbers returned are nonsensical. The agent asserts that it is performing a sanity check on the results, but it fails to recognize the obviously incorrect fit parameters, such as negative amplitudes or extremely large peak widths.
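
For readers unfamiliar with the unit mismatch, the sketch below shows the conversion the agent should have applied before fitting. The 532 nm excitation line and the example data point are assumptions made for illustration; the task's actual spectrum may use a different laser wavelength.

# Convert an absolute wavelength (nm) to a Raman shift (cm^-1).
# The 532 nm excitation line is an assumed value for illustration.
EXCITATION_NM = 532.0


def wavelength_to_raman_shift(wavelength_nm: float) -> float:
    # Raman shift = 1/lambda_excitation - 1/lambda_scattered, expressed in cm^-1.
    # The factor 1e7 converts 1/nm to 1/cm.
    return 1e7 * (1.0 / EXCITATION_NM - 1.0 / wavelength_nm)


# Example: a peak observed at 580 nm corresponds to a shift of about 1556 cm^-1.
print(round(wavelength_to_raman_shift(580.0), 1))  # 1555.6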

View the task details. Thank you to Jan-Lucas Uslu for contributing this task.

How to Contribute

Do you have a task you'd like to evaluate agents on? If so, join our community to contribute to Terminal-Bench Core. We're actively looking for collaborators to help us establish the state of the art in terminal-use agents. Check out our quickstart guide for more.

The Evaluation Framework (Terminal-Bench Harness)

We wanted our infrastructure to make it easy to:

  • specify new instances
  • test and understand existing agents' performance on these tasks
  • manage flexible, complex task environments and lifecycles
  • someday train agents within Terminal-Bench

We also wanted agents to run sandboxed by default, given their current unreliability. For these reasons, the Terminal-Bench harness handles orchestrating agents, spinning up multi-container Docker environments, logging agent actions, and verifying container state.
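
At a high level, the per-task lifecycle the harness manages looks roughly like the sketch below. The compose commands are standard Docker tooling, but the "client" service name and the function itself are hypothetical stand-ins for illustration, not the harness's actual code or API.

# Hypothetical sketch of the per-task lifecycle the harness manages; the
# "client" service name and this function are illustrative, not the real API.
import subprocess
from pathlib import Path


def run_task(task_dir: Path) -> bool:
    compose = task_dir / "docker-compose.yaml"
    # 1. Spin up the task's (possibly multi-container) environment.
    subprocess.run(["docker", "compose", "-f", str(compose), "up", "-d", "--build"], check=True)
    try:
        # 2. Hand control to the agent inside the sandbox; every action is logged.
        #    (agent orchestration omitted here)
        # 3. Verify the final container state with the task's test script.
        tests = subprocess.run(
            ["docker", "compose", "-f", str(compose), "exec", "client", "bash", "run-tests.sh"]
        )
        return tests.returncode == 0
    finally:
        # 4. Tear everything down so each attempt starts from a clean environment.
        subprocess.run(["docker", "compose", "-f", str(compose), "down", "-v"], check=True)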

Terminus

Since some terminal agents do not support arbitrary language models (e.g. Claude Code), we designed our own agent, Terminus, to serve as a neutral test-bed. Terminus is intentionally simple so as not to bias performance towards one language model or another: it provides no tools other than a tmux pane that the language model interacts with by sending keystrokes.
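
The sketch below shows the kind of tmux loop this implies, using tmux's standard new-session, send-keys, and capture-pane commands. The session name and the hard-coded command are illustrative; the real Terminus feeds the captured pane text into a language-model prompt loop to choose the next keystrokes.

# Minimal sketch of driving a tmux pane the way a Terminus-style agent does.
# The session name and command below are illustrative placeholders.
import subprocess
import time

SESSION = "agent-demo"  # hypothetical session name

# Start a detached tmux session for the agent to type into.
subprocess.run(["tmux", "new-session", "-d", "-s", SESSION], check=True)

# "Act": send keystrokes, i.e. type a command and press Enter.
subprocess.run(["tmux", "send-keys", "-t", SESSION, "ls -la /", "Enter"], check=True)
time.sleep(1)  # crude wait; a real loop would poll until the pane settles

# "Observe": capture the visible pane text to feed back to the model.
pane = subprocess.run(
    ["tmux", "capture-pane", "-p", "-t", SESSION],
    capture_output=True, text=True, check=True,
)
print(pane.stdout)

subprocess.run(["tmux", "kill-session", "-t", SESSION], check=True)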

Read more about Terminus here.

Other Agents

The harness supports three forms of agent integration to make it as easy as possible to run evaluations:

  1. Container installation (quickest). We support installing agents directly into the task environment. Note that an agent installed this way will fail on tasks whose realistic environments are not amenable to installation.

  2. Direct integrations (most robust). Any agent with a Python interface (such as Terminus) can be directly integrated into the Terminal-Bench harness to gain complete access to our logging and API-calling infrastructure. Agents integrated this way can succeed in realistic task environments where installing the agent would be challenging (e.g. antiquated Python installations or broken networking); a rough sketch of this integration shape appears after this list.

  3. MCP server. The Terminal-Bench harness is available via an MCP server which exposes a tmux session to the agent under evaluation. This method allows easy integration of MCP clients (like Goose) but limits the agent to tools defined by Terminal-Bench. Supporting arbitrary tools over MCP is part of our roadmap; stay tuned for updates.
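
As a rough illustration of option 2, a directly integrated agent might look something like the sketch below. The class shape, method name, and terminal handle are hypothetical placeholders rather than the harness's actual interface; see our docs for the real integration points.

# Hypothetical shape of a directly integrated agent. The method name and the
# terminal handle's API are placeholders, not the real Terminal-Bench interface.
class MyAgent:
    """An agent with a Python interface that the harness can drive directly."""

    def perform_task(self, instruction: str, terminal) -> None:
        # The harness hands the agent the task instruction and a handle to the
        # task's terminal session; the agent decides what to type.
        terminal.send_keys("echo 'hello from MyAgent'", enter=True)
        observation = terminal.capture()
        # ...feed `observation` to a language model and loop until done...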

For Agent Developers

We are happy to work with you to find the best way to evaluate on Terminal-Bench. Drop us a message in our Discord or raise an issue on GitHub to learn more, or check out our docs.

Terminal-Bench Integrations

We love agent evaluations so much that we couldn't stop at Terminal-Bench-Core. Our harness can act as an interface to other agent benchmarks, making it trivial to evaluate on arbitrary computer-use tasks. To illustrate this, we have already ported the AppWorld benchmark into Terminal-Bench and will add SWE-Bench (and many others) soon.
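
In practice, porting a benchmark mostly means mapping each upstream task onto the task layout shown earlier (environment, instruction, and verification). The outline below is a hypothetical adapter sketch; the upstream field names, the task.yaml contents, and the checker path are invented for illustration and do not describe the actual AppWorld port.

# Hypothetical outline of a benchmark adapter: emit a Terminal-Bench-style task
# directory for each upstream task. All field names and paths here are invented.
from pathlib import Path


def export_task(upstream_task: dict, out_dir: Path) -> None:
    task_dir = out_dir / upstream_task["id"]
    task_dir.mkdir(parents=True, exist_ok=True)
    # Environment: reuse the upstream benchmark's image (or write a fuller Dockerfile).
    (task_dir / "Dockerfile").write_text(f"FROM {upstream_task['image']}\n")
    # Instruction shown to the agent.
    (task_dir / "task.yaml").write_text(
        "instruction: |\n  " + upstream_task["instruction"].replace("\n", "\n  ") + "\n"
    )
    # Verification: delegate to the upstream benchmark's own checker.
    (task_dir / "run-tests.sh").write_text("#!/bin/bash\npython /checker/evaluate.py\n")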

For Evaluation Researchers

We'd love to help you port your benchmark to Terminal-Bench. Visit our Discord to chat with us about the best way to make your benchmark available to agent developers.

Future Possibilities and Roadmap

We're just getting started. Over the next few months we plan to:

  • Support massively parallel agent evaluations in the cloud
  • Build training infrastructure to support RL and rollout generation within Terminal-Bench
  • Task subsets around specific themes, including:
    • Environments where installation is hard (remote environments)
    • Interactive tasks (e.g. games, vim)
    • Web development
    • Personal assistant tasks
  • Improved performance for installed agents
  • Support (V)LM-as-a-judge
  • Adapters to support evaluating other benchmarks through our harness:
    • AppWorld (finished)
    • SWE-Bench (in progress)
    • Dev-Bench (in progress)
    • MLE-Bench
    • SWE-Lancer
    • RE-Bench
  • Continue to expand Terminal-Bench Core
  • Develop a secret test set of diverse tasks

Want to help? Join our Discord 😈