Introduction
A benchmark to measure the capabilities of AI agents in a terminal environment.
Terminal-Bench is a framework and set of tasks for evaluating how well AI agents can accomplish complex tasks in a terminal environment.
Examples of tasks include:
- Compiling and packaging a code repository
- Downloading a dataset and training a classifier on it
- Setting up a server
Terminal-Bench consists of two parts: a dataset of tasks and an execution harness.
Dataset of tasks
Each task in Terminal-Bench includes:
- a description in English
- a Docker environment
- a test script to verify whether the agent completed the task successfully
- a reference ("oracle") solution that solves the task
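To make the four components concrete, here is a minimal sketch of how one task could be modeled in Python. It is illustrative only: the TaskSpec class, its field names, and the missing_parts helper are hypothetical and are not Terminal-Bench's actual task schema or API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical in-memory view of one task (not the real schema)."""

    description: str      # English instruction given to the agent
    docker_image: str     # environment the agent works in
    test_script: str      # script that checks whether the task was completed
    oracle_solution: str  # reference solution expected to pass the test script


def missing_parts(task: TaskSpec) -> list[str]:
    """Return the names of any components that are empty or whitespace-only."""
    return [name for name, value in vars(task).items() if not value.strip()]
```

A check like missing_parts could be used to flag an incomplete task before it is submitted, for example one that has a description and tests but no oracle solution yet.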
Check out our existing tasks here.
Want to help?
Terminal-Bench is an active open-source project. We're always looking for new tasks! If you have a task you'd like to see in Terminal-Bench, please follow our quickstart guide to add it, or join our Discord server to chat with us.
Execution harness
The execution harness connects a language model to a sandboxed terminal environment, sets up the task environments, and evaluates the agent's ability to complete the tasks.
After installing the package, you can run the harness via tb run.
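The harness's three responsibilities (set up the environment, let the agent act, evaluate the result) can be sketched as a simple loop. This is a conceptual sketch under stated assumptions, not the real harness: the Agent and Sandbox callables, run_harness, and accuracy are hypothetical names, and plain Python functions stand in for the language model and the sandboxed terminal.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins: an agent maps a task description to shell commands,
# and a sandbox maps those commands to a pass/fail result from the task's tests.
Agent = Callable[[str], List[str]]
Sandbox = Callable[[List[str]], bool]


def run_harness(
    tasks: Dict[str, str],
    agent: Agent,
    make_sandbox: Callable[[str], Sandbox],
) -> Dict[str, bool]:
    """Run every task once and record whether its tests passed."""
    results: Dict[str, bool] = {}
    for task_id, description in tasks.items():
        sandbox = make_sandbox(task_id)       # set up the task environment
        commands = agent(description)         # let the agent act on the task
        results[task_id] = sandbox(commands)  # evaluate with the task's tests
    return results


def accuracy(results: Dict[str, bool]) -> float:
    """Fraction of tasks whose tests passed (0.0 when there are no tasks)."""
    return sum(results.values()) / len(results) if results else 0.0
```

Keeping the agent and the sandbox behind plain callables means either one can be swapped out independently, which mirrors the separation between the dataset of tasks and the execution harness described above.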