terminal-bench

Introduction

A benchmark to measure the capabilities of AI agents in a terminal environment.

#####################################################################
#  _____                   _             _     ______________       #
# |_   _|__ _ __ _ __ ___ (_)_ __   __ _| |   ||            ||      #
#   | |/ _ \ '__| '_ ` _ \| | '_ \ / _` | |   || >          ||      #
#   | |  __/ |  | | | | | | | | | | (_| | |   ||            ||      #
#   |_|\___|_|  |_| |_| |_|_|_| |_|\__,_|_|   ||____________||      #
#   ____                  _                   |______________|      #
#  | __ )  ___ _ __   ___| |__                 \\############\\     #
#  |  _ \ / _ \ '_ \ / __| '_ \                 \\############\\    # 
#  | |_) |  __/ | | | (__| | | |                 \      ____    \   #
#  |____/ \___|_| |_|\___|_| |_|                  \_____\___\____\  #
#                                                                   #
#####################################################################

Terminal-Bench is a framework and set of tasks for evaluating how well AI agents can accomplish complex tasks in a terminal environment.

Examples of tasks include:

  • Compiling and packaging a code repository
  • Downloading a dataset and training a classifier on it
  • Setting up a server

Terminal-Bench consists of two parts: a dataset of tasks and an execution harness.

Dataset of tasks

Each task in Terminal-Bench includes:

  • a description in English
  • a Docker environment
  • a test script that verifies whether the agent completed the task successfully
  • a reference ("oracle") solution that solves the task
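
To make the four parts concrete, here is a minimal Python sketch of a task as a plain record. It is an illustration only: the class name TaskSketch, its field names, and the example paths are hypothetical and do not reflect how Terminal-Bench actually stores tasks on disk.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskSketch:
    """Hypothetical record holding the four parts of a task."""

    description: str         # English instruction given to the agent
    docker_environment: str  # reference to the Docker environment (assumed: a path)
    test_script: str         # script that verifies whether the task was completed
    oracle_solution: str     # reference ("oracle") solution

    def missing_parts(self) -> list[str]:
        """Return the names of any parts that were left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Example with made-up values; none of these paths come from the project.
example = TaskSketch(
    description="Set up a server that responds on port 8000.",
    docker_environment="example-task/Dockerfile",
    test_script="example-task/run-tests.sh",
    oracle_solution="",
)
print(example.missing_parts())
```

A record like this makes it easy to check that a new task is complete before submitting it; here the empty oracle solution is the only part reported as missing.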

Check out our existing tasks here.

Want to help?

Terminal-Bench is an active open-source project, and we're always looking for new tasks! If you have a task you'd like to see in Terminal-Bench, please follow our quickstart guide to add it, or join our Discord server to chat with us.

Execution harness

The execution harness connects a language model to a sandboxed terminal environment, sets up the task environments, and evaluates the agent's ability to complete the tasks.
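
The three responsibilities described above (set up the environment, let the agent act, evaluate the result) can be sketched as a small Python function. This is a toy model under stated assumptions, not the harness's real implementation: run_task_sketch and the toy_* helpers are hypothetical names, and a plain dictionary stands in for the sandboxed terminal so the example runs without Docker or a language model.

```python
from typing import Callable, Dict

# Assumption: a dict mapping file names to contents stands in for the sandbox.
Environment = Dict[str, str]


def run_task_sketch(
    setup_environment: Callable[[], Environment],
    run_agent: Callable[[str, Environment], None],
    run_tests: Callable[[Environment], bool],
    instruction: str,
) -> bool:
    """Set up the environment, let the agent act, then evaluate the outcome."""
    env = setup_environment()    # 1. prepare the task environment
    run_agent(instruction, env)  # 2. the agent works inside the environment
    return run_tests(env)        # 3. the test decides success or failure


def toy_setup() -> Environment:
    return {}


def toy_agent(instruction: str, env: Environment) -> None:
    # Stand-in for a model-driven agent: it simply "creates" the requested file.
    env["hello.txt"] = "hello"


def toy_tests(env: Environment) -> bool:
    return env.get("hello.txt") == "hello"


passed = run_task_sketch(
    toy_setup, toy_agent, toy_tests, "Create hello.txt containing hello"
)
print("passed" if passed else "failed")
```

Keeping the three steps as separate callables mirrors the separation in the description: the same loop can be reused with a different agent or a different task without changing the evaluation logic.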

After installing the package, you can run the harness with the tb run command.
