
How to run Terminal-Bench 2.1

Terminal-Bench 2.1 runs through Harbor, using the terminal-bench/terminal-bench-2-1 dataset.

Prerequisites

Install Harbor with uv, confirm that the harbor CLI is available, and make sure Docker is running.

uv tool install harbor
harbor --help
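
The two commands above do not check Docker itself. As a rough sketch, a preflight check could look like the following. The helper names require_cmd and preflight are hypothetical and not part of Harbor; docker info is used here only as a common way to test whether the Docker daemon is reachable.

```shell
# Hypothetical helpers, not part of Harbor.

# Succeed only if the named command is on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "missing command: $1" >&2
    return 1
  }
}

# Check that both CLIs exist and that the Docker daemon answers.
preflight() {
  require_cmd docker || return 1
  require_cmd harbor || return 1
  docker info >/dev/null 2>&1 || {
    echo "Docker is installed but the daemon is not reachable" >&2
    return 1
  }
  echo "preflight ok"
}

# Usage: preflight && echo "ready to run Harbor"
```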

Smoke test

Run the oracle solutions on the first five tasks. This confirms that Harbor can download the dataset and start the task containers.

harbor run -d terminal-bench/terminal-bench-2-1 -a oracle -l 5

Run with an agent

Select a built-in agent with -a and a model with -m.

harbor run \
  -d terminal-bench/terminal-bench-2-1 \
  -a claude-code \
  -m anthropic/claude-opus-4-1 \
  -k 5

To scale out on a cloud sandbox, export the provider API keys and pass --env with the provider name (daytona in this example).

export DAYTONA_API_KEY="<your-daytona-api-key>"
export ANTHROPIC_API_KEY="<your-anthropic-api-key>"

harbor run \
  -d terminal-bench/terminal-bench-2-1 \
  -a claude-code \
  -m anthropic/claude-opus-4-1 \
  --env daytona \
  -n 32 \
  -k 5
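
Since the cloud run depends on both keys being exported, a small guard can fail fast before any sandbox is started. This is only a sketch: require_env is a hypothetical helper, not a Harbor feature, and it checks only that the variables are non-empty, not that the keys are valid.

```shell
# Hypothetical helper, not part of Harbor.
# Return non-zero and name the first variable that is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirect lookup that works in POSIX sh; names come from this script only.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Usage: run the check first, then the harbor command shown above.
#   require_env DAYTONA_API_KEY ANTHROPIC_API_KEY && harbor run ...
```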

Custom agents

Pass an import path instead of a built-in agent name.

harbor run \
  -d terminal-bench/terminal-bench-2-1 \
  --agent-import-path "path.to.agent:SomeAgent" \
  -k 5

To run a single task from the dataset, add --include-task-name with the name of that task.

harbor run \
  -d terminal-bench/terminal-bench-2-1 \
  -a claude-code \
  -m anthropic/claude-opus-4-1 \
  --include-task-name "<task-name>"
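
To run a handful of specific tasks, one conservative option is to repeat the single-task command once per task name. The sketch below assumes nothing beyond the flags shown above; run_tasks is a hypothetical wrapper, and the task names passed to it are placeholders you would replace with real names from the dataset.

```shell
# Hypothetical wrapper, not part of Harbor.
# Runs the single-task command once for each task name given as an argument.
run_tasks() {
  for task in "$@"; do
    harbor run \
      -d terminal-bench/terminal-bench-2-1 \
      -a claude-code \
      -m anthropic/claude-opus-4-1 \
      --include-task-name "$task" \
      || echo "run failed for task: $task" >&2
  done
}

# Usage (placeholder names):
#   run_tasks "<task-name-1>" "<task-name-2>"
```

Each task runs as its own sequential job here, so a failure in one task is reported and the loop moves on to the next.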

Submit to the leaderboard

A new leaderboard submission process is coming soon.

Results

View the Terminal-Bench 2.1 leaderboard, browse the Terminal-Bench 2.1 benchmark, or open the dataset on Harbor Hub.