How to run Terminal-Bench 2.0

Run Terminal-Bench 2.0 with Harbor.

Terminal-Bench evaluates agents on terminal-based tasks. Harbor is the official harness for Terminal-Bench 2.0.

Prerequisites

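Install Harbor with uv and confirm that the CLI is available. Docker must be running for local runs, because Harbor starts each task in a container.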

uv tool install harbor
harbor --help
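
Before a long run, it can help to fail fast when a prerequisite is missing. The following is a minimal sketch, not part of Harbor: require_cmd and preflight are hypothetical helper names, and the daemon check assumes the standard Docker CLI, where docker info exits non-zero if the daemon is unreachable.

```shell
#!/bin/sh
# Hypothetical preflight helpers; these are not part of Harbor.

# Succeed only if the named command is on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "missing command: $1" >&2
    return 1
  }
}

# Check that both CLIs are installed, then ask the Docker daemon for
# its status to confirm it is actually running.
preflight() {
  require_cmd harbor || return 1
  require_cmd docker || return 1
  docker info >/dev/null 2>&1 || {
    echo "Docker is installed but the daemon is not reachable" >&2
    return 1
  }
}
```

Calling preflight before harbor run stops early with a message on stderr if either check fails.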

Smoke test

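Run the oracle agent (-a oracle), which applies each task's reference solution, on the first five tasks (-l 5). This confirms that Harbor can download the dataset and start the task containers before you spend model credits on a real agent.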

harbor run -d terminal-bench/terminal-bench-2 -a oracle -l 5

Run with an agent

Select a built-in agent with -a and a model with -m. The -k flag sets the number of attempts per task. The claude-code agent expects ANTHROPIC_API_KEY to be exported in your shell.

harbor run \
  -d terminal-bench/terminal-bench-2 \
  -a claude-code \
  -m anthropic/claude-haiku-4-5 \
  -k 5

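To scale beyond a local Docker host, run the tasks in a cloud sandbox: export the API keys for the sandbox provider and the model provider, then pass --env with the provider name. The example below uses Daytona and raises concurrency with -n 32.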

export DAYTONA_API_KEY="<your-daytona-api-key>"
export ANTHROPIC_API_KEY="<your-anthropic-api-key>"

harbor run \
  -d terminal-bench/terminal-bench-2 \
  -a claude-code \
  -m anthropic/claude-haiku-4-5 \
  --env daytona \
  -n 32 \
  -k 5
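
A small guard can confirm that both keys are exported before the run starts. This is a sketch under assumptions: require_env is a hypothetical helper, not a Harbor feature, and it relies on printenv, so it only sees exported variables.

```shell
#!/bin/sh
# Hypothetical helper; this is not part of Harbor.
# Succeeds only if every named variable is exported and non-empty.
require_env() {
  missing=0
  for name in "$@"; do
    # printenv reports only exported variables, which are the ones a
    # child process such as harbor would inherit.
    if [ -z "$(printenv "$name" 2>/dev/null)" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, running require_env DAYTONA_API_KEY ANTHROPIC_API_KEY before the Daytona command above returns non-zero and names whichever key is missing.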

Custom agents

Pass an import path instead of a built-in agent name.

harbor run \
  -d terminal-bench/terminal-bench-2 \
  --agent-import-path "path.to.agent:SomeAgent" \
  -k 5

Run a single task

To run one task from the dataset, add --include-task-name with the task's name.

harbor run \
  -d terminal-bench/terminal-bench-2 \
  -a claude-code \
  -m anthropic/claude-haiku-4-5 \
  --include-task-name "<task-name>"

Submit to the leaderboard

A new leaderboard submission process is coming soon.

Results

View the Terminal-Bench 2.0 leaderboard or browse the Terminal-Bench 2.0 benchmark.