How to run Terminal-Bench 2.0
Run Terminal-Bench 2.0 with Harbor.
Terminal-Bench evaluates agents on terminal-based tasks. Harbor is the official harness for Terminal-Bench 2.0.
Prerequisites
Install Harbor and make sure Docker is running.
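One way to confirm the Docker part before going further is to ask the daemon for its status. The sketch below is not part of Harbor: `have_cmd` and `docker_ready` are hypothetical helper names, and the check assumes that a successful `docker info` means the daemon is reachable.

```shell
# Hypothetical helper: return 0 if the named command exists on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: return 0 only if the docker CLI is installed
# and the daemon answers `docker info`.
docker_ready() {
  have_cmd docker && docker info >/dev/null 2>&1
}

if docker_ready; then
  echo "Docker is running"
else
  echo "Docker is not reachable; start it before running Harbor" >&2
fi
```

With Docker reachable, install Harbor and confirm that the CLI responds: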
uv tool install harbor
harbor --help
Smoke test
Run the oracle solutions on the first five tasks. This confirms that Harbor can download the dataset and start the task containers.
harbor run -d terminal-bench/terminal-bench-2 -a oracle -l 5
Run with an agent
harbor run \
-d terminal-bench/terminal-bench-2 \
-a claude-code \
-m anthropic/claude-haiku-4-5 \
-k 5
To scale with a cloud sandbox, set the provider keys and pass --env.
export DAYTONA_API_KEY="<your-daytona-api-key>"
export ANTHROPIC_API_KEY="<your-anthropic-api-key>"
harbor run \
-d terminal-bench/terminal-bench-2 \
-a claude-code \
-m anthropic/claude-haiku-4-5 \
--env daytona \
-n 32 \
-k 5
Custom agents
Pass an import path instead of a built-in agent name.
harbor run \
-d terminal-bench/terminal-bench-2 \
--agent-import-path "path.to.agent:SomeAgent" \
-k 5
To run one task from the dataset, add --include-task-name.
harbor run \
-d terminal-bench/terminal-bench-2 \
-a claude-code \
-m anthropic/claude-haiku-4-5 \
--include-task-name "<task-name>"
Submit to the leaderboard
A new leaderboard submission process is coming soon.
Results
View the Terminal-Bench 2.0 leaderboard or browse the Terminal-Bench 2.0 benchmark.